代码生成并非生产力
Codegen is not productivity

原始链接: https://www.antifound.com/posts/codegen-is-not-productivity/

## 生成式人工智能与生产力幻觉 围绕生成式人工智能,特别是大型语言模型(LLM)的炒作,常常集中在它们能够产生的代码数量上。然而,作者认为,庆祝代码行数——无论是人工编写还是人工智能生成——都是衡量程序员生产力的一种根本性错误。 历史上,软件工程专家一直告诫人们不要将代码长度与价值划等号,认识到编程主要在于管理复杂性和表达抽象思想,而不仅仅是为机器生成指令。 LLM 能够加速*编写*代码,但它们无法解决软件开发的核心挑战:设计、规划和理解问题领域。 事实上,代码生成的便捷性可能会适得其反,导致过早实现、增加维护负担(更多代码意味着更多需要维护的内容),以及倾向于重建现有解决方案而不是利用已建立的库。 作者强调,LLM 经常生成需要大量改进的代码,其中充斥着诸如抽象性差、风格不一致以及未能利用现有代码库等问题。 最终,编程仍然是一个协作、由人类驱动的过程,将代码生成速度置于可读性、可维护性和周密设计之上,是对整个软件生命周期的损害。 关键不是*更多*代码,而是*更好*代码,这需要人类专业知识,而不仅仅是人工智能的输出。

## 代码生成与生产力:一则黑客新闻讨论总结 一则黑客新闻讨论引发了关于人工智能代码生成工具是否真正提高生产力的争论。核心观点源于一篇链接文章,认为编写代码并不总是软件开发的主要瓶颈——设计、集成、测试和协作往往耗费更多时间。 许多评论者同意这取决于项目,人工智能可以加速小型、独立项目,在这些项目中编码*是*主要障碍。然而,对于大型项目,更快的代码生成可能会引入更多复杂性并阻碍团队协作,从而可能*减慢*开发速度。 人们对人工智能的炒作及其对就业的潜在影响表示担忧,认为这种威胁更多来自管理趋势而非技术本身。多位用户强调了人类协作的重要性,认为人工智能目前在这一软件工程的关键方面存在困难。另一些人则反驳说,人工智能通过解放开发者,使其摆脱低级决策,从而*增强*协作,让他们专注于更高级别的战略。最终,讨论指出“生产力”是一个可操纵的指标,并且个人体验差异很大。
相关文章

原文

There is a whole lot to say about generative AI. LLMs generate a bunch of code, this much is certainly true. Should we celebrate that? There is a long tradition of trying to measure software development output, and most of it tells us that lines of code is a poor metric of programmer productivity. I have some thoughts.

I have seen many people talk about the productivity they get from LLMs in terms of the code it generates for them. I have seen claims of 10,000 lines of code in a day or hundreds of thousands of lines in a week; these often seem like brags or at least they are presented positively. I do not believe that LLMs and generative AI change anything fundamental about using lines of code as a measure of output or productivity.

This is a rant. This is what I think about when I hear people talking about lines of code, whether generated by an LLM or pouring forth from human hands. I do not think anyone should celebrate code output.

From the preface to the first edition of SICP:

First, we want to establish the idea that a computer language is not just a way of getting a computer to perform operations but rather that it is a novel formal medium for expressing ideas about methodology. Thus, programs must be written for people to read, and only incidentally for machines to execute. Second, we believe that the essential material to be addressed by a subject at this level is not the syntax of particular programming-language constructs, nor clever algorithms for computing particular functions efficiently, nor even the mathematical analysis of algorithms and the foundations of computing, but rather the techniques used to control the intellectual complexity of large software systems.

In other---worse---words, programming is not about writing code that makes the computer do a specific thing, or at least not exclusively or primarily about that. Programming is an exercise in representing abstract ideas and managing complexity while doing that. Programming is as often an exploration of these things as it is an implementation of them.

I will note that none of the ideas below are new or original. I encourage you to check out the appendix of anecdotes and quotes for many takes on this. For just about as long as we have had programming languages, experts and more have argued that code should be thought of as a liability, not an asset; some of the anecdotes are about this and you can find many more online; this is a critical thing to keep in mind.

Code in programming

An average human could probably type about 4,000 lines of code a day. That said, developers do not spend all their time writing code. In fact, developers spend most of their time on activities other than coding.

LOC is a poor predictor of---and is poorly predicted by---other metrics of interest in software development, including defects, effort, and time.

This is important to understand: the generation of code is not the primary work of a programmer by time, nor does the amount of code predict anything useful about the software. And it is doubly important to realize that programmers are not the exclusive participants in the business of software development, whether that be internal solutions, software products for sale, or FLOSS. Code is one component of software development and problem solving, but it does not take the majority of time. Code is not the bottleneck; it never was.

The question comes up everywhere in discussion about generative AI. Some people seem to believe the answer is a firm yes. Others consider the very premise of the question to be absurd. Almost no one asks the question directly, but it is embedded in nearly every take on LLMs.

Should we abandon everything we know?

I do not believe the question is absurd to ask. I think it is absurd to implicitly answer the question without saying you are doing so. I believe everyone should explicitly consider it. I do not know that there is a correct answer. You must answer it, though, and know that you are doing so. And it is probably a good idea to share your answer aloud when you talk about AI, yourself.

Here is my take. Programming is still programming, even if the code is generated by an LLM. Much of the work of programming is not about literally getting code into a source file. Some of the work of programming is getting code into a source file. LLM codegen can accelerate the part of programming that is writing code.

There are other major components to programming which I am not ranting about here, so I will leave it to you to consider whether LLMs can help with those other components.

LLMs are the primary generative AI technology being used for programming, so I focus on these.

LLMs constrain us to primarily text for planning, designing, and implementing. The tooling and norms push us to a markdown planning document and then implementation. This forces us into implementation too soon.

LLMs are trained to generate more output to solve a problem. That is what they are best at. Of course they can do otherwise, but their training and all agent harnesses emphasize generating output. It is an interesting philosophical note that even in reducing, an LLM must generate new tokens.

We get to generated assets and code far too quickly. Code is an incredibly high fidelity prototype, but it is expensive (even with LLMs) to change such a prototype. Who among us has not dealt with the pain of a POC pushed to production prematurely? LLMs encourage this! It is much easier to iterate at the design phase, but LLMs are limited in their ability to operate at that phase. The design iteration loop is not nearly as well supported by LLMs generating text, nor is it emphasized in a meaningful way in agent harnesses. Plan mode does not count.

There is huge value in low fidelity prototypes and designs. The value is psychological and practical. We are not attached to a scribble on a whiteboard, a sketch on some scrap paper, or seven circles and boxes we knocked together in Paint while talking. There is no weight to these things either; their very nature tells us they are disposable. There is no confusion that these things are expert-level creations nor any implicit gravitas. Generated artifacts, on the other hand---even if text-based such as ASCII art or a PlantUML diagram---feel more important and final, and like they are worth holding onto. LLMs confuse our well-honed heuristics about inherent quality in things that would take us longer to reproduce by hand, things that appear impressive on the surface. Plan documents and generated artifacts are too concrete: heavy and so much already set. LLMs rush us through design and promise an implementation now! This locks in too much too soon. The very medium that gives us flexibility also fools us and forces our hand.

And somehow, LLMs bring back the false belief that lines of code mean anything. There is one thing that a high line count guarantees: there are more lines of code that can be changed. Once we are in the code, whether directly or through an agent, we have left the realm of the fastest and easiest iterations, design. It is easier to wipe a line from a whiteboard or throw away a piece of paper than it is to change an implemented solution. Incidentally, it is also easier to do those things than to iterate the same ideas in a planning cycle with an agent.

LLMs entice us with code too quickly. We are easily led.

While LOC is not a good measure of productivity, it does have a direct impact on maintenance. It is hard to find numbers worth citing for the proportion of time that goes into software maintenance; it is easy to find numbers, but the studies have issues. Nevertheless, some searching indicates---and personal experience supports---that maintenance time comprises the majority of development time in software projects.

Humans and LLMs both share a fundamental limitation. Humans have a working memory, and LLMs have a context limit. The techniques to work with these limitations are quite similar. Nevertheless, no matter the technique, more source code is more difficult to deal with than less. There is more to understand. There are more places for things to interact. It is just plain easier to mess things up with more source code; you must be careful and meticulous.

Even if your inference provider of choice offers models with large context windows, context rot comes for all. Whether you like to anthropomorphize your technology or not, maintenance is an area where humans and LLMs benefit from the same things. One of those things is having fewer lines of code. It is good for all.

There is another consideration, as well. There is a common pattern I have seen in my own work and that of many others. Coding agents bring implementation close to hand, too close as I described above. This manifests a different problem: it is too easy to build bespoke solutions. That may sound like a benefit, and perhaps the entire point of coding agents to you. Let me explain further.

When it is so easy to start implementation, it is easy to forget to search for existing solutions. And I have observed in my own interactions with LLMs---and in others' implementations with agents---a distinct lack of sufficient push-back to use established packages and libraries. This is a compounding problem for understand-ability: the source size increases with implementations of things that could be library calls, and the custom implementation must be understood instead of simply referencing library documentation. This yields strictly more code than is necessary, and requires a closer reading of that code than if it used a well known package.

And this is not even considering the LLM-driven solutions that never even needed to be a software project in the first place.

It is worth considering productivity including maintenance, not just vibe-time to first implementation. And with the observation that agent-driven development leads disproportionately to build-over-"buy" decisions, we must consider whether unnecessary solutions delivered quickly count as productivity gains.

I contend that an increased volume of code and pace of change hurts collaboration. I admit that there may be counter-balancing benefits. It has been observed that code is read much more often than it is written, so it is probably worth optimizing for reading, where less is so much more.

LLMs seem mostly to be pitched as---and experience reports I have seen demonstrate---personal productivity enhancements. A big part of collaboration in programming is reading. After all, as observed above, code is primarily a medium for human understanding and incidentally for machine execution. We read each other's code continuously. Good software development practice demands that we peer review every line of code before shipping it. It does not matter how quickly code was generated when it comes time to read and review it. The speed that matters is humans' in review. Less code is more.

And no, Opus reviewing Gemini's code does not count; only when someone from your inference provider takes responsibility for your on-call shift do they get to own code review. We collaborate with one another on more than just getting code into source files. Let me be clear: if I am on the line for production up-time (and I am), then I am personally responsible for every single line of code that could affect that. It does not matter if I used an LLM to help or not; I am responsible for my contributions. And you are responsible for yours. If I wake up to downtime because you did not worry about reviewing what your LLM generated, I will not be happy. "The LLM said it was good," does not recover downtime, nor does it keep me sleeping like a baby.

I do not see much about collaboration from those spouting the gospel of LLM.

There is yet another aspect of collaboration un-addressed. I am engaged in regular dialogue with customers. Customers pay good money for products, and the work of providing products and incorporating feedback is collaboration. If I want customers' trust---and let us be blunt, their money---I must be able to make assertions about the product; simple things such as what it does, how it does it, what they can expect to work, what is coming in the future, and how to deal with any errors in the application. Customer support where I work flows either directly or pretty damn quickly to developers. If the code was written by an LLM without a human understanding it, then this support channel turns into chatbot support by another name: slower and more effortful, but chatbot support nonetheless. And I can tell you with certainty that customers are grateful to get capable humans when they need support. Customers paying good money deserve to get support from a human when they need it.

This is a rant; there is no conclusion.

Maybe ask some questions:

  • what do LLMs provide?
  • (how) should productivity be measured?
  • do LLMs improve this measure?
  • what is the cost of an LLM? (Nota bene: the answer to this one should not be in any denomination of currency)
  • is the value of the LLM worth the cost?

Gratitude

Big thanks to my test readers, Johnny Winter, Gilbert Quevauvilliers, Eugene Meidinger, Bernat Agulló, Daniil Maslyuk, Daniel Marsh-Patrick, and Alex Barbeau. Any errors are, of course, my own.

You should read all of these things

Some may take longer than others.

These are all appeals to authority. That is why they are in an appendix, and not part of the rant itself. Most of these are authorities that you should listen to, though.

  • "My point today is that, if we wish to count lines of code, we should not regard them as 'lines produced' but as 'lines spent': the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger.": Dijkstra
  • "One of my most productive days was throwing away 1,000 lines of code": Ken Thompson
  • "Everyone knows that debugging is twice as hard as writing a program in the first place. So if you’re as clever as you can be when you write it, how will you ever debug it?": Kernighan's law
  • "Basically, lines-of-code is a completely bogus metric for anything": Linus Torvalds
  • -2000 Lines of Code
  • "Measuring programming progress by lines of code is like measuring aircraft building progress by weight": Bill Gates
  • "I hate code, and I want as little of it as possible in our product": Jack Diederich: Stop Writing Classes
  • "The most valuable tools in an AI-assisted workflow aren’t the ones that generate the most code, but the ones that constrain it correctly": Anders Hejlsberg
  • What a programmer does

And a recurring theme in my analytics consulting work:

I often get called in for gorgeously gnarly questions of performance or correct behavior, or both. I typically work on the things that others have tried and failed at, repeatedly. I spend so much more time asking questions such as, "Why," "What is that," and, "Why is it that," than I ever spend writing code. Invariably, after asking such questions for hours, the number of lines of code I have had to write is measured in single digits. I often get to delete code counted in tens or hundreds of lines, and in a couple of glorious cases, thousands of lines in a single measure. More often than writing new implementation code, the solution is to add a dimension or a data structure, rather than write any new implementation code.

The hard problems I find in consulting and in programming generally are not questions of implementation, rather they are questions of questions. A well formulated question is often already halfway answered. Understanding the problem domain well enough to know what questions need answers is the hard part. Forming questions and requirements well is the hard part. Writing the code is often a formality once the domain is well understood, the questions are well formed, and the requirements are clear.

Though this is irrelevant to the article, I cannot help but presume many people will care about this. My experience and my work have nothing to do with the content above.

I use generative AI daily in my work and have for the better part of a year, shipping major new features to multiple producs. My job is to deliver software products and systems that other people use. Someone has to operate, maintain, and extend those software products; often that someone is me. I have unlimited access to frontier models from major labs and many open weight models. I consider it important to share that I no longer have any relationship with OpenAI, nor will I again. I have Claude Code and other harnesses. This is not to brag. I am lucky to work in organizations that pass Joel's test.

This is not a rant bred in the brain of a Luddite. I have exposure to and experience with the subjects I discuss.

My use of LLMs can be broken down into four major categories:

Category Approximate quantity of tasks Approximate time spent in use case
rubber duck planning/design 35 15
better search 25 5
digest docs for examples 25 5
generating code: edit or new 15 75

Rubber duck planning and design is everything you would expect from spec-driven development with agent harnesses. I also use it for exploring architectural ideas, researching available libraries and prior art in a given space.

Better search is pretty self explanatory. Given some project context, LLMs can do a quite good job of being a little research helper. I have found that it is necessary to be very explicit about preferred sources, and that I need to ask for direct links and citations. For anything beyond the most trivial, I mostly use the LLM to find things for me to read, rather than use it to digest or interpret.

Digesting docs for examples is primarily when I am using some new library or a construct I am not familiar with. It is helpful to get contextual examples of how something would fit into my existing code. Usually LLMs are pretty good at basic examples. I have found that nuances of semantics for languages other than C# and Python often escape every LLM, even when reading language docs.

Generating code is exactly what it sounds like. And boy howdy, can these things put text into a source file! I have absolutely no trouble believing people who claim that they generate 10,000 lines of code in a day with an LLM. The problem is this: I want absolutely nothing to do with the code that these things generate, or at least not without a massive number of improvements.

Here is a quick list of the various things I need to fix all the time in LLM-generated code:

  • copying code rather than reusing existing abstractions
  • deleting tests or rewriting them to not test the right thing
  • misunderstanding dependencies and invariants
  • writing implementations that hard-code test cases and return only the tested values
  • preserving test cases for code that has been deleted, trying to make them pass
  • design decisions that are routinely the opposite of what I would make, even with specs and context built up over days
  • an absolute incapability to identify opportunities for abstraction and well-typed solutions without exact prompting for the abstraction to use
  • failure to build interfaces that are consistent and reasonable:
    • inconsistent argument ordering
    • inconsistent naming conventions with other parts of the code base
    • routinely writing code that can only be called if you already know how it is written
  • not following my guidance on style, approach, architecture
  • not following authoritative guidance on how to use a library
  • implementing shitty facsimiles where they should use a battle-tested library
  • using a whole framework where they could write a single short function
  • outright refusal to build upon established abstractions, instead glomming new code on, rather than using or extending core components of a solution

These are not specific to any LLM, harness, or environment. These issues have come up for me with C#, F#, elisp, DAX, M, Bash, Ruby, Python, and OCaml. These issues occur in domains ranging from dimensional modeling, data engineering, parsers, compilers, writing CLIs, ASP.NET, general web programming, basic system automation, and building system daemons. An LLM can give a perfectly lucid analysis of an architecture, library, individual function, or data structure; it can identify extension points and places that might need improvement; it can find limits and give a detailed analysis of how to integrate new functionality. After they do all that, they end up doing something that looks like this.

I will note that all of these failure modes end up yielding more code than is necessary. All of my fighting with LLMs is to get them to write less code.

When I deal with LLM-generated code, I have a bad time.

I am mostly frustrated when using LLMs for code generation. It often feels futile to get an LLM to generate code that I would accept as a PR reviewer. I constantly question whether they are a useful tool for the programming I do.

I have to tell you, the value I get from LLMs does not come in any way from getting code into a source file faster.

联系我们 contact @ memedata.com