Eight more months of agents

Original link: https://crawshaw.io/blog/eight-more-months-of-agents

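## LLM-driven programming: 2026 update

Over the past year, the landscape of programming with large language models (LLMs) has changed dramatically. Agent harnesses have not kept pace, but the *models* themselves have made remarkable qualitative progress. A year ago an LLM could write roughly a quarter of the author's code; now the figure is nine tenths, and the model can even handle the necessary adjustments. The author sees this shift as an important economic signal, and it has moved the split between reading code and writing code from 80-20 to 95-5. Interestingly, traditional IDE use is waning. Despite the great potential of IDEs, the author has returned to Vi, finding agents a better tool than even an advanced IDE with LLM assistance such as Copilot. It is essential to use *only* the most capable frontier models (such as Opus); cheaper alternatives teach the wrong lessons and hinder progress. Sandboxing remains a key challenge; a virtual machine is recommended for running agents unconstrained. That experience drives the author's new project, `exe.dev`, which aims to provide easily accessible, unconstrained agent environments. The author believes the future of software development lies in building for programmers first, recognizing that agents will increasingly write code *against* these products. While acknowledging the potential for social disruption, the author remains optimistic, seeing LLMs as a powerful tool akin to power tools in carpentry: one that fundamentally raises productivity and enables broader exploration.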

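## Hacker News discussion: the evolving role of AI agents in coding

A recent Hacker News discussion centered on the growing capabilities of AI agents in coding and their effect on developer workflows. While acknowledging that agents are becoming more powerful, many commenters voiced concerns about current approaches. One key point of contention was whether agents inside an integrated development environment (IDE) are better than agents driven from a command-line interface (CLI), with some arguing that IDEs encourage deeper understanding of the code. Others favored LLM-driven autocomplete as a more efficient, less intrusive alternative that offers faster feedback loops and more control. The discussion also stressed the importance of using "frontier" LLMs (the most advanced models), on the grounds that working with older models can build bad habits. Cost is a factor, and some users are experimenting with cheaper models alongside premium options such as Opus. Beyond workflow, commenters touched on broader implications, including the need for good API documentation in an agent-driven world and concerns about the social impact of rapidly advancing LLMs. The debate highlights a fundamental divide: some see LLMs as powerful tools, while others are wary of their potential negative consequences.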

Original text

2026-02-08

I wrote up my experiences programming with LLMs a bit over a year ago, and updated it for the world of agents eight months ago. A lot has changed since then, so here is an update.

Agents have improved dramatically in a year

We were prototyping our first agent, Sketch, when Claude Code was released 12 months ago. So I, by good fortune, got to be there and be excited right at the beginning. Agents could be helpful for some things some of the time!

Agent harnesses have not improved much since then. There are things Sketch could do well six months ago that the most popular agents cannot do today. The agent harness is critical, there is plenty of innovation to be done there, but it is as interesting a space right now as compiler optimizations were during the megahertz explosion of the 1990s.

Right now, it is all about the model.

And on the models: there are plenty of public benchmarks but they have all been gamed to death. Ignore them. Clearly the frontier model companies have good internal evals, because the models have qualitatively changed dramatically. In February last year, Claude Code could write a quarter of my code. In February this year, the latest Opus model can write nine tenths of my code. It all needs to be carefully read, and regularly adjusted, but now I can and do rely on the model to do the adjustments for me.

There has been no obvious change in models. Nothing like when GPT-2 started talking back. There has, however, clearly been a huge incremental improvement in the ability of coding models to get to useful results. (All of this, admittedly qualitative, progress is the most positive economic signal I see today.)

At a big company, my time was 80-20 reading code to writing code. At a startup, it used to be closer to 50-50. Now it is 95-5.

IDEs are clearly waning

The history of IDEs is so strange.

On the one hand, the IDE is obviously correct. Of course I should have a development environment that provides as much information and assistance as I can effectively use. By far the greatest IDE I have ever used was Visual Studio C++ 6.0 on Windows 2000. I have never felt like a toolchain was so complete and consistent with its environment as there.

Since those glorious moments in 1999, I have spent more of my programming life outside of IDEs than in them. The truth of programming environments is they are a hot mess. Unix was great; the Howl's Moving Castle we have bolted onto an over-taxed set of Unix concepts, not so much. The same thing happened to that Win32 API I used to use in VS6.0: still there, with a giant mess atop and around it, and entirely unignorable.

Then Copilot came out and it seemed the IDE was inevitable. It did not matter how miserable it was trying to fit your IDE into your environment, you had to do the work because LLM-assisted auto-complete and edit were too powerful to ignore. They made my typing go 50% further, and a large amount of the programming I do is typing-limited, so the effect was enormous.

In 2021, the IDE had won.

In 2026, I don't use an IDE any more.

The degree of certainty I felt about a Copilot future, and the astonishing whiplash as agents gave me a better tool not four years later, still surprises me.

The only IDE-like feature I use today is go-to-def, which Neovim is capable of with little configuration. So here I am, 2026, and I am back on Vi.

Vi is turning 50 this year.

Using anything other than the frontier models is actively harmful

A huge part of working with agents is discovering their limits. The limits keep moving right now, which means constant re-learning. But if you try some penny-saving cheap model like Sonnet, or a second-rate local model, you do worse than waste your time: you learn the wrong lessons.

I want local models to succeed more than anyone. I found LLMs entirely uninteresting until the day Mixtral came out and I was able to get it kinda-sorta working locally on a very expensive machine. The moment I held one of these I finally appreciated it. And I know local models will win. At some point frontier models will face diminishing returns, local models will catch up, and we will be done being beholden to frontier models. That will be a wonderful day, but until then, you will not know what models will be capable of unless you use the best. Pay through the nose for Opus or GPT-7.9-xhigh-with-cheese. Don't worry, it's only for a few years.

Built-in agent sandboxes do not work

The constant stream of "may I run cat foo.txt?" from Claude Code and "I tried but cannot go build in my very-sophisticated sandbox" from Codex is a nightmare. You have to turn off the sandbox, which means you have to provide your own sandbox. I have tried just about everything and I highly recommend: use a fresh VM.

I have far more programs and services than I used to

This is why I am building exe.dev. I need a VM, with an unconstrained agent, that I can trivially start up and type the one-liner I would have otherwise put into an Apple Note named TODO and forgotten about. A good portion of the time Shelley turns a one-liner into a useful program.

I am having more fun programming than I ever have, because so many more of the programs I wish I could find the time to write actually exist. I wish I could share this joy with the people who are fearful about the changes agents are bringing. The fear itself I understand, I have fear more broadly about what the end-game is for intelligence on tap in our society. But in the limited domain of writing computer programs these tools have brought so much exploration and joy to my work.

I am extremely out of touch with anti-LLM arguments

New technology brings a lot of challenges and reasonable concerns. I spend my days trying to push the limits of agents, so I see them fail catastrophically several times a week. Significant change also changes labor markets, which has many effects, good and bad. In 1900, 33% of Americans lived on a farm, and 40% worked in agriculture. In 2000, less than one percent lived on farms and 1% of workers were in agriculture. That was a net benefit to the world, that we do not all have to work to eat. (The numbers are even more dramatic if you go back another century.) But a lot of pain and heartbreak can and did happen along the way. It is right to be concerned.

But far more than measured analyses of the reality of the changes that are happening, I see hard anti-LLM takes that a year ago I disagreed with, and now I just cannot understand. It sounds like someone saying power tools should be outlawed in carpentry. I deeply appreciate hand-tool carpentry and mastery of the art, but people need houses and framing teams should obviously have Skilsaws. To me that statement is as obvious as "water is wet".

A lot has to change

Most software is the wrong shape now. Most of the ways we try to solve problems are the wrong shape.

To give you an example, consider Stripe Sigma. This product is a nice new SQL query system for your Stripe DB. It has a little LLM built into it to help you write queries. The LLM is not very good. I want Claude Code or Codex writing my queries. But Stripe launched a fancy Sigma UI with an integrated helper before their API. There is a private alpha for the SQL REST endpoint that I do not have access to yet. So instead I had my agent do ETL-from-scratch: it used the standard Stripe APIs to query everything about my account, build a local SQLite DB, and now my agent queries against that far better than Sigma can.
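As a rough illustration of the ETL-from-scratch pattern described above (this is not the author's actual code), here is a minimal Python sketch. It assumes the payment records have already been fetched: `fetch_charges` is a hypothetical stand-in that returns hard-coded sample data instead of calling any real API, and the table layout and field names are illustrative choices rather than a definitive schema.

```python
import sqlite3


def fetch_charges():
    """Hypothetical stand-in for paging through a payments API.

    Returns hard-coded sample records so the sketch runs offline.
    Amounts are integers in the smallest currency unit (cents).
    """
    return [
        {"id": "ch_1", "customer": "cus_a", "amount": 1200, "currency": "usd", "status": "succeeded"},
        {"id": "ch_2", "customer": "cus_a", "amount": 800, "currency": "usd", "status": "succeeded"},
        {"id": "ch_3", "customer": "cus_b", "amount": 500, "currency": "usd", "status": "failed"},
    ]


def load_charges(conn, charges):
    """Create the local table if needed and upsert the fetched records."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS charges ("
        "id TEXT PRIMARY KEY, customer TEXT, amount INTEGER, "
        "currency TEXT, status TEXT)"
    )
    # INSERT OR REPLACE keeps re-runs idempotent: the same id overwrites its row.
    conn.executemany(
        "INSERT OR REPLACE INTO charges "
        "VALUES (:id, :customer, :amount, :currency, :status)",
        charges,
    )
    conn.commit()


def revenue_by_customer(conn):
    """Example query an agent might write against the local copy."""
    rows = conn.execute(
        "SELECT customer, SUM(amount) FROM charges "
        "WHERE status = 'succeeded' GROUP BY customer ORDER BY customer"
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")  # use a file path to keep the DB between runs
load_charges(conn, fetch_charges())
print(revenue_by_customer(conn))
```

Once the data sits in a local SQLite file, any agent with shell access can query it with ordinary SQL, which is the property the workaround relies on.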

I implemented that entire Stripe product (as it relates to me) by typing three sentences. It solves my problem better than their product.

That's the world we are in today. By far the worst products I had to use every day in this new world were clouds, so that's what I'm building over at exe.dev. It's a lot harder than it looks, but the entire point of the product is you should never feel that your agent should rewrite part of it for you.

Along the way I have developed a programming philosophy I now apply to everything: the best software for an agent is whatever is best for a programmer. The practical nature of writing software for customers has traditionally pushed us away from that philosophy. Product Managers have long had to find gentle ways to tell engineers: you are not the customer. Well, that has all been turned on its head. Every customer has an agent that will write code against your product for them. Build what programmers love and everyone will follow.

Hopefully that philosophy will survive the next year of changes wrought by LLMs.