Anthropic/OpenAI may be spending more than $1000 for every $100 you pay them

Original link: https://ea.rna.nl/2026/06/07/anthropic-openai-may-be-spending-more-than-1000-for-every-100-you-pay-them/

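This summary, based on the author's months of experimenting with Claude Code, analyses the economic viability of using large language models (LLMs) for complex programming. For simple, fault-tolerant queries the cost of using an LLM is almost negligible, but the author points out that 'agentic' coding is a different matter entirely. Implementing features in a complex code base requires recursive, high-intensity 'thinking' models, which consume large amounts of invisible 'dark tokens'. Such tasks are computationally expensive: at API rates, a single query often costs $25 to $75. Current consumer subscription plans (such as the $100/month plan) heavily subsidise this usage. The author estimates a 'subsidy factor' of 2.5 to 12, meaning the actual operating cost is 2.5 to 12 times the subscription fee. So although LLM-assisted coding is a powerful productivity tool, it is not, for now, economically sustainable as a standalone business model. The author warns that this 'brute force' party will probably end as the companies move towards their IPOs and prioritise profitability. He also notes that Anthropic's recent model iterations appear to be reducing the intensity of recursion in order to control ballooning infrastructure costs. Ultimately, the success of LLMs in coding does not stem from real intelligence, but from applying massive computation in a domain with strict logical constraints.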

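A recent Hacker News discussion questioned the claim that companies such as Anthropic and OpenAI lose more than $1000 for every $100 of subscription revenue. Sceptics argued that the figure lacks a factual basis and ignores the differences between the API business and consumer subscriptions. The main themes of the debate were:

  • Business model: many participants believe AI companies run subscriptions as a loss leader or at cost, and earn their high margins on API usage.
  • Technology deflation: commenters argued that the 'real cost' of AI will inevitably fall as hardware becomes more efficient and models become smaller and more capable, so that costs eventually stabilise.
  • Hardware evolution: some speculated that AI may move towards task-specific chips with 'hard-coded' weights, which could eliminate recurring subscription fees.
  • Market scepticism: other users were more cautious, seeing the current high spending as a hype cycle meant to inflate valuations ahead of going public, and comparing it to the 'subsidy' era of ride-hailing apps.

In the end, participants remained divided on whether the current AI business model can last, or whether the industry is heading for a major market correction.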

Original article

For reasons that will remain hidden, we resume writing about Generative AI/LLM after a hiatus of 15 months (that one from October 2025, and the one from June 2025, don’t really count as serious pieces). Today, the first of two articles about “coding with Large ‘Language’ Models”, as coding with LLMs is positioned as the ‘killer app‘ for LLMs.

And while we’re at it. This is a long and winding piece (I am really having some trouble on this front, apologies), so let’s provide:

Here is a part of a screen shot of the application I am building (and having some fun with). It’s a real application which may support me in creating diagrams I need (so it’s a combination of data and graphics). The goal has been to investigate ‘coding with an LLM’, the thing I am building is just an example I thought I could use because I am somewhat experienced on the subject:

Let’s start with an important observation: Thanks to the combination of Claude Code and my own (rusty, but solid enough) programming background, I have been able to let Claude Code create an application (unfinished as of now, but functional) that I would otherwise not have been able to create in such a short amount of time, which — given time and energy — would have meant: ‘not at all’. But there is a catch. A few even. I will report later on the technical issues and an answer to the question: “To what extent does Claude Code actually ‘code’?”. Today it is simply the economics that struck me.

I built up my ‘vibe coding’ (I really dislike that term, there is little ‘vibe’ about coding with or without an LLM) experiment over — by now — 4 months (not full time, note). I started with a very small project to get a feel of LLM-coding, finally settled on Claude Code using Opus 4.6 on medium effort setting. If I used a higher setting it tended to get lost in the woods more often. Less, and the results were poor (more on managing the quality of the code in that later post). After a while on my final more serious project (the last two months) the results of ‘medium effort’ got worse, so I switched to ‘high’, returning to approximately the same quality.

Managing cost was also part of the experience. First, I took out a $20/month subscription. I quickly ran into usage limits. You have a limit that resets every 5 hours and one that resets every week, and you can go beyond the limit by buying tokens at API-pricing. So, I tried adding some extra funds. And I noticed that the use of purchased tokens at API-cost was obviously much more expensive than staying within the usage limits. When I was still on the $20/month plan, and I bought some extra tokens to get some job finished, within a few days I had bought about $80 of tokens. At that point it became clear to me that paying $100/month was a much, much better deal than using the cheapest option and adding funds at API-cost level when needed.

I had already been looking at cost, e.g. training versus inference (generating results by Generative AI) a while back, concluding that not training, but inference/generation is what drives the cost. Having concluded that, I now started to look into the cost of LLM-use to perform a task. And I — disclosure — actually used a few LLM-chatbots (Gemini, Claude) to help me in this research. Now, before you stop reading because you are convinced these LLMs are too unreliable, they really are pretty decent search engines, even if they make mistakes on content they produce themselves. After all, they have ingested all those ArXiv articles, and what a classic web search based on page-rank would not have been able to do, LLMs can: they find you information based on its actual content. People using LLMs in an office setting will corroborate this.

The first answers I got were the standard quick & dirty answers, like Sam Altman’s famous: inference having become ‘too cheap to meter’. Such statements turned out to be true but also misleading/incomplete, very much so, as you will see. At the end of the day, the price ‘per token’ is not what is relevant for users, it is how much you pay for getting a ‘resolution’, or a ‘task finished’. And we already knew that while the results have improved, the amount of tokens had skyrocketed, especially with the indirect/recursive (‘thinking’) models. [Aside: calling them ‘thinking models’ is extremely misleading. They do nothing of the kind. What they do is above all a massive amount of invisible recursion, indirection, and trial and error. Like an LLM creating 20 or 100 approaches behind the scenes, doing that time and again, using some additional tools (like first generating a Python script, then running that script, and having to run it multiple times until there are no obvious bugs in it), maybe even testing it a few times, and then using the output of that as more input for the LLM. And so on. A lot of indirection and recursion, a lot of ‘trial and error’, everything almost invisible to the user.] So, I investigated that trend, starting with “How much tokens did the average query use between Q1 2023 and now?” so I could combine falling per-token cost with rising amount of tokens used. After a few tries, I started a new conversation with a more focused start prompt.

It is important to keep in mind that people do not want the result of a single query, that is only true in the most simple tasks. In reality, they often make multiple queries and there is a back and forth, before they accept the result. So, you also need to estimate how many queries on average it takes to get to a resolution.

And then there are all those tokens you do not see. There are tokens that aren’t part of billing, there are ‘dark tokens’, and with the mis-labeled ‘thinking’ models there is a massive load of indirection/recursion and an incredible amount of ‘trial and error’ in the background going on, things you do not see as input or output, but all using a shitload of tokens. We know that these ‘recursive’ models use amounts of tokens that dwarf what you as a user see as result.

These are also tokens that you do not see, but they are nonetheless billed to you at visible output (generated) token cost. To get an idea of cost: when you pay per token, Opus 4.6 charges you $5 per million input tokens (the data you give it, e.g. what it extracts from your code base on every query and what it reads in the background), and $25 per million generated tokens (including the generation that happens in those ‘recursive efforts’ in the background).
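To make that billing rule concrete, here is a minimal Python sketch. It uses only the two per-token prices quoted above; the function name and the token counts in the example are hypothetical and chosen purely for illustration.

```python
# Prices quoted in the article for Opus 4.6, in USD per million tokens.
INPUT_USD_PER_MTOK = 5.0
OUTPUT_USD_PER_MTOK = 25.0


def query_cost_usd(input_tokens, visible_output_tokens, hidden_tokens=0):
    """Cost of one query at API rates.

    Hidden ('recursive') tokens are billed at the output rate,
    even though the user never sees them.
    """
    billed_output = visible_output_tokens + hidden_tokens
    total = input_tokens * INPUT_USD_PER_MTOK + billed_output * OUTPUT_USD_PER_MTOK
    return total / 1_000_000


# Hypothetical query: 200K tokens read, 2K visible reply, 40K hidden tokens.
print(query_cost_usd(200_000, 2_000, 40_000))
```

In this made-up example the hidden tokens alone account for about half of the bill, even though the visible reply is tiny.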

Anyway, almost all of this is a lot of estimation. There is precious little hard data. This post is based on the best reasonable estimates I could come up with, which rest in part on hard data: my own experience so far.

To show the development of use-cost, I’ve come up with a little framework:

The first element is how to measure per-token cost. We will look at subscriptions below, but the standard we follow is how ‘API per-token’ costs have developed over time. This is the best estimate we have of real token cost, but it is fragile, because as long as the vendors do not open up about this we must make an assumption. Do they make a profit at API-pricing? Or is that use still actually loss-making and subsidised? Either is a believable scenario at this time. And weird things have happened to API-pricing (see below). But we do not have better. Today, a top model like Claude Opus 4.6 costs you $5 for every million tokens you put in (your code, your documents, your queries) and $25 for every million tokens that Claude Opus 4.6 generates. That sounds very cheap, but here the ‘stuff you do not see’ is going to play a role.

Then there are two kinds of tasks people use LLMs for. One is fault-tolerant: it isn’t extremely important that the result is very accurate, this is the use that may be compared to ‘cheap clothes’: not very good, but cheap (an example taken from the start of the Industrial Revolution when physical labour started to get automated in factories, something the Digital Revolution is now doing to some mental labour). Such requests will also be followed up with far fewer follow-up queries and replies, people simply accept what the LLM produces.

But there are areas where accuracy is extremely important: coding is an example. IT is extremely brittle. One small error can bring the airline down, after all. Such quests are therefore fault-intolerant, we need correct results. Other examples may be health, finance, etc. where small errors may have large negative consequences. The amount of accuracy is especially an issue you have with Generative AI. It is large-scale statistical estimation, so inaccuracy is an unavoidable risk.

In other words, there is not a single kind of ‘quest’ (a series of queries until resolution, i.e. a task) that people use LLMs for, it ranges from a quick question where high accuracy is not really that important to tasks that are extremely dependent on actually correct results. The (‘fast’) budget models are quick, and indeed ‘too cheap to meter’ for simple stuff, and as long as the resolution is well represented in the data, and as long as the issue is not too complicated, the models do a ‘good enough’ job for many. Let these budget models loose on anything serious, however, and their results are by definition unreliable and thus often poor. But frontier models and — especially — recursive (‘thinking’ — ugh) models can be applied to such more complex tasks, and more difficult requests.

Today’s frontier models, such as Claude Opus 4.x or GPT 5.x do much more estimation, have internal parallelism, work with larger token dictionaries, have more dimensions, and the questions they are used for are generally larger. Expect several thousands of tokens on average for a quest, and more calculations per token.

And then there are the (frontier) — repeat ad nauseam: misnamed — ‘thinking’ models. These models may use tens of thousands, even millions of tokens, on a single quest depending on the task at hand. I actually ran into something like that with Claude 4.6 Opus in my coding experiment which — on a code base grown to about 36000 lines — used ~1 million tokens on a single (quite simple, for a human) query. That’s $25 for a single query of a single task when billed at API-prices. An outlier, certainly, but not that much of an outlier (a few hundred thousand tokens for a single query I saw much more often).

There is a third aspect that is easy to overlook. Because there is a difference between ‘correct‘ results and ‘accepted‘ (as correct) results. There is a difference between ‘when am I satisfied with the result’ versus ‘is the result actually correct’. In some cases it is much easier to verify correctness (think math, partly code — e.g. “does it compile?”, “does it run?”) than in others, like the research I have done for this article. At some point during my research I got results where my ‘common sense’ told me: “This cannot be right.”. And by digging in I found out that the LLM had taken quite a weird shortcut. For which it — of course — apologised when I pointed that out. [Aside: apologies by LLMs are as irritating as apologies with the help of LLMs.]

It is important to note that we are looking at a moving target here. When ChatGPT arrived late 2022, the average query may have been around 200 tokens, the reply (inference) around 400. But the things we do with ‘frontier’ models today were completely impossible with the frontier models of 3 years ago. What was ‘frontier’ capability then is ‘budget’ now. So, we look not so much at the history of ‘cost per token‘, but the history of ‘cost per task‘, where — as LLM systems have grown (with lots of scaling and engineering and little fundamental improvement) — the most difficult quests have become more complex too. Our use has grown with the growth of brute force these systems can bring to the table. And that leads to the question I was really interested in: How much more expensive has our use become? And how much do we actually pay?

I ended up plotting the estimated history of ‘cost per task’ for the following categories in my framework:

  • Budget conversation (fault-tolerant). Reliability isn’t extremely important, it’s simple stuff that is hard to get wrong, using a budget model is fine.
  • Frontier conversation (fault tolerant). Using a top model for a back and forth, but the result doesn’t need to be perfect.

The others are all based on using the recursive/indirect models. The vendors call these ‘thinking’ models but as said above that is utter bullshit. I will use ‘recursive models’ as a shorthand.

  • Complex reasoning — correct. A conversation about a complex issue and the reasoning is correct. How much does that cost?
  • Complex reasoning — accepted. A conversation about a complex issue and the reasoning is accepted by you, but it isn’t necessarily correct. How much does that cost?
  • Simple coding — correct. You ask a model to create a simple script, maybe a few hundred lines. Or you ask it to change something in such a script. You get a correct result.
  • Simple coding — accepted. You accept the result, but there are issues you haven’t noticed. Again, the gap between these two in coding is much smaller than for complex reasoning, for the simple reason that it is easier to find out if a result is incorrect with something like code.
  • Complex multi-file coding — correct. You are asking for a substantial change in a larger code base. E.g., my experiment is about a code base now grown to 40k lines of code (mostly C++). We’re talking about a single change that touches multiple files in the code base. Doing this without the code turning into a mess in my experience requires the most powerful models such as Claude Opus 4.6 in extended (recursive) mode. Note that this quest may take several back-and-forths between you and the model, possibly with some testing in between.
  • Complex multi-file coding — accepted. You accept the change, but actually it isn’t correct (and you find out later). Actually my experience is that a lot of the time I am (together with Claude Code) busy debugging the code Claude Code has generated. Say, it doesn’t work as planned, Claude convincingly says it has found the problem. It creates a fix. Which doesn’t work, etc. Note: the result is still far more than I ever could have done without Claude Code (but details and caveats in another story).

As the frontier models have applied ever more brute force to their method, they have been able to perform harder tasks with acceptable quality. So our use has shifted with their capabilities. And so has the cost. The stuff we did early 2023 — and still do today — has indeed become ‘too cheap to meter’ as budget models can handle them well enough. But what about the other types of tasks?

Here is the first graph. It shows the development of cost (again: while results improve, coding with LLMs in 2023 and 2024 was almost impossible) of the standard queries with budget and frontier models, as well as complex reasoning and simple coding tasks with the recursive models.

So what do we see here? Well, cost for simple things has come down (indeed this may be ‘too cheap to meter’), and the growth of brute force has made things feasible that weren’t feasible before, like coding (within limits). A simple coding task in a small code base (or none) might cost something like $2-$4. But the effort — and with that the cost — per more complex task/quest has risen dramatically. There is a huge dip, which is when Claude Opus suddenly dropped to a third of the per-token price. (Why this happened is an interesting question (it doesn’t seem extremely likely to me this was all a threefold efficiency breakthrough), but the trend has stayed the same even if the sudden price drop did reset the graph).

Interestingly, because it is much harder for the user to judge the correctness of ‘hard reasoning’ tasks than it is to judge the correctness of code (code is brittle and many errors will soon show when the result is used), the gap between ‘accepted’ and actually ‘correct’ for ‘hard reasoning’ is larger than for coding. (These gaps come from estimation on how many back-and-forths you need to get an accepted or correct result, for which Claude Opus 4.6 high produced some data).

But serious coding on reasonably sized code bases is quite a different story:

The really hard stuff — for which making correct changes in a medium-sized code base of about 40k lines (my test project) is a good proxy (though with caveats) — seems to have completely exploded in cost, because doing that reasonably well requires exponentially more calculations, a growth that completely dwarfs the cheaper per-calculation cost.

Which implies that actual coding cost per task has exploded. Note that we are talking about solving a single task/quest in a code base at API-pricing costing somewhere in the neighbourhood of $65. A programming job with humans in the loop (the only one I think that makes sense given the risks) will see several of those in a day.

Let’s go back to my experiment. When I moved to the $100/month subscription, I almost never did hit my usage limits anymore (mostly because I’m not doing this full time, on the contrary). But at one point — after a massive change in the entire code base that meant a real big effort for Claude Code — I hit my 5-hour limit again. That gave me the possibility to continue with the same work I had been doing but now at API-cost (which is the cost-type we have been talking about until now): I gave it $20 to spend at API-rate and 20 minutes later that had been spent.

Yes: $20 in 20 minutes. And note, that did not complete a full resolution of the — large but superficial — change that had been planned and that Claude was busy implementing. Recall, I have seen a single query use a million tokens before, and at API output token rates that single query would have cost me $25. A serious software engineer working full time on a reasonably sized code base may do 5-10 such tasks in a day.

At the beginning of 2023, we were talking about, say, 200 tokens in and 400 tokens out for a query. And not a lot of hidden or dark tokens. And now, the scaling has exploded, the use of water and energy has too. [Aside: these days we do not measure data centers in TFLOPS (the amount of actual performance you get) but in gigawatts, so not the performance of the car, but how much gasoline it slurps. That is remarkable in itself, and it is reminiscent of “a race to build the world’s heaviest airplane”…] (As I wrote earlier in simpler LLM times: we have never seen a linear link between scale and performance, it’s more ‘exponential brute force growth delivers linear — after a while even slower — improvements’. Scaling doesn’t solve this.) (Real fun: I actually roughly estimated then — from GPT-3 numbers — that to get to human-level performance you would need to scale to ~3,500,000,000,000,000 times the size of GPT-3. Ouch. By the way: I have seen nothing yet that changes that estimate substantially. So some will get useful tools, but we certainly won’t get AGI from scaling — and that includes scaling via trial-and-error generation and use of symbolic parts.)

Anyway, back to my experiment and the final test project. Apparently I have been doing stuff for $100 a month that would have cost me that several times over if I had been paying API-rates. Now my experiment is still ongoing, but I have drawn/made some tentative economic conclusions/estimates (some corroborating what others have been saying already, this is of course not entirely new — though the intersection of engineers really using this technology to build something serious with those that are critical/skeptical is probably not that large).

My test scenario is by far not the worst one. I am doing ‘human-in-the-loop’ (for good reason), which means I can’t burn tokens as fast as possible. I am working from a greenfield start, so there is not a lot of crud making things hard for Claude Code. Even with ‘human-in-the-loop’ and not doing this full time, using the stats Claude Code can show me, I have paid $100 for 4M tokens at API-price, and $180 for subscription, and that subscription used tokens that at API-pricing would have cost me about $450. That is a subsidy factor of 2.5, which I can pretty confidently say is a lower bound for the subscription-level subsidy of a Max account that is actually used for coding. But I also know how much of my weekly limits I hit, which is about 20% currently. So, max out the weekly limit (using unrestrained ‘agentic’ use with no permanent human-in-the-loop for instance — a really bad idea, but that is for the other story) and the subsidy factor becomes roughly 12. (By the way, ‘full agentic’ means simply letting the LLM loose on your system without supervision, letting it generate and execute scripts at will there, something I think you really only should do if you have isolated that system from anything valuable).
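The arithmetic behind those two subsidy factors can be written out as a small Python sketch. The dollar amounts and the 20% figure come from the paragraph above; the function names are hypothetical, and the extrapolation assumes that token consumption scales linearly with the share of the weekly limit that is used.

```python
def subsidy_factor(api_value_usd, subscription_paid_usd):
    """How many dollars of API-priced tokens each subscription dollar bought."""
    return api_value_usd / subscription_paid_usd


def factor_at_full_limit(observed_factor, share_of_limit_used):
    """Extrapolate to 100% of the weekly limit.

    Assumption: usage (and therefore API-priced value) grows linearly
    with the share of the limit that is consumed.
    """
    return observed_factor / share_of_limit_used


observed = subsidy_factor(api_value_usd=450, subscription_paid_usd=180)
maxed_out = factor_at_full_limit(observed, share_of_limit_used=0.20)
print(observed, round(maxed_out, 1))
```

The extrapolated value comes out at about 12.5, in line with the "roughly 12" quoted in the text.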

In short: brute-forcing that really complex stuff, like editing code, might be the ‘killer application’ that currently is used to sell the Generative AI business as a usable tool to the world (and even as a road to AGI with Anthropic’s suspicious blog post on ‘recursive self-improvement’). But the actual costs are kept hidden from view for many users if they use subscriptions.

So what are my tentative conclusions?

  1. This ‘brute force code editing’ party cannot last. Really, it can’t. Would I have started this experiment for ~$1200/month? Nâh. Can they keep subsidising coding-support at this level? Definitely not. Can we do with far fewer programmers if you give them this tool at real cost? That is something for the other story if I get around to it. For now: the party probably lasts until the IPOs when reality must eventually hit.
  2. I suspect that I understand one driver for Anthropic’s developments since Opus 4.6. Both Opus 4.7 and Opus 4.8 clearly are not applying as much recursive brute force as Opus 4.6 can. See this Reddit thread for some of the comments after 4.8 was released. Opus 4.6 is in my experience the best so far (and, sure, it has enabled me to build that system that I would otherwise not have been able to in such a short timeframe), but my experience too is that 4.8 is a step down from 4.6. My personal guess is that the subsidy factor is already a very nasty problem for Anthropic (and OpenAI) and that they are struggling to keep close to the best ‘oversized brute force’ qualities while not burning so much cash that they won’t make it to the IPO in one piece. It definitely does smell a bit like a combination of desperation and ‘pump and dump’.
    • Not providing Mythos to the general public may in part have simply been because it is far too expensive for them to run. I listened to a presentation of one of the Glasswing partners recently, and they mentioned as an aside that single tasks investigating their own code, could cost them $35k in API-cost. That is 1.4 billion — with a ‘b’ — tokens for a single task.
  3. There are more reports about actual cost versus what people pay and I tend now to find these ever more convincing. Complex coding has become the poster child of Generative AI, but if I really do this in a serious way (decent code base, and I know code, so I can actually see if the system starts to mess up the quality) I need to use a lot of brute force using a recursive/extended model, and they may be making a $10 loss on every $1 I spend, if not more, if I do this full time.
  4. Situations that are fault-tolerant, like trying to find vulnerabilities and exploits in existing code based on the patterns encoded in LLMs, might be affordable for large players such as state actors.
  5. Anthropic’s post on ‘recursive self-improvement’ lacks even the most basic critical approach as well as any number on the cost of all that edit-code-inference. But they need it on the way to their IPO, I guess.

Which leaves open the other key question: how good is that code anyway? Which is a different subject for a different time.

So:

  • Enjoy it while the party lasts (and for me it probably either ends when they retire Opus 4.6 or have their IPO and economic reality comes knocking).
  • And prepare for when it ends, and you may have to maintain the code that heavily subsidised ‘brute force’ has gotten you. (It was one such action where a single query cost me a million tokens).

Caveats

The preparation of this article has differed a lot from my normal way of working. For one, I don’t really use LLMs for serious analysis (search is really helpful, though). Too unreliable, and that was clear now too: I had to travel through several erroneous scenarios, e.g. where I was presented with Opus 4.6 data combined with (three times as expensive) Opus 4.1 pricing (“My apologies…” said Claude. Again.). But the overall gist is, I suspect, more or less correct. Now, if you are a big believer, and thus my critical story must be wrong, remember it is based on what Claude Opus 4.6 ‘thinking’ at ‘high effort’ produced/found for me…

Good data is almost impossible to get. You can look at Opus 4.6 pricing, but there are various levels of ‘effort’ that influence how many tokens are used. What we do know is that the hidden ‘recursive’ tokens are billed at the price of output tokens. But how high you have to set your effort to get the sweet spot between ‘dumb/bad code’, ‘good enough results’ and ‘careening off track and lost in the woods’ cannot really be estimated. So, I have also kept my own experience of the last few months in the back of my mind. It is all an educated guess, nothing more. But my stats from my personal use are pretty hard, so we have at least some hard, but anecdotal, evidence.

Am I someone who has a year of experience with everything ‘vibe coding’? No. I’m still (the rusty remains of) a classical software designer/engineer. Enough of a software engineer, by the way, to be able to recognise good and bad decisions by Claude Code, and to experience a real speedup in my experiment. At this level of subsidy, it is definitely worth it for someone like me, with the project I am doing.

Final thought — an AI lesson from 30 years ago, applicable today

Why has ‘brute force code editing’ become so good? (Again: for the caveats on ‘good’, wait for the other shoe (i.e. article on this blog) to drop.)

Here is something I remember (warning: memory is in part fantasising about the past to support your current convictions) that offers a perspective. When, in the second half of the 1990s, I worked for the Dutch Advisory Council for Science and Technology Policy (a government think tank filled with top people from academic science and technological business, I was a scientific staff member — they actually hired people with real technical experience and not just powerpoint shamans) part of my job was to read a lot of science journalism. And I came across a short research note in a US science magazine — I think it was Scientific American — that said researchers at USPS had been able to increase automatic handwriting recognition to something like 76% success rate, up from something like 71%, the top in the world at that time. Now, such technology was in hot demand at postal operators so they could automate most of the sorting of letters, saving a lot of cost. 76% was of course far from enough to actually use this.

But the note surprised me, because I knew the Dutch postal service (PTT Post at the time) already was using automatic address recognition in production with over 99% reliability. So, the question obviously was: how? It turned out that there were two basic technologies at the time. One was based on pixel-based pattern recognition, the other was based on extracting a vector-based representation of the handwritten address. The pixel-one had a lot of trouble with sizes, the vector one had trouble in other ways. Both had that same slightly over 70% success rate. But what PTT Post had done was combine the two. Which still did not give you accurate results, but then some brilliant engineer had the following idea (I think comparable in brilliant creativity to Vaswani’s trick that led to the current LLM-boom): the actual combinations of different parts of addresses were a far more limited target space than writing in general. Not just that some cities and street names simply do not exist, but especially that combinations do not exist. It went like this: suppose your top matches for the city are ‘Amsterdam’ and ‘Amstelveen’ (all checked as placenames that have to exist somewhere). And your top guesses for postal code characters were ‘1/2/8/4/A/B’ and ‘7/5/3/8/H/B’ and your top guesses for street name were ‘Westerstraat’ and ‘Weesperstraat’ (Note: all these except the city names are not real examples), then together these low-reliability guesses could turn into a single high-reliability result. E.g. when there wasn’t a ‘Westerstraat’ in ‘Amstelveen’, this combination could be skipped. And if no postal code of any of the top city/street matches would start with a ‘7’, that could be dropped too. Etc. What they did was limit the actual freedom for results so much that the mediocre algorithms of the day could cope.
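The idea of combining several unreliable guesses under a hard "does this combination exist?" constraint can be sketched in a few lines of Python. This is only an illustration of the principle, not a reconstruction of what PTT Post actually built: the directory, the postal codes, the confidence values and the function name are all invented, and multiplying the confidences treats the three recognisers as independent, which is an assumption.

```python
from itertools import product

# Hypothetical directory of (city, postal code, street) triples that exist.
DIRECTORY = {
    ("Amsterdam", "1284AB", "Weesperstraat"),
    ("Amsterdam", "1015MN", "Westerstraat"),
    ("Amstelveen", "1181XX", "Dorpsstraat"),
}


def best_address(city_guesses, code_guesses, street_guesses, directory):
    """Return the most plausible combination that actually exists, or None.

    Each *_guesses argument is a list of (candidate, confidence) pairs.
    Combinations that are not in the directory are discarded outright.
    """
    best, best_score = None, 0.0
    for (city, pc), (code, pk), (street, ps) in product(
        city_guesses, code_guesses, street_guesses
    ):
        if (city, code, street) not in directory:
            continue  # e.g. no 'Westerstraat' in 'Amstelveen'
        score = pc * pk * ps  # assumes independent recognisers
        if score > best_score:
            best, best_score = (city, code, street), score
    return best


# Every field's top guess is wrong or uncertain on its own...
cities = [("Amstelveen", 0.55), ("Amsterdam", 0.45)]
codes = [("7538HB", 0.50), ("1284AB", 0.50)]
streets = [("Westerstraat", 0.60), ("Weesperstraat", 0.40)]

# ...but only one of the eight combinations exists in the directory.
print(best_address(cities, codes, streets, DIRECTORY))
```

In this toy example the winning address is built from the lower-ranked city and street guesses: the constraint, not the recogniser, does most of the work.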

I am thinking about this because math and code are areas that have some of that. Code is not poetry — though some code is elegant and beautiful — and there are a lot of constraints that give the area far fewer degrees of freedom than, say, a normal everyday question you can ask an LLM. Especially with the indirect/recursive use of LLMs, and the absolute checks possible (“does this compile?” “does it produce exactly this result in a test?”), the essential inaccuracy of Generative AI can be effectively constrained (though at a high cost in — invisible — trial and error). Hence, code and math are ‘specially constrained domains’ and the fact that we can do much more in them than in others is not a sign that Generative AI is getting more intelligent, but that it is being applied with massive scaling and recursion in a domain that has all these combinatory restrictions that help weed out undesired results. Just like 30 years ago when trying to read hand-written addresses.
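A toy Python sketch of that trial-and-error loop follows. It is not how Claude Code or any vendor works internally; it only shows how two absolute checks ("does it compile?", "does it pass the test?") let an unreliable generator produce a reliable result, at the price of discarded attempts. The candidate list stands in for model output and, like the function name, is invented.

```python
def first_verified(candidates, func_name, tests):
    """Return (source, attempts) for the first candidate that compiles
    and passes every test, or (None, attempts) if none does."""
    attempts = 0
    for source in candidates:
        attempts += 1
        try:
            namespace = {}
            # Check 1: "does this compile?" (SyntaxError is caught below)
            exec(compile(source, "<candidate>", "exec"), namespace)
            func = namespace[func_name]
            # Check 2: "does it produce exactly this result in a test?"
            if all(func(*args) == expected for args, expected in tests):
                return source, attempts
        except Exception:
            continue  # a failed attempt still cost tokens
    return None, attempts


# Stand-ins for three generated attempts (hypothetical).
CANDIDATES = [
    "def add(a, b) return a + b",         # does not compile
    "def add(a, b):\n    return a - b",   # compiles, fails the test
    "def add(a, b):\n    return a + b",   # compiles and passes
]
TESTS = [((1, 2), 3), ((0, 0), 0)]

source, attempts = first_verified(CANDIDATES, "add", TESTS)
print(attempts)
```

Only the third attempt survives, so the user sees one correct answer while three generations were paid for.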

Methodology & assumptions (by Claude)

Resolution = total cost until the task is done, including retries, backtracking, failed attempts, and context resend across all turns. “Accepted” = user stops because output looks right. “Correct” = verified to actually be right (tested, audited, cross-checked).

No caching. Prompt cache TTL is 5 minutes (1-hour option at 2× price exists). For human-in-the-loop workflows where turns are typically 10–30 minutes apart, cache misses are the norm. All estimates assume full-price input on every turn.

Top model at each period: GPT-4 (early 2023) → GPT-4 Turbo (late 2023) → GPT-4o / Claude 3 Opus at $15/$75 (mid 2024) → o1 at $15/$60 (late 2024) → Opus 4.0 at $15/$75 (early 2025) → Opus 4.5 at $5/$25 (mid 2025, 3× price drop) → Opus 4.6 at $5/$25 (late 2025) → Opus 4.7 at $5/$25 with +35% tokenizer inflation (early 2026).

Thinking tokens are billed as output tokens ($25/MTok for Opus 4.5+). On high effort, thinking can be 30–50K tokens per turn, invisible to the user (redacted on Opus). On medium effort, ~76% fewer thinking tokens with similar quality on most tasks.

Complex multi-file coding assumes a 35K+ line codebase, 15–25 tool-call turns, cumulative context resend without caching, high effort extended thinking. The $55–75 range at early 2026 reflects Opus 4.7 pricing ($5/$25) plus 35% tokenizer inflation on ~5–7M cumulative input tokens.
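To illustrate how cumulative context resend adds up, here is a rough Python sketch. It is not the model Claude used for the estimates above. The per-turn context size, growth and output figures are hypothetical values picked to fall inside the stated ranges, and the 35% tokenizer inflation is applied as a simple multiplier on the total cost, which is an assumption.

```python
def task_cost_usd(turns, base_context, growth_per_turn, output_per_turn,
                  input_rate=5.0, output_rate=25.0, tokenizer_inflation=1.35):
    """Estimate one multi-turn task without prompt caching.

    Every turn resends the base context plus everything accumulated so far,
    so input tokens grow roughly quadratically with the number of turns.
    Rates are USD per million tokens; output includes hidden thinking tokens.
    """
    input_tokens = sum(base_context + i * growth_per_turn for i in range(turns))
    output_tokens = turns * output_per_turn
    usd = (input_tokens * input_rate + output_tokens * output_rate) / 1_000_000
    return input_tokens, output_tokens, usd * tokenizer_inflation


# Hypothetical task: 20 turns, 150K base context, 15K added per turn,
# 42K output tokens per turn (about 40K thinking + 2K visible).
tokens_in, tokens_out, cost = task_cost_usd(20, 150_000, 15_000, 42_000)
print(tokens_in, tokens_out, round(cost, 2))
```

With these made-up inputs the sketch resends about 5.85M input tokens and lands at roughly $68, inside the $55–75 range; other equally plausible parameter choices would land elsewhere.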

The mid-2025 dip is real: the 3× Opus price drop ($15/$75 → $5/$25) temporarily reduced costs even as task complexity grew. By late 2025, harder tasks and higher thinking budgets pushed costs back up. The 35% tokenizer inflation on Opus 4.7 accelerates this.

The reasoning gap between accepted and correct is ~3× because most reasoning faults are invisible to the user. Verification requires independent methods, cross-model checking, or domain expertise — all of which cost additional tokens.

Sources: Anthropic pricing page (platform.claude.com/docs/en/about-claude/pricing), OpenAI pricing, Anthropic engineering blog (April 2026 postmortem on effort settings), claudecodecamp.com session tracking data, Branch8 team cost analysis, Faros.ai developer cost averages, Verdent/Finout/MorphLLM pricing guides, Epoch AI energy analysis, OpenRouter State of AI 2025.

P.S.

I have looked briefly at OpenAI/GPT-5.x/Codex pricing and it seems the numbers are roughly comparable.

Full transparency: the conversation with Claude that was used for the production of this article can be read here.
