Developers are choosing older AI models, and 16B tokens of data explain why

原始链接: https://www.augmentcode.com/blog/developers-are-choosing-older-ai-models-and-16b-tokens-of-data-explain-why

## Model Adoption Is Fragmenting: Early Signs of AI Specialization

Augment Code's analysis of millions of live coding interactions shows that developers are moving away from simply chasing the newest AI model and toward strategically *matching* models to specific tasks. Sonnet 4.5 initially dominated, but its share fell to 52% by early October 2025, while Sonnet 4.0 rose to 37% and GPT-5 held steady at around 10–12%. This is not a typical version upgrade; developers are actively choosing between versions.

Clear behavioral differences are emerging. Sonnet 4.5 favors deeper reasoning and fewer tool calls, producing 37% more output than Sonnet 4.0 at slightly higher latency. Sonnet 4.0 leans toward frequent, fast actions, while GPT-5 balances reasoning with natural language.

This affects system load: Sonnet 4.5's extensive context use drives much higher cache-read volume, shifting effort toward managing and reusing information. The models are accordingly finding niches: Sonnet 4.5 excels at complex tasks such as refactoring, Sonnet 4.0 at automation, and GPT-5 at code explanation and documentation.

The trend points toward "model alloys", ensembles tailored to specific workflows, a diversification that mirrors what happened with database technology. The future is not about a single "best" model but about choosing the right *cognitive style* for each job.

A recent Hacker News discussion highlights a trend of developers choosing to work with older AI models such as Sonnet 4.5. While newer "frontier" models such as GPT-5 and Gemini 2.5 Pro are still in use, developers find the older versions stable and potentially more cost-effective.

One commenter noted that he regularly uses multiple models, including older ones, for development tasks. Another pointed out that overall usage data would be needed to draw accurate conclusions about the shift. The discussion suggests that for specific development use cases, the latest technology is not always necessary or superior.

The post links to an article (augmentcode.com) that explores the data behind this preference, though the details of that data are not included in the HN summary.

## Original Article

At Augment Code, we run multiple frontier models side by side in production. This gives us a unique vantage point into how different models behave in real coding workflows. Usage patterns suggest developers are no longer just chasing the newest model; they are matching models to specific task profiles.

This post shares data from millions of live interactions and discusses what it may reveal about model adoption, behavioral differences, and system-level trade-offs.

## Model Adoption Is Fragmenting

Over the first week of October 2025, Sonnet 4.5’s share of total requests declined from 66% → 52%, while Sonnet 4.0 rose from 23% → 37%. GPT-5 usage stayed steady at about 10–12%.

| Date       | Sonnet 4.5 | Sonnet 4.0 | GPT-5  |
|------------|------------|------------|--------|
| 2025-09-30 | 66.18%     | 23.26%     | 10.57% |
| 2025-10-01 | 59.39%     | 30.28%     | 10.33% |
| 2025-10-02 | 55.77%     | 33.54%     | 10.69% |
| 2025-10-03 | 54.16%     | 35.36%     | 10.48% |
| 2025-10-04 | 56.66%     | 31.70%     | 11.64% |
| 2025-10-05 | 56.54%     | 31.02%     | 12.44% |
| 2025-10-06 | 52.29%     | 37.38%     | 10.33% |

At first glance this could look like short-term churn after a new release. But if developers were simply upgrading, Sonnet 4.5’s share would continue rising while 4.0’s declined. The opposite happened. Both models retained significant usage, suggesting that teams are choosing models based on the kind of task, not on version number. In other words, upgrades are beginning to behave like alternatives rather than successors. That shift marks the early stages of specialization in production environments.
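The share shift can be read straight off the table above. A minimal sketch, using only the first and last rows of published percentages:

```python
# Share of total requests on the first and last day of the observed window
# (percentages taken from the table above).
shares = {
    "2025-09-30": {"sonnet-4.5": 66.18, "sonnet-4.0": 23.26, "gpt-5": 10.57},
    "2025-10-06": {"sonnet-4.5": 52.29, "sonnet-4.0": 37.38, "gpt-5": 10.33},
}

def share_delta(model: str) -> float:
    """Percentage-point change in request share over the week."""
    return round(shares["2025-10-06"][model] - shares["2025-09-30"][model], 2)

print(share_delta("sonnet-4.5"))  # -13.89: Sonnet 4.5 lost ~14 points
print(share_delta("sonnet-4.0"))  # 14.12: Sonnet 4.0 gained almost all of them
print(share_delta("gpt-5"))       # -0.24: GPT-5 held roughly steady
```

The near-mirror movement between 4.5 and 4.0, with GPT-5 flat, is what distinguishes this from ordinary upgrade churn.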

## Diverging Behaviors: Reasoning Depth vs. Action Frequency

Despite producing larger total outputs, Sonnet 4.5 makes fewer tool calls per user message than 4.0.

| Model      | Avg Tool Calls / User Message |
|------------|-------------------------------|
| Sonnet 4.5 | 12.33                         |
| Sonnet 4.0 | 15.65                         |
| GPT-5      | 11.58                         |

Higher verbosity combined with fewer actions suggests that Sonnet 4.5 performs more internal reasoning before deciding to act. By contrast, 4.0 issues more frequent tool calls, favoring quick task execution over extended deliberation. GPT-5 falls close to 4.5 in call frequency but tends to favor natural-language reasoning over tool use.
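As an illustration, a metric like average tool calls per user message can be derived from interaction logs. The event schema below is a hypothetical placeholder, not Augment Code's actual telemetry format:

```python
# Sketch: averaging tool calls per user message across logged interactions.
# Each event aggregates one interaction; counts here are made-up examples.
from collections import defaultdict

events = [
    {"model": "sonnet-4.5", "user_messages": 2, "tool_calls": 25},
    {"model": "sonnet-4.0", "user_messages": 2, "tool_calls": 31},
]

def avg_tool_calls(events):
    totals = defaultdict(lambda: [0, 0])  # model -> [tool_calls, user_messages]
    for e in events:
        totals[e["model"]][0] += e["tool_calls"]
        totals[e["model"]][1] += e["user_messages"]
    return {model: round(calls / msgs, 2) for model, (calls, msgs) in totals.items()}

print(avg_tool_calls(events))  # {'sonnet-4.5': 12.5, 'sonnet-4.0': 15.5}
```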

We are monitoring whether this behavioral difference aligns with prompt success rates. If higher internal reasoning correlates with improved completion, it would confirm that Sonnet 4.5’s “think more, act less” tendency leads to better outcomes.

## Throughput and Token Economy

Sonnet 4.5 generates more text and tool output per message, about 7.5k tokens on average compared with 5.5k for 4.0. That is a 37% increase in total output per interaction.

| Model      | Text Output (tokens) | Tool Output (tokens) | Total Output (tokens) |
|------------|----------------------|----------------------|-----------------------|
| Sonnet 4.5 | 2,497                | 5,018                | 7,517                 |
| Sonnet 4.0 | 1,168                | 3,948                | 5,481                 |
| GPT-5      | 3,740                | 1,729                | 5,469                 |

Richer reasoning leads to more contextual responses but introduces additional latency. We do not yet have per-request tokens-per-second data, but qualitative traces suggest throughput is slightly lower, consistent with the extra compute required for deeper reasoning chains.

## Compute Footprint and Cache Utilization

To understand how reasoning depth affects system load, we sampled a small subset of production data covering several billion tokens and corresponding cache operations.

Sonnet 4.5 still accounts for the majority of processed volume, with nearly 1.8× the cache reads of Sonnet 4.0 (240 B vs. 135 B). GPT-5 shows a much lighter footprint overall.

| Model       | Input Tokens | Text Output | Tool Output | Total Output | Cache Reads |
|-------------|--------------|-------------|-------------|--------------|-------------|
| Sonnet 4.5  | 0.25 B       | 0.75 B      | 1.55 B      | 2.30 B       | 240.0 B     |
| Sonnet 4.0  | 0.13 B       | 0.20 B      | 0.72 B      | 0.92 B       | 135.0 B     |
| GPT-5       | 0.16 B       | 0.22 B      | 0.10 B      | 0.32 B       | 28.0 B      |
| Grand Total | 0.54 B       | 1.17 B      | 2.37 B      | 3.54 B       | 403.0 B     |

The higher cache-read volume for Sonnet 4.5 likely comes from heavier use of retrieval-augmented workflows and longer context windows. This suggests a system-level shift: more compute is being spent on managing and reusing context rather than on token generation itself.
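A quick ratio check on the sampled totals makes the system-level point concrete: for every model, cache reads exceed freshly processed tokens (input plus output) by one to two orders of magnitude:

```python
# Sampled compute footprint (billions of tokens), from the table above.
footprint = {
    "sonnet-4.5": {"input": 0.25, "output": 2.30, "cache_reads": 240.0},
    "sonnet-4.0": {"input": 0.13, "output": 0.92, "cache_reads": 135.0},
    "gpt-5":      {"input": 0.16, "output": 0.32, "cache_reads": 28.0},
}

for model, f in footprint.items():
    multiple = f["cache_reads"] / (f["input"] + f["output"])
    print(f"{model}: cache reads are about {multiple:.0f}x the non-cached volume")
```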

## Emergent Specialization: Where Each Model Excels

Even though developers can freely choose models, their behavior reveals clear preferences by task type. Usage data and qualitative feedback show early signs of specialization.

| Model      | Observed Strengths | Typical Workflows |
|------------|--------------------|-------------------|
| Sonnet 4.5 | Long-context reasoning, multi-file understanding, autonomous planning | Refactoring agents, complex debugging, design synthesis |
| Sonnet 4.0 | Deterministic completions, consistent formatting, tool-friendly outputs | API generation, structured edits, rule-based transforms |
| GPT-5      | Explanatory fluency, general reasoning, hybrid coding + documentation | Code walkthroughs, summarization, developer education |

Each model appears to emphasize a different balance between reasoning and execution. Rather than seeking one “best” system, developers are assembling model alloys—ensembles that select the cognitive style best suited to a task.
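At its simplest, a "model alloy" could be a task-type router over the strengths in the table above. The task categories and model identifiers below are illustrative assumptions, not an actual Augment Code API:

```python
# Sketch of a "model alloy": route each task to the model whose observed
# strengths fit it best. Names and categories are hypothetical.
ALLOY = {
    "refactor":  "sonnet-4.5",  # long-context, multi-file reasoning
    "debug":     "sonnet-4.5",
    "codegen":   "sonnet-4.0",  # deterministic, tool-friendly output
    "transform": "sonnet-4.0",
    "explain":   "gpt-5",       # explanatory fluency, documentation
    "document":  "gpt-5",
}

def route(task_type: str, default: str = "sonnet-4.5") -> str:
    """Pick a model for a task; fall back to a general-purpose default."""
    return ALLOY.get(task_type, default)

print(route("refactor"))  # sonnet-4.5
print(route("document"))  # gpt-5
```

In practice the routing signal would come from classifying the user's request, but the design point is the same: selection by cognitive style rather than by version number.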

Community discussions of Sonnet 4.5, 4.0, and GPT-5 align closely with the production data:

  • Sonnet 4.5: Users describe it as thoughtful and reliable for multi-file reasoning but occasionally verbose or slower for simple edits. It handles refactors and architectural planning effectively but can over-explain.
  • Sonnet 4.0: Praised for tool integration stability and predictable formatting. It is quick and consistent, ideal for automation or rule-based coding tasks. Teams often select it as the “safe default” model.
  • GPT-5: Recognized for fluency and clarity in explanations. It performs well in hybrid reasoning-plus-writing contexts such as code reviews and documentation but lags in heavy tool execution.

| Theme | Sonnet 4.5 | Sonnet 4.0 | GPT-5 |
|-------|------------|------------|-------|
| Reasoning Depth | ⭐⭐⭐⭐ Deep planning, sometimes overthinks | ⭐⭐ Direct, task-driven | ⭐⭐⭐⭐ Analytical and expressive |
| Latency / Responsiveness | Slower | Fast | Moderate |
| Output Determinism | Medium | High | Medium |
| Code Generation Quality | Excellent for multi-file | Strong for single-file | Great for hybrid code + docs |
| Ideal Use Cases | Refactors, architecture | Automation, structured tasks | Walkthroughs, learning, synthesis |

## Takeaways: The Early Signals of Behavioral Specialization

Three main insights emerge from this dataset:

  1. Adoption is diversifying, not consolidating. Newer models are not always better for every workflow.
  2. Behavioral divergence is measurable. Sonnet 4.5 reasons more deeply, while 4.0 acts more frequently.
  3. System costs are shifting. Reasoning intensity and cache utilization are now central performance metrics.

The story here is not about one model surpassing others but about each developing its own niche. As capabilities expand, behaviors diverge. The industry may be entering a stage where functional specialization replaces the race for a single “best” model—much like how databases evolved into SQL, NoSQL, and time-series systems optimized for different workloads. The same dynamic is beginning to appear in AI: success depends less on overall strength and more on the right cognitive style for the job.

As reasoning depth increases, these behavioral distinctions could define the next phase of AI tooling. The key question for builders is no longer “Which model is best?” but “Which model best fits this task?”
