How tool use became a substitute for solving hard problems
When OpenAI released o1 in 2024 and called it a "reasoning model," the industry celebrated a breakthrough. Finally, AI that could think step-by-step, solve complex problems, handle graduate-level mathematics.
But look closer at what's actually happening under the hood. When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result. Unlike GPT-3, which at least attempted arithmetic internally (and often failed), o1 explicitly delegates computation to external tools.
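To make the pattern concrete, here's a minimal sketch of that delegation loop in Python. The `run_in_sandbox` helper and the model-emitted snippet are invented for illustration; OpenAI's actual sandbox implementation is not public.

```python
# Sketch of the delegation pattern: the model emits Python source instead
# of computing the answer itself, and a host process executes that source.
# Illustrative only; the real sandbox implementation is not public.

def run_in_sandbox(code: str) -> str:
    """Execute model-emitted code in a bare namespace and return `result`."""
    namespace = {"__builtins__": {}}  # crude isolation, not real security
    exec(code, namespace)
    return str(namespace.get("result"))

# What the model writes instead of multiplying "in its head":
model_emitted_code = "result = 987654321 * 123456789"

print(run_in_sandbox(model_emitted_code))  # 121932631112635269
```

The answer is exact not because the model got better at arithmetic, but because Python did the arithmetic.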
This pattern extends everywhere. The autonomy in agentic AI? Chained tool calls like web searches, API invocations, database queries. The breakthrough isn't in the model's intelligence. It's in the orchestration layer coordinating external systems. Everything from reasoning to agentic AI is just a sophisticated application of code generation. These are not model improvements. They're engineering workarounds for models that stopped improving.
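A toy version of that orchestration layer makes the point clearer. The tool set and the scripted planner below are invented stand-ins; real agentic frameworks are far more elaborate, but the division of labor is the same.

```python
# Toy orchestration loop behind "agentic" behavior: the model only names
# the next tool call; a plain Python loop does all the actual work.
# Tool set and planner are invented stand-ins for illustration.
from __future__ import annotations

TOOLS = {
    "web_search": lambda q: f"<top results for {q!r}>",
    "db_query":   lambda sql: f"<rows for {sql!r}>",
    "run_python": lambda src: f"<stdout of {src!r}>",
}

def plan_next_step(history: list[tuple[str, str]]) -> tuple[str, str] | None:
    """Stand-in for the model: return (tool_name, argument), or None when done."""
    script = [("web_search", "GPT-5 benchmark results"),
              ("run_python", "result = sum(range(10))")]
    return script[len(history)] if len(history) < len(script) else None

history: list[tuple[str, str]] = []
while (step := plan_next_step(history)) is not None:
    tool, arg = step
    observation = TOOLS[tool](arg)  # the "autonomy" lives in this dispatch
    history.append((tool, observation))

print(history)
```

Strip away the planner and nothing intelligent remains: the loop, the dispatch table, and the tools are ordinary software.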
GPT-5: The Emperor's New Reasoning
August 2025 should have been a triumph. OpenAI promised "PhD-level intelligence in everyone's pocket." What they delivered barely moved the needle on code generation, the one capability that everything else depends on. And that's the bottleneck: code generation is the linchpin. Better code → better reasoning (via execution) → better agents → higher productivity → trillion-dollar markets. When the linchpin doesn't move, the entire chain stalls.
The disappointment was palpable among developers. Companies building AI coding tools on OpenAI's models (Cursor, GitHub Copilot, Replit) had bet billions on exponential improvement with each release. GPT-5 broke that pattern. This wasn't supposed to happen. From GPT-3's fumbling arithmetic to GPT-4's coherent code generation, progress seemed unstoppable, and an entire industry had emerged betting on continued gains. But over the past year, that progress visibly stalled.
From Research Lab to App Store
Meanwhile, OpenAI is leaning more on applications than on model research. Watch its trajectory over just the past few months, and the pattern becomes undeniable:
October 6, 2025: ChatGPT Apps Launch
Third-party applications running directly inside ChatGPT. Book flights through Expedia, design graphics in Canva, browse real estate on Zillow without leaving the chat. The Apps SDK gives developers access to 800 million users. This is OpenAI becoming an app store.
October 21, 2025: Atlas Browser Release
An AI-powered web browser attacking Chrome's dominance. Browser memory, agent mode, integrated AI assistance throughout. This is OpenAI becoming a consumer products company.
They are slowly sliding from research to applications:
- Reasoning models (adjacent to core research)
- Agentic AI with workflow builders (further from core research)
- ChatGPT Apps (pure distribution play)
- Atlas browser (entrenching ChatGPT in the browser)
Each step moves further from "how do we build better models?" toward "how do we monetize the models we have?"
Why Did OpenAI Pivot?
Theory 1: They Hit a Wall and Won't Admit It
Scaling has stopped working. Despite billions in compute and the world's best researchers, qualitative improvements are disappearing. The models aren't getting fundamentally smarter: they're just getting better at coordinating external tools.
Rather than announce "we don't know how to make better models anymore," they pivot to monetization. ChatGPT Apps generate revenue without requiring research breakthroughs. A browser creates user lock-in without needing GPT-6. Applications buy time while they figure out what comes next. This is the pessimistic interpretation: stagnation disguised as strategy.
Theory 2: Applications Are Simply More Profitable
Training frontier models costs billions and takes years. Building apps on existing models is cheap and fast. The margins are better. The risk is lower. The path to revenue is clearer.
Maybe OpenAI rationally calculated that more money can be made from applications with less effort, and shifted resources accordingly. Why spend $5 billion training GPT-6 when you can build a browser in six months? This is the cynical interpretation: profit over progress.
Both might be partially true. Either way, the result is identical: the leading AI lab is investing less in fundamental model improvement precisely when the ecosystem needs it most.
The Architectural Problem Nobody Wants to Face
Tool orchestration is impressive engineering. Coordinating web searches, code execution, database queries, and API calls requires sophisticated software architecture. Agentic frameworks that manage complex workflows have genuine practical value. But none of this addresses why models need tools in the first place.
Earlier models like GPT-3 struggled with token fragmentation: splitting "strawberry" into pieces like "straw" and "berry," fragments whose meanings differ from the word itself. Modern tokenizers mitigate this, but the broader architectural issue remains: LLMs still lack true semantic understanding. These semantic gaps are particularly damaging for code generation, where precision is critical. When models hallucinate facts or lose coherence over long contexts, adding web search doesn't solve the root cause. Fixed-size embeddings compress meaning lossily. Attention windows create hard boundaries on context. These are architectural constraints, not engineering problems.
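You can see the fragmentation directly. The snippet below assumes the `tiktoken` package is installed; exact splits vary by tokenizer, so treat the output as illustrative rather than a claim about any specific model.

```python
# Inspect how a BPE tokenizer fragments words (pip install tiktoken).
# Splits depend on the vocabulary, so the output is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer

for word in ["strawberry", " strawberry", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
    print(f"{word!r} -> {pieces}")

# The model never sees the word as one unit of meaning; it sees whatever
# fragments the vocabulary happens to contain.
```

Whatever the exact splits, the model's atomic unit is a vocabulary fragment, not a word, and everything downstream inherits that choice.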
It's like building a skyscraper on a foundation designed for a three-story building. You can add reinforcements, redistribute weight, install sophisticated support systems. But eventually, you need a different foundation. No amount of clever engineering around the foundation's limits will let you build taller.
What Actually Needs to Happen?
Path 1: Keep Optimizing the Plumbing
Continue the current trajectory: slightly larger models, more sophisticated tool orchestration, deeper integration into application platforms. Launch browsers and app stores. Build better agentic frameworks. Improve the engineering around architectural limitations.
This path offers predictable short-term revenue, and not only in coding. Many other domains have yet to catch up to the capability level of today's AI coding tools. Because those tools were created by developers for developers, the builders understood the problem well enough to solve it for themselves; similar catch-up progress will happen in other domains, and the VC money train will keep running for a while. But the $3 trillion GDP projection from a16z depends on a doubling of productivity, not the roughly 20% improvement at which AI coding tools have stalled. Closing that gap requires admitting that the fundamental approach has hit its ceiling.
Path 2: Acknowledge We Need Different Foundations
Admit that scaling has hit limits. Invest in architectural innovations that address root causes rather than symptoms. This means:
- Graph-based architectures that preserve structural relationships instead of fragmenting them through tokenization (see the sketch after this list).
- Sparse attention mechanisms that maintain longer contexts more efficiently.
- Neuromorphic approaches inspired by biological neural organization.
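As a toy illustration of the first idea, here's what "preserving structure" could mean in practice. The `WordGraph` class and its relation labels are invented for this sketch, not a description of any real architecture.

```python
# Toy sketch: represent a sentence as a graph of whole words with typed
# edges, instead of a flat sequence of subword tokens. The node/edge
# scheme is invented for illustration, not a real architecture.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class WordGraph:
    nodes: list[str] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)  # (head, relation, dependent)

    def add_word(self, word: str) -> int:
        self.nodes.append(word)  # "strawberry" stays one node, never "straw" + "berry"
        return len(self.nodes) - 1

    def relate(self, head: int, relation: str, dep: int) -> None:
        self.edges.append((head, relation, dep))

g = WordGraph()
ate, child, berry = (g.add_word(w) for w in ["ate", "child", "strawberry"])
g.relate(ate, "subject", child)   # relationships are explicit edges,
g.relate(ate, "object", berry)    # not implied by token adjacency

print(g.nodes)  # ['ate', 'child', 'strawberry']
print(g.edges)  # [(0, 'subject', 1), (0, 'object', 2)]
```

The point is not this particular data structure but the invariant: meaning-bearing units and their relationships survive the encoding instead of being flattened into a token stream.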
The solution is to build architectures that preserve information rather than compress it lossily.
Current models are, as AI researcher Andrej Karpathy puts it, "lossy compression of the internet." Real progress requires moving toward lossless representation: preserving structure, maintaining relationships, keeping semantic hierarchies intact.
This path is expensive, uncertain, and slow. It requires admitting current approaches have failed. It means years of research with no guarantee of success. But it's the only path that actually addresses the problem rather than working around it.
Summary
Right now, the AI coding tool market is scaling explosively:
- Cursor: $500M ARR in 15 months, $10B valuation
- GitHub Copilot: millions of users, hundreds of millions in revenue
- Windsurf: acquired for $2.4B
- Dozens of startups raising nine-figure rounds
All betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards. The $3 trillion GDP projection evaporates. The unicorn valuations can't be justified. The productivity revolution gets postponed indefinitely.
Conversely, whoever solves the architectural problem wins everything. Even modest fundamental improvements would cascade through the entire ecosystem:
- Better code generation → better reasoning (via execution)
- Better reasoning → more capable agents
- More capable agents → actual productivity doubling
- Actual productivity doubling → the $3T market becomes real
The value creation would be astronomical. The question now is whether any lab will prioritize the hard path of fixing the foundation over the easy path of building apps on stagnant foundations. The answer will determine whether the $3 trillion productivity revolution is real or fantasy.
(What would an architectural innovation look like? In my next article, I will explore graph transformers that preserve semantic meaning at the word level rather than fragmenting it through tokenization. Regardless of whether that approach works, we need different foundations, not better engineering on top of the ones that have already failed.)