How tool use became a substitute for solving hard problems
When OpenAI released o1 in 2024 and called it a "reasoning model," the industry celebrated a breakthrough. Finally, AI that could think step-by-step, solve complex problems, handle graduate-level mathematics.
But look closer at what's actually happening under the hood. When you ask o1 to multiply two large numbers, it doesn't calculate. It generates Python code, executes it in a sandbox, and returns the result. Unlike GPT-3, which at least attempted arithmetic internally (and often failed), o1 explicitly delegates computation to external tools.
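To make the pattern concrete, here's a minimal sketch of that delegation loop in Python. The `run_in_sandbox` helper and the model-emitted snippet are invented for illustration; OpenAI's actual sandbox implementation is not public.

```python
# Sketch of the delegation pattern: the model emits Python source instead
# of computing the answer itself, and a host process executes that source.
# Illustrative only; the real sandbox implementation is not public.

def run_in_sandbox(code: str) -> str:
    """Execute model-emitted code in a bare namespace and return `result`."""
    namespace = {"__builtins__": {}}  # crude isolation, not real security
    exec(code, namespace)
    return str(namespace.get("result"))

# What the model writes instead of multiplying "in its head":
model_emitted_code = "result = 987654321 * 123456789"

print(run_in_sandbox(model_emitted_code))  # 121932631112635269
```

The answer is exact not because the model got better at arithmetic, but because Python did the arithmetic.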
This pattern extends everywhere. The autonomy in agentic AI? Chained tool calls like web searches, API invocations, database queries. The breakthrough isn't in the model's intelligence. It's in the orchestration layer coordinating external systems. Everything from reasoning to agentic AI is just a sophisticated application of code generation. These are not model improvements. They're engineering workarounds for models that stopped improving.
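A toy version of that orchestration layer makes the point clearer. The tool set and the scripted planner below are invented stand-ins; real agentic frameworks are far more elaborate, but the division of labor is the same.

```python
# Toy orchestration loop behind "agentic" behavior: the model only names
# the next tool call; a plain Python loop does all the actual work.
# Tool set and planner are invented stand-ins for illustration.
from __future__ import annotations

TOOLS = {
    "web_search": lambda q: f"<top results for {q!r}>",
    "db_query":   lambda sql: f"<rows for {sql!r}>",
    "run_python": lambda src: f"<stdout of {src!r}>",
}

def plan_next_step(history: list[tuple[str, str]]) -> tuple[str, str] | None:
    """Stand-in for the model: return (tool_name, argument), or None when done."""
    script = [("web_search", "GPT-5 benchmark results"),
              ("run_python", "result = sum(range(10))")]
    return script[len(history)] if len(history) < len(script) else None

history: list[tuple[str, str]] = []
while (step := plan_next_step(history)) is not None:
    tool, arg = step
    observation = TOOLS[tool](arg)  # the "autonomy" lives in this dispatch
    history.append((tool, observation))

print(history)
```

Strip away the planner and nothing intelligent remains: the loop, the dispatch table, and the tools are ordinary software.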
GPT-5: The Emperor's New Reasoning
August 2025 should have been a triumph. OpenAI promised "PhD-level intelligence in everyone's pocket." What they delivered barely moved the needle on code generation, the one capability that everything else depends on. And that's the bottleneck: code generation is the linchpin. Better code → better reasoning (via execution) → better agents → higher productivity → trillion-dollar markets. When the linchpin doesn't move, the entire chain stalls.
The disappointment was palpable among developers. Companies building AI coding tools on OpenAI's models (Cursor, GitHub Copilot, Replit) had bet billions on exponential improvement with each release. GPT-5 broke that pattern. This wasn't supposed to happen. From GPT-3's fumbling arithmetic to GPT-4's coherent code generation, progress seemed unstoppable, and an entire industry had emerged betting on continued gains. But over the past year, that progress visibly stalled.
From Research Lab to App Store
Meanwhile, OpenAI is leaning more on applications than on model research. Watch its trajectory over just the past few months, and the pattern becomes undeniable:
October 6, 2025: ChatGPT Apps Launch
Third-party applications running directly inside ChatGPT. Book flights through Expedia, design graphics in Canva, browse real estate on Zillow without leaving the chat. The Apps SDK gives developers access to 800 million users. This is OpenAI becoming an app store.
October 21, 2025: Atlas Browser Release
An AI-powered web browser attacking Chrome's dominance. Browser memory, agent mode, integrated AI assistance throughout. This is OpenAI becoming a consumer products company.
They are slowly sliding from research to applications:
- Reasoning models (adjacent to core research)
- Agentic AI with workflow builders (further from core research)
- ChatGPT Apps (pure distribution play)
- Atlas browser (entrenching ChatGPT in the browser)
Each step moves further from "how do we build better models?" toward "how do we monetize the models we have?"
Why Did OpenAI Pivot?
Theory 1: They Hit a Wall and Won't Admit It
Scaling has stopped working. Despite billions in compute and the world's best researchers, qualitative improvements are disappearing. The models aren't getting fundamentally smarter: they're just getting better at coordinating external tools.
Rather than announce "we don't know how to make better models anymore," they pivot to monetization. ChatGPT Apps generate revenue without requiring research breakthroughs. A browser creates user lock-in without needing GPT-6. Applications buy time while they figure out what comes next. This is the pessimistic interpretation: stagnation disguised as strategy.
Theory 2: Applications Are Simply More Profitable
Training frontier models costs billions and takes years. Building apps on existing models is cheap and fast. The margins are better. The risk is lower. The path to revenue is clearer.
Maybe OpenAI rationally calculated that more money can be made from applications with less effort, and shifted resources accordingly. Why spend $5 billion training GPT-6 when you can build a browser in six months? This is the cynical interpretation: profit over progress.
Both might be partially true. Either way, the result is identical: the leading AI lab is investing less in fundamental model improvement precisely when the ecosystem needs it most.
The Architectural Problem Nobody Wants to Face
Tool orchestration is impressive engineering. Coordinating web searches, code execution, database queries, and API calls requires sophisticated software architecture. Agentic frameworks that manage complex workflows have genuine practical value. But none of this addresses why models need tools in the first place.
Earlier models like GPT-3 struggled with token fragmentation: splitting "strawberry" into pieces like "straw" and "berry," fragments whose meanings differ from the word itself. Modern tokenizers mitigate this, but the broader architectural issue remains: LLMs still lack true semantic understanding. These semantic gaps are particularly damaging for code generation, where precision is critical. When models hallucinate facts or lose coherence over long contexts, adding web search doesn't solve the root cause. Fixed-size embeddings compress meaning lossily. Attention windows create hard boundaries on context. These are architectural constraints, not engineering problems.
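You can see the fragmentation directly. The snippet below assumes the `tiktoken` package is installed; exact splits vary by tokenizer, so treat the output as illustrative rather than a claim about any specific model.

```python
# Inspect how a BPE tokenizer fragments words (pip install tiktoken).
# Splits depend on the vocabulary, so the output is illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era tokenizer

for word in ["strawberry", " strawberry", "antidisestablishmentarianism"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode_single_token_bytes(t).decode("utf-8") for t in token_ids]
    print(f"{word!r} -> {pieces}")

# The model never sees the word as one unit of meaning; it sees whatever
# fragments the vocabulary happens to contain.
```

Whatever the exact splits, the model's atomic unit is a vocabulary fragment, not a word, and everything downstream inherits that choice.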
It's like building a skyscraper on a foundation designed for a three-story building. You can add reinforcements, redistribute weight, install sophisticated support systems. But eventually, you need a different foundation. No amount of clever engineering around the foundation's limits will let you build taller.
What Actually Needs to Happen?
Path 1: Keep Optimizing the Plumbing
Continue the current trajectory: slightly larger models, more sophisticated tool orchestration, deeper integration into application platforms. Launch browsers and app stores. Build better agentic frameworks. Improve the engineering around architectural limitations.
This path offers predictable short-term revenue, and not only in coding. Many other domains have yet to catch up to the capability level of today's AI coding tools. Because those tools were created by developers for developers, the builders understood the problem well enough to solve it for themselves; similar catch-up progress will happen in other domains, and the VC money train will keep running for a while. But the $3 trillion GDP projection from a16z depends on a doubling of productivity, not the roughly 20% improvement at which AI coding tools have stalled. Closing that gap requires admitting that the fundamental approach has hit its ceiling.
Path 2: Acknowledge We Need Different Foundations
Admit that scaling has hit limits. Invest in architectural innovations that address root causes rather than symptoms. This means:
- Graph-based architectures that preserve structural relationships instead of fragmenting them through tokenization (see the sketch after this list).
- Sparse attention mechanisms that maintain longer contexts more efficiently.
- Neuromorphic approaches inspired by biological neural organization.
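As a toy illustration of the first idea, here's what "preserving structure" could mean in practice. The `WordGraph` class and its relation labels are invented for this sketch, not a description of any real architecture.

```python
# Toy sketch: represent a sentence as a graph of whole words with typed
# edges, instead of a flat sequence of subword tokens. The node/edge
# scheme is invented for illustration, not a real architecture.
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class WordGraph:
    nodes: list[str] = field(default_factory=list)
    edges: list[tuple[int, str, int]] = field(default_factory=list)  # (head, relation, dependent)

    def add_word(self, word: str) -> int:
        self.nodes.append(word)  # "strawberry" stays one node, never "straw" + "berry"
        return len(self.nodes) - 1

    def relate(self, head: int, relation: str, dep: int) -> None:
        self.edges.append((head, relation, dep))

g = WordGraph()
ate, child, berry = (g.add_word(w) for w in ["ate", "child", "strawberry"])
g.relate(ate, "subject", child)   # relationships are explicit edges,
g.relate(ate, "object", berry)    # not implied by token adjacency

print(g.nodes)  # ['ate', 'child', 'strawberry']
print(g.edges)  # [(0, 'subject', 1), (0, 'object', 2)]
```

The point is not this particular data structure but the invariant: meaning-bearing units and their relationships survive the encoding instead of being flattened into a token stream.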
The solution is to build architectures that preserve information rather than compress it lossily.
Current models are, as AI researcher Andrej Karpathy puts it, "lossy compression of the internet." Real progress requires moving toward lossless representation: preserving structure, maintaining relationships, keeping semantic hierarchies intact.
This path is expensive, uncertain, and slow. It requires admitting current approaches have failed. It means years of research with no guarantee of success. But it's the only path that actually addresses the problem rather than working around it.
Summary
Right now, the AI coding tool market is scaling explosively:
- Cursor: $500M ARR in 15 months, $10B valuation
- GitHub Copilot: millions of users, hundreds of millions in revenue
- Windsurf: acquired for $2.4B
- Dozens of startups raising nine-figure rounds
All betting on the same assumption: models will keep getting better at generating code. If that assumption is wrong, the entire market becomes a house of cards. The $3 trillion GDP projection evaporates. The unicorn valuations can't be justified. The productivity revolution gets postponed indefinitely.
Conversely, whoever solves the architectural problem wins everything. Even modest fundamental improvements would cascade through the entire ecosystem:
- Better code generation → better reasoning (via execution)
- Better reasoning → more capable agents
- More capable agents → actual productivity doubling
- Actual productivity doubling → the $3T market becomes real
The value creation would be astronomical. The question now is whether any lab will prioritize the hard path of fixing the foundation over the easy path of building apps on stagnant foundations. The answer will determine whether the $3 trillion productivity revolution is real or fantasy.
(What would an architectural innovation look like? In my next article, I will explore graph transformers that preserve semantic meaning at the word level rather than fragmenting it through tokenization. Regardless of whether that approach works, we need different foundations, not better engineering on top of the ones that have already failed.)