# The Inference Economy: Why demand matters more than supply

Original link: https://frontierai.substack.com/p/the-inference-economy-part-ii

## Rising AI costs: more tokens, plateauing prices

Recent trends point to a shift in the AI inference economy. Token costs are stabilizing, but *token consumption* is rising quickly, and not just because of more users: applications are using *more tokens per request* to get higher-quality results. This is driven by the need for richer context, often supplied by using LLMs to preprocess and evaluate data (for example, reranking search results), a practice that is becoming increasingly common.

Rising demand is putting pressure on infrastructure, and cost reductions are not keeping pace. To manage growing bills, application builders should prioritize: using a right-sized model for each task, diversifying model providers, carefully weighing the need for expensive "reasoning" models, and approaching fine-tuning cautiously, since it requires large amounts of high-quality data.

Ultimately, the focus should shift from simply cutting costs to maximizing value. As AI applications mature and demonstrate clear ROI, opportunities for greater pricing power are likely to emerge, especially for applications that deliver superior quality. Despite ongoing efficiency work, expect to keep watching your AI bill closely as token usage continues to climb.

Hacker News discussion: The Inference Economy: Why demand matters more than supply (frontierai.substack.com), 29 points by cgwu, 3 comments.

roxolotl: Off topic, but seeing an image captioned "Source: GPT-5" while only being able to guess at what the image was meant to convey made me realize that image prompts are essentially perfect image captions. They contain the intent and framing of the image.

spprashant: The whole data center buildout reminds me of the chapter in The Hitchhiker's Guide to the Galaxy where they build a giant supercomputer to find the answer without ever considering what the question is.

estimator7292 (in reply): All four books in the trilogy are about finding the question.

Original article

Last week, we wrote about some of the trends in the inference economy: plateauing token costs, using the right model for the right task, managing different modes of compute, and the impacts on pricing. Our perspective in that post was derived from observing what was going on with LLM inference in general in recent months. Of course, we’re always thinking about how that affects our decision-making as application builders, and we touched on that briefly but mostly focused on what was actually happening to token costs.

Looking back on that post, it occurred to us that the underlying assumption was that the dynamics of token consumption — from the perspective of application builders — are changing as well. Those changes are likely what’s driving the data center buildouts in the first place: they are the demand side of the equation, and they’re forcing supply to keep up. As a result, today’s post is about what’s driving token demand and how you should think about managing your demand as a token consumer.

What do we mean when we say that there’s a change in the demand side of the equation? Simply put, we’re all using more tokens to process more data. The increased token volume is partially driven by increased usage, but that’s far from the whole story. Yes, of course AI applications have more usage, but what’s much more interesting — and challenging — is that we’re seeing a trend in increased token consumption per-request, not just an increase in the aggregate number of requests processed. The driver behind that is ultimately quality.

As we’ve said many times on this blog, getting the behavior you want out of LLMs is all about providing the right information at the right time to a model. If you have the wrong context, you’re going to get poor results. That means the million-dollar question is how you find the right information.

Search was the first solution that we all turned to — first, vector search, then a return to more traditional text search mechanisms. Very quickly, however, we all turned to having an LLM read an input and evaluate how relevant it was to the problem we were solving (“reranking”). Presciently, Vik Singh, now a CVP at Microsoft, said this to us over 2 years ago: “If LLMs were fast enough… why not use the LLM to do a much more advanced similarity search… I think that’s what people actually want.”

The LLM-preprocesses-data paradigm is pervasive in our systems today. At RunLLM, we pre-read data at ingestion time to organize it properly, we read the results of text + vector search to analyze their relevance to a question, we analyze logs and dashboards in real-time with LLMs, and so on. Each one of those tasks is done by a model call in isolation to understand whether that data should be used for future decision-making. Without the involvement of LLMs in these stages, it would be almost impossible for us to provide high-quality results to our customers. We have a joke internally at RunLLM: The solution to every problem in computer science is another layer of abstraction, and the solution to every problem in AI is another layer of LLM calls.
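
To make the pattern concrete, here is a minimal sketch of the kind of LLM-as-reranker call described above, written against the OpenAI Python client. The prompt, the YES/NO protocol, and the choice of GPT-4.1 Mini are illustrative assumptions, not RunLLM's actual pipeline; the point is simply that every retrieved chunk gets its own model call.

```python
# Minimal sketch: one LLM call per retrieved chunk to judge relevance ("reranking").
# Model choice, prompt, and YES/NO protocol are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_relevant(question: str, chunk: str, model: str = "gpt-4.1-mini") -> bool:
    """Ask a small model whether a retrieved chunk helps answer the question."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system",
             "content": "Answer strictly YES or NO: is the passage relevant to the question?"},
            {"role": "user", "content": f"Question: {question}\n\nPassage: {chunk}"},
        ],
        max_tokens=1,
    )
    content = response.choices[0].message.content or ""
    return content.strip().upper().startswith("Y")

def rerank(question: str, candidates: list[str]) -> list[str]:
    # One model call per candidate: this is exactly where per-request token
    # consumption multiplies, because every retrieved chunk is read by an LLM.
    return [chunk for chunk in candidates if is_relevant(question, chunk)]
```

Note how the token math compounds: a question with 20 candidate chunks means 20 extra LLM reads before the "real" answer is even generated.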

That means that median — and perhaps more importantly, p99 — token consumption (and therefore request cost) is going up very quickly. We’re all solving harder problems, which means we’re throwing more data into LLMs and ultimately consuming dramatically more tokens. In our minds, this is one of the key drivers of increased token demand.
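
To put rough numbers on that (with hypothetical prices, not any provider's actual rates): even with a perfectly flat per-token price, a request whose context grows tenfold costs roughly ten times as much.

```python
# Back-of-the-envelope request cost. Prices are hypothetical, chosen only to
# show how per-request cost scales with context size at a flat per-token rate.
INPUT_PRICE_PER_TOKEN = 2.00 / 1_000_000    # $2 per million input tokens (assumed)
OUTPUT_PRICE_PER_TOKEN = 8.00 / 1_000_000   # $8 per million output tokens (assumed)

def request_cost(input_tokens: int, output_tokens: int) -> float:
    return input_tokens * INPUT_PRICE_PER_TOKEN + output_tokens * OUTPUT_PRICE_PER_TOKEN

print(request_cost(5_000, 500))    # ~$0.014: a lean prompt with a short answer
print(request_cost(50_000, 500))   # ~$0.104: the same request once reranked docs,
                                   # logs, and dashboards are stuffed into context
```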

Luckily for the data center builders, this trend is not going anywhere. We might get more efficient and cheaper inference (though we’re skeptical, as we talked about last week), but as LLMs get more integrated into every application and workflow, per-request token use is only going to go up, not down. As a single data point, we have tons of ideas for how we can throw more LLMs at some of the challenges we face in a single investigation at RunLLM, but we’re primarily limited by cost, latency, or evaluations at the moment.

If you’re going to inevitably use more tokens, it’s worth thinking about how to be as thoughtful as possible about those tokens.

We’re pretty confident token demand is going up, and as we discussed last week, token costs are plateauing. Depending on how long it takes to build and power these new datacenters, that means we should all be thinking about how to manage our token usage, especially as models get better and more expensive. We’ve been experimenting with many of these techniques for a while now at RunLLM, so we thought we’d share some early lessons.

  • Model size is your best friend. Not all models are created equal, and neither are all tasks. Throwing your largest model at every task will probably maximize quality, but it will burn through your budget faster than you can imagine. (We mistakenly spent $63 on a single investigation at RunLLM last month. 😱) There are plenty of things that we do — gating questions, filtering documents, synthesizing logs — that aren’t hard but just require processing data efficiently. For simple tasks, there’s really no reason to use a state-of-the-art model — GPT-4.1 Mini (one of our current favorites) or a smaller open-source equivalent will get the job done just fine. Unfortunately, we don’t have a cut-and-dried rule for when to use which model; it’s more of an art than a science right now, but evaluation frameworks for specific tasks will certainly help guide you in the right direction. (See the first sketch after this list for what task-based routing can look like.)

  • Be flexible with your providers. We’ve long believed that LLM inference is a race to the bottom. If models get better, the main question becomes who can serve you that model as cheaply as possible — especially with open-weight models. As we touched on last week, switching model providers is harder than it used to be because providers are making stronger assumptions, but tools like DSPy make prompt optimization easier than ever, which should alleviate some of that tension. While you might not want to be ready to switch between every model provider on the market (there are a lot!), it’s probably worth your time to be ready to use one of a few different providers — or even to use features like batch mode within individual providers — when you have the opportunity. The biggest issue with this is actually security & compliance: more data subprocessors create more data exposure risks and make your vendor approvals that much harder. But if you’re in an area where this is less of a concern, keeping your options open is a straightforward way to reduce costs. (The routing sketch after this list shows how a configurable endpoint can make switching a configuration change.)

  • Do you need reasoning? Reasoning models use a lot more tokens than regular LLMs, and it’s correspondingly quite difficult to control output costs. It’s worth asking whether and when you need a reasoning model. For daily personal use, we tend to default to ChatGPT 5 Thinking, but we actually are not currently using any reasoning models in production at RunLLM. We’ve had much better luck with breaking problems down into fine-grained steps, using regular Python for orchestration and tool calling, and picking the right model for the right task (see above, and the second sketch after this list). Interestingly, this mirrors some of the reasoning-based task planning workflows we see in our daily usage, but with much stronger guardrails. Of course, we’re not solving problems with the generality of a consumer app like ChatGPT, so we have a narrower scope to work within. But for many workflow/task-automation-oriented applications, you might be able to be much more efficient than you realize.

  • Don’t run straight to fine-tuning/post-training. RL is a hot topic right now. As we mentioned last week, Cursor’s new custom autocomplete model has rekindled the excitement around fine-tuning models for custom tasks. It’s certainly appealing: You take a smaller model, feed it a bunch of data, and voilà — cheaper, faster inference. Unfortunately, reality is not quite that simple. For one thing, RL and post-training are hard. The promise of the recent wave of RL environment startups for post-training is that they will help remove the complexity here, but we’re not convinced that’s a viable solution. The hard part isn’t running an algorithm to update weights — it’s framing the problem in a way that’s actually going to yield the results you want and getting enough data that fits that problem framing. (This is not a new challenge in RL.) What’s lost in the hype around Cursor is that they collected an immense amount of data well-suited to RL from the natural use of the product — every tab-complete suggestion was accepted or rejected, which is a very friendly RL framing. By contrast, we have over 1MM question-answer pairs from RunLLM, but only a tiny fraction of those have actual feedback, and only a fraction of those have actionable feedback — e.g., we tend to get most of our negative feedback on “I don’t know” answers, which are usually the right response because there isn’t sufficient data to answer. If you’re in a domain where you have enough data and the expertise + resources to use post-training, it’s certainly viable from a unit-margins perspective. But it’s not the panacea everyone’s claiming.
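
As a rough illustration of the first two points, here is a minimal sketch of per-task model routing over an OpenAI-compatible API. The task names, model tiers, and the commented-out alternative provider are assumptions for illustration, not a prescription.

```python
# Minimal sketch of per-task model routing over an OpenAI-compatible API.
# Task names, model tiers, and endpoints are illustrative assumptions.
from openai import OpenAI

MODEL_TIERS = {
    # Cheap models for gating/filtering; a frontier model only for final answers.
    "gate":   {"model": "gpt-4.1-mini", "base_url": None},
    "filter": {"model": "gpt-4.1-mini", "base_url": None},
    "answer": {"model": "gpt-4.1",      "base_url": None},
    # Example of routing a task to an open-weight model behind a cheaper,
    # OpenAI-compatible provider (hypothetical endpoint):
    # "summarize": {"model": "llama-3.1-8b-instruct",
    #               "base_url": "https://cheap-provider.example.com/v1"},
}

def run_task(task: str, prompt: str) -> str:
    """Send a prompt to whichever model/provider is configured for this task."""
    tier = MODEL_TIERS[task]
    client = OpenAI(base_url=tier["base_url"]) if tier["base_url"] else OpenAI()
    response = client.chat.completions.create(
        model=tier["model"],
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content or ""

# Usage: simple tasks never touch the expensive model.
# run_task("gate", "Answer YES or NO: is this question about billing? ...")
# run_task("answer", "Given the following evidence, answer the question: ...")
```

The useful property is that both the model choice and the provider endpoint live in one config table, so moving a cheap task to a cheaper provider (or a batch endpoint) is a data change rather than a code change.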
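
And as a sketch of the third point, here is what "fine-grained steps orchestrated in regular Python" can look like in place of a reasoning model. The step breakdown, prompts, and models are hypothetical placeholders; the structure (small models for gating and filtering, a larger non-reasoning model only for the final synthesis) is the idea.

```python
# Minimal sketch: a fixed, fine-grained investigation pipeline orchestrated in
# plain Python instead of a reasoning model. Step names, prompts, and models
# are hypothetical placeholders, not RunLLM's actual workflow.
from openai import OpenAI

client = OpenAI()

def ask(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content or ""

def investigate(question: str, logs: str) -> str:
    # Step 1: a small model classifies the problem (a cheap "gating" call).
    category = ask("gpt-4.1-mini", f"In one word, classify this issue: {question}")
    # Step 2: a small model filters the raw data; plain Python controls how much
    # of it ever reaches a model, which keeps per-step token use bounded.
    evidence = ask("gpt-4.1-mini",
                   f"Extract only the log lines relevant to a {category} issue:\n{logs}")
    # Step 3: only the final synthesis uses a larger (non-reasoning) model.
    return ask("gpt-4.1",
               f"Question: {question}\nEvidence:\n{evidence}\nAnswer concisely.")
```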

What’s interesting about these dynamics at the moment is that we’re all focused on costs but not as focused on our own pricing power. Of course, any business is always going to want to reduce its COGS — the more efficient you can be, the better your business scales. This is probably the right place to be given the fierce competitive dynamics in many AI markets. At the same time, while we’re working on technical solutions for reducing COGS, we’re also mindful of the fact that as AI applications mature and the ROI becomes more obvious, we’re likely going to see a corresponding increase in pricing power. The best applications may well command a significant premium. That won’t apply in every market, of course — only in the ones where quality matters most.

Guesses aside, it’s clear that the economics of AI are changing faster than we would have expected. The sudden slowdown in per-token cost declines has coincided with more mature applications that require more tokens — a double whammy for costs. There will be other solutions (technical and non-technical) that change these dynamics, but for the foreseeable future, we’re all going to be keeping a close eye on our OpenAI bills.
