Last week, we wrote about some of the trends in the inference economy: plateauing token costs, using the right model for the right task, managing different modes of compute, and the impacts on pricing. Our perspective in that post was derived from observing what’s been happening with LLM inference in general over recent months. Of course, we’re always thinking about how that affects our decision-making as application builders, and we touched on that briefly but mostly focused on what was actually happening to token costs.
Looking back on that post, it occurred to us that the underlying assumption was that the dynamics of token consumption, from the perspective of application builders, are changing as well. Those changes are likely what’s fueling the data center buildouts in the first place: they push the demand side of the equation, which forces supply to keep up. As a result, today’s post is about what’s driving token demand and how you should think about managing your demand as a token consumer.
What do we mean when we say that there’s a change in the demand side of the equation? Simply put, we’re all using more tokens to process more data. The increased token volume is partially driven by increased usage, but that’s far from the whole story. Yes, of course AI applications have more usage, but what’s much more interesting (and challenging) is that we’re seeing a trend toward increased token consumption per request, not just an increase in the aggregate number of requests processed. The driver behind that is ultimately quality.
As we’ve said many times on this blog, getting the behavior you want out of LLMs is all about providing the right information at the right time to a model. If you have the wrong context, you’re going to get poor results. That means the million dollar question is how you find the right information.
Search was the first solution that we all turned to: first vector search, then a return to more traditional text search mechanisms. Very quickly, however, we all moved to having an LLM read an input and evaluate how relevant it was to the problem we were solving (“reranking”). Presciently, Vik Singh, now a CVP at Microsoft, said this to us over two years ago: “If LLMs were fast enough… why not use the LLM to do a much more advanced similarity search… I think that’s what people actually want.”
The LLM-preprocesses-data paradigm is pervasive in our systems today. At RunLLM, we pre-read data at ingestion time to organize it properly, we read the results of text + vector search to analyze their relevance to a question, we analyze logs and dashboards in real time with LLMs, and so on. Each of those tasks is an isolated model call whose job is to decide whether that data should be used for future decision-making. Without LLMs involved in these stages, it would be almost impossible for us to provide high-quality results to our customers. We have a joke internally at RunLLM: the solution to every problem in computer science is another layer of abstraction, and the solution to every problem in AI is another layer of LLM calls.
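To make the pattern concrete, here’s a minimal sketch of the kind of relevance check we’re describing, written against the OpenAI Python client. The model name, prompt, and YES/NO scheme are illustrative assumptions, not a description of our actual pipeline.

```python
# A minimal sketch of an LLM-based relevance check ("reranking"). The model name,
# prompt, and YES/NO scheme are illustrative assumptions, not RunLLM's pipeline.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def is_relevant(question: str, document: str, model: str = "gpt-4.1-mini") -> bool:
    """Ask a small model whether a retrieved document helps answer the question."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with only YES or NO."},
            {
                "role": "user",
                "content": (
                    f"Question: {question}\n\nDocument:\n{document}\n\n"
                    "Does this document contain information useful for answering the question?"
                ),
            },
        ],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")

# Keep only the search results the model judges relevant. `search_results` would
# come from your text + vector search layer.
# relevant_docs = [doc for doc in search_results if is_relevant(user_question, doc)]
```

Run a check like this over every retrieved document and you can see exactly where per-request token counts balloon: a single user question fans out into dozens of small model calls before the main answer is ever drafted.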
That means that median, and perhaps more importantly p99, token consumption (and therefore request cost) is going up very quickly. We’re all solving harder problems, which means we’re throwing more data into LLMs and ultimately consuming dramatically more tokens. In our minds, this is one of the key drivers of increased token demand.
Luckily for the data center builders, this trend is not going anywhere. We might get more efficient and cheaper inference (though we’re skeptical, as we talked about last week), but as LLMs get more integrated into every application and workflow, per-request token use is only going to go up, not down. As one data point, we have tons of ideas for how we could throw more LLMs at the challenges we face within a single investigation at RunLLM, but we’re primarily limited by cost, latency, or evaluations at the moment.
If you’re going to inevitably use more tokens, it’s worth thinking about how to be as thoughtful as possible about those tokens.
We’re pretty confident token demand is going up, and as we discussed last week, token costs are plateauing. Depending on how long it takes to build and power these new data centers, that means we should all be thinking about how to manage our token usage, especially as models get better and more expensive. We’ve been experimenting with many of these techniques for a while now at RunLLM, so we thought we’d share some early lessons.
Model size is your best friend. Not all models are created equal, and neither are all tasks. Throwing your largest model at every task will probably maximize quality, but it will burn through your budget faster than you can imagine. (We mistakenly spent $63 on a single investigation at RunLLM last month. 😱) There are plenty of things that we do (gating questions, filtering documents, synthesizing logs) that aren’t hard but just require processing data efficiently. For simple tasks, there’s really no reason to use a state-of-the-art model; GPT-4.1 Mini (one of our current favorites) or a smaller open-source equivalent will get the job done just fine. Unfortunately, we don’t have a cut-and-dried rule for when to use which model. It’s more of an art than a science right now, but evaluation frameworks for specific tasks will certainly help guide you in the right direction.
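One way to make the right-model-for-the-right-task idea concrete is a simple routing table, sketched below. The task labels and model names are illustrative assumptions, not a rule we follow.

```python
# A rough sketch of routing tasks to models by difficulty. The task labels and
# model names here are illustrative assumptions, not a fixed rule.
ROUTING_TABLE = {
    "gate_question": "gpt-4.1-mini",     # cheap check: is this question in scope?
    "filter_documents": "gpt-4.1-mini",  # simple relevance filtering on retrieved docs
    "synthesize_logs": "gpt-4.1-mini",   # condense logs/dashboards into short notes
    "draft_answer": "gpt-4.1",           # the user-facing answer gets the larger model
}

def model_for(task: str) -> str:
    """Pick the cheapest model known to handle the task; default to the small one."""
    return ROUTING_TABLE.get(task, "gpt-4.1-mini")
```

The specific table matters less than the principle: reserve the expensive model for the steps where quality is actually user-visible, and lean on per-task evaluations to tell you when a cheaper model stops being good enough.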
Be flexible with your providers. We’ve long believed that LLM inference is a race to the bottom. If models get better, the main question becomes who can give you that model as cheaply as possible, especially with open-weight models. As we touched on last week, switching model providers is harder than it used to be because model providers are making stronger assumptions, but tools like DSPy make prompt optimization easier than ever, which should alleviate some of that tension. While you might not want to be ready to switch between every model provider on the market (there are a lot!), it’s probably worth your time to be ready to use one of a few different providers, or even to use features like batch mode within individual providers, when you have the opportunity. The biggest issue with this is actually security & compliance: more data subprocessors create more data exposure risks and make your vendor approvals that much harder. But if you’re in an area where this is less of a concern, keeping your options open is a solid way to reduce costs.
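One cheap way to keep that optionality, assuming your alternative providers expose OpenAI-compatible endpoints (many hosts of open-weight models do): keep the endpoint and credentials in configuration rather than scattered through your code. The provider names and URLs below are placeholders.

```python
# A minimal sketch of provider flexibility via configuration. Assumes each provider
# exposes an OpenAI-compatible API; names and base URLs are placeholders.
import os
from openai import OpenAI

PROVIDERS = {
    "openai": {"base_url": None, "api_key_env": "OPENAI_API_KEY"},
    "alt_host": {
        "base_url": "https://api.example-provider.com/v1",  # hypothetical endpoint
        "api_key_env": "ALT_PROVIDER_API_KEY",
    },
}

def get_client(provider: str) -> OpenAI:
    """Build a client for the chosen provider; switching becomes a config change."""
    cfg = PROVIDERS[provider]
    kwargs = {"api_key": os.environ[cfg["api_key_env"]]}
    if cfg["base_url"]:
        kwargs["base_url"] = cfg["base_url"]
    return OpenAI(**kwargs)
```

This doesn’t solve prompt portability (different models still behave differently), but it keeps the mechanical cost of switching or falling back close to zero.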
Do you need reasoning? Reasoning models use a lot more tokens than regular LLMs, and it’s correspondingly quite difficult to control the output costs. It’s worth asking whether and when you need a reasoning model. For daily personal use, we tend to default to ChatGPT 5 Thinking, but we’re actually not using any reasoning models in production at RunLLM. We’ve had much better luck with breaking problems down into fine-grained steps, using regular Python for orchestration and tool calling, and picking the right model for the right task (see above). Interestingly, this mirrors some of the reasoning-based task planning workflows we see in our daily usage, but with much stronger guardrails. Of course, we’re not solving problems with the generality of a consumer app like ChatGPT, so we have a narrower scope and can afford those guardrails. But for many workflow- and task-automation-oriented applications, you might be able to be much more efficient than you realize.
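As a rough illustration of what breaking a problem into fine-grained steps looks like, here’s a stubbed-out sketch. The step functions are hypothetical stand-ins rather than our actual pipeline; in practice, each one would call the cheapest model or tool that handles it.

```python
# A rough sketch of orchestrating an investigation as explicit, fine-grained steps
# in plain Python rather than one large reasoning-model call. The step functions
# are hypothetical stand-ins (stubbed here so the example runs).
from typing import List

def gate_question(question: str) -> bool:
    """Small-model classification: is this question in scope? (stub)"""
    return True

def search(question: str) -> List[str]:
    """Text + vector search against the knowledge base. (stub)"""
    return ["doc about token costs", "unrelated doc"]

def is_relevant(question: str, doc: str) -> bool:
    """Small-model relevance check on a retrieved document. (stub)"""
    return "token" in doc

def synthesize_logs(question: str) -> str:
    """Small-model summary of logs pulled via ordinary tool calls. (stub)"""
    return "no anomalies in the last 24 hours"

def draft_answer(question: str, docs: List[str], log_summary: str) -> str:
    """The one step that gets the larger, more expensive model. (stub)"""
    return f"Answer grounded in {len(docs)} documents; logs: {log_summary}"

def investigate(question: str) -> str:
    if not gate_question(question):
        return "I don't know."
    docs = [d for d in search(question) if is_relevant(question, d)]
    return draft_answer(question, docs, synthesize_logs(question))

print(investigate("Why did our token costs spike last week?"))
```

The control flow lives in ordinary code, where it’s cheap, debuggable, and easy to guard, and the model is only asked to do one narrow thing per call.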
Don’t run straight to fine-tuning/post-training. RL is a hot topic right now. As we mentioned last week, Cursor’s new custom autocomplete model has rekindled the excitement around fine-tuning models for custom tasks. It’s certainly appealing: you take a smaller model, feed it a bunch of data, and voilà, cheaper, faster inference. Unfortunately, reality is not quite that simple. For one thing, RL and post-training are hard. The promise of the recent wave of RL environment startups for post-training is that they will help remove the complexity here, but we’re not convinced that’s a viable solution. The hard part isn’t running an algorithm to update weights; it’s framing the problem in a way that’s actually going to yield the results you want and getting enough data that fits that framing. (This is not a new challenge in RL.) What’s lost in the hype around Cursor is that they collected an immense amount of data well-suited to RL from the natural use of the product: every tab-complete suggestion was accepted or rejected, which is a very friendly RL framing. By contrast, we have over 1 million question-answer pairs at RunLLM, but only a tiny fraction of those have feedback at all, and only a fraction of those have actionable feedback. For example, we tend to get most of our negative feedback on “I don’t know” answers, which are usually the right response because there isn’t sufficient data to answer. If you’re in a domain where you have enough data and the expertise and resources to use post-training, it’s certainly viable from a unit-margins perspective. But it’s not the panacea everyone’s claiming.
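To make the data-framing point concrete, here’s a hedged sketch of why autocomplete feedback maps so cleanly onto a training signal while support-answer feedback mostly doesn’t. The record shapes are hypothetical illustrations, not real schemas from either product.

```python
# Hypothetical record shapes illustrating how cleanly (or not) product feedback
# maps onto a training signal. These are illustrations, not real schemas.

# Autocomplete: every suggestion is implicitly labeled by what the user does next,
# so nearly all traffic becomes (context, suggestion, accepted?) training data.
tab_complete_event = {
    "prefix": "def is_relevant(question, doc):",
    "suggestion": "    return score(question, doc) > 0.5",
    "accepted": False,  # the user kept typing instead: an unambiguous label
}

# Support answers: only a small fraction of responses get explicit feedback, and a
# thumbs-down on "I don't know" is often the correct behavior when the knowledge
# base lacks the answer, so the label is noisy at best.
support_answer_event = {
    "question": "Does the on-prem deployment support SSO?",
    "answer": "I don't know; I couldn't find documentation covering this.",
    "feedback": "negative",         # present on only a small fraction of answers
    "feedback_is_actionable": False,
}
```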
What’s interesting about these dynamics at the moment is that we’re all focused on costs but not as focused on our own pricing power. Of course, any business is always going to want to reduce its COGS — the more efficient you can be, the better your business scales. This is probably the right place to be given the fierce competitive dynamics in many AI markets. At the same time, while we’re working on technical solutions to reducing COGS, we’re also mindful of the fact that as AI applications mature and the ROI becomes more obvious, we’re likely going to see a corresponding increase in pricing power. The best applications will possibly command a significant premium. That won’t apply in every market of course — only in the ones where quality matters most.
Guesses aside, it’s clear that the economics of AI are changing faster than we would have expected. The sudden slowdown in per-token cost declines has coincided with more mature applications that require more tokens, a double whammy for costs. There will be other solutions (technical and non-technical) that change these dynamics, but for the foreseeable future, we’re all going to be keeping a close eye on our OpenAI bills.