The End of Moore's Law for AI? Gemini Flash Offers a Warning

Original link: https://sutro.sh/blog/the-end-of-moore-s-law-for-ai-gemini-flash-offers-a-warning

Google's recent price increase for its Gemini 2.5 Flash model signals a shift in the AI industry, suggesting the era of perpetually decreasing AI costs is ending. This move, likely driven by unprofitable high-input, low-output workloads and demand exceeding resources, reveals a "soft floor" on LLM inference costs due to hardware limitations, model performance plateaus, and energy expenses. This new economic reality necessitates a change in approach for developers, requiring cost to be considered a fixed constraint in architecture. The era of stable pricing is likely over, potentially replaced by tiered structures. The economic advantages of batch inference and open-source models become significantly stronger. While OpenAI's o3 price decrease might seem contradictory, it's a different class of frontier model with more optimization potential, and possibly influenced by competitive pressure. The future lies in efficient architectures like batch processing and leveraging cost-effective open-source solutions to circumvent the rising costs of real-time APIs.

The Hacker News thread discusses an article claiming Google's Gemini Flash pricing change signals the "end of Moore's Law" for AI. Commenters largely disagree. One points out the pricing update isn't straightforward; Google essentially retired the "non-thinking" mode and adjusted the "thinking" mode price, potentially reflecting actual usage rather than technological limitations. Another emphasizes the article is a sales pitch for the author's company. Several comments highlight the issue of linear API pricing conflicting with the quadratic compute costs of LLMs, especially concerning the KV cache's memory demands. The increased input token price is likely related to managing the large memory footprint created by long prompts. The comparison to Haiku 3.5's pricing is debated, with some arguing Gemini's change is a more direct price adjustment of an existing model. Some users speculate that LLM providers are currently operating at a loss to gain market share. Finally, one user mentions the potential for smaller, high-quality models to become accessible on consumer hardware in the future, potentially negating the need for expensive cloud-based APIs.

Original Article

For the past few years, the AI industry has operated under its own version of Moore's Law: an unwavering belief that the cost of intelligence would perpetually decrease by orders of magnitude each year. Like clockwork, each new model generation promised to be not only more capable but also cheaper to run. Last week, Google quietly broke that trend.

In a move that at first went unnoticed, Google significantly increased the price of its popular Gemini 2.5 Flash model. The input token price doubled from $0.15 to $0.30 per million tokens, while the output price more than quadrupled from $0.60 to $2.50 per million. Simultaneously, they introduced a new, less capable model, "Gemini 2.5 Flash Lite", at a lower price point.

This is the first time a major provider has backtracked on the price of an established model. While it may seem like a simple adjustment, we believe this signals a turning point. The industry is no longer on an endless downward slide of cost. Instead, we’ve hit a fundamental soft floor on the cost of intelligence, given the current state of hardware and software.

In this article, we’ll break down how LLM providers actually price their services, explore why Google likely made this unprecedented move, and discuss what this new economic reality means for anyone building with AI.

The Price is (Not Always) Right: How LLM API Pricing Really Works

From the outside, LLM pricing seems simple: a flat rate per million input and output tokens. In reality, this is a convenient fiction—a blended average designed to simplify a deeply complex cost structure.

To understand why prices go up, you have to understand the real cost drivers behind the scenes.

The simplest formula for a provider's cost is:

API Price ≈ (Hourly Hardware Cost / Throughput in Tokens per Hour) + Margin
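As a rough illustration, here is a minimal Python sketch of that formula. The hardware cost, throughput, and margin figures below are hypothetical placeholders, not any provider's real numbers.

```python
# Minimal sketch of the provider-cost formula above.
# All figures are hypothetical placeholders, not real provider numbers.

def price_per_million_tokens(hourly_hardware_cost_usd: float,
                             throughput_tokens_per_hour: float,
                             margin_per_million_usd: float) -> float:
    """Blended API price per million tokens implied by hardware cost and throughput."""
    cost_per_token = hourly_hardware_cost_usd / throughput_tokens_per_hour
    return cost_per_token * 1_000_000 + margin_per_million_usd

# e.g. a $4/hr accelerator sustaining 20M tokens/hr, with a $0.10/M margin:
print(price_per_million_tokens(4.0, 20_000_000, 0.10))  # ≈ 0.30 ($ per million tokens)
```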

The key variable here is Throughput, which is not a single number. It’s a function of several factors:

  1. Hardware: The raw power of the GPU/TPU (e.g., NVIDIA H100 vs. A100).

  2. Model: The size and architecture of the LLM.

  3. Inference Framework: The software stack used to run the model (e.g., vLLM, SGLang, TensorRT-LLM).

  4. Workload Shape: This is the most critical and misunderstood variable. It refers to the ratio of input tokens (prefill) to output tokens (decode). Each phase runs into quadratic (O(n²)) attention costs, but with subtle differences that make decoding more latency-intensive than prefill (see the toy sketch after this list):

    • Prefill (Input): When you send a prompt, the model can process all input tokens in parallel. This phase is fast but compute-intensive. Its cost is best measured by Time-to-First-Token (TTFT).

    • Decode (Output): Generating the response is a serial process—each new token depends on all the previous ones. This phase is often the bottleneck for latency because each new token is predicted one at a time. Its cost is measured by Inter-Token Latency (ITL).

  5. Demand Planning: Keeping extra machines running to absorb unexpected load adds cost for providers that want to keep latency low. More predictable demand yields cost savings for larger providers, who either control their own hardware or can negotiate lower rates with cloud services.
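As a toy illustration of why workload shape matters, here is a hedged sketch of a deliberately simplified latency model: all prefill tokens processed in parallel at a fixed rate, decode tokens generated one at a time. The prefill rate and inter-token latency are invented for illustration, not measured values.

```python
# Toy latency model for a single request: parallel prefill, serial decode.
# The prefill rate and inter-token latency below are invented, not measured.

def request_seconds(input_tokens: int, output_tokens: int,
                    prefill_tokens_per_s: float = 50_000,
                    inter_token_latency_s: float = 0.02) -> float:
    prefill = input_tokens / prefill_tokens_per_s    # all input tokens in parallel (TTFT)
    decode = output_tokens * inter_token_latency_s   # one output token at a time (ITL)
    return prefill + decode

# Same total token count, very different time spent on the accelerator:
print(request_seconds(9_000, 1_000))   # prefill-heavy (e.g. summarization) ≈ 20.2 s
print(request_seconds(1_000, 9_000))   # decode-heavy (e.g. long generation) ≈ 180.0 s
```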

In Google’s case, the model is fixed (Flash), as is the hardware (TPUs) and inference framework. What’s unknown for any new model is workload shape and demand.

The Hidden, Quadratic Cost Of LLM Workloads

Generating tokens involves computing attention between each new token and all of the tokens that came before it. The number of computations needed to produce all of the attention scores in a sequence scales like N x N, where N is the total number of tokens in the sequence. Effective throughput therefore degrades sharply as sequences get longer.
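To make the scaling concrete, here is a tiny back-of-the-envelope sketch (ignoring heads, layers, and KV-cache optimizations) of how the raw count of attention scores grows with sequence length.

```python
# Back-of-the-envelope: pairwise attention scores grow like N x N with sequence length.

def attention_score_count(total_tokens: int) -> int:
    return total_tokens * total_tokens

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} tokens -> {attention_score_count(n):>16,} scores")
# 10x more tokens means ~100x more attention work, so tokens/sec throughput falls sharply.
```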

Most things that we are accustomed to buying do not work this way: the hundredth unit of a good costs the supplier essentially the same as the first. A gas station, for example, earns the same margin whether it sells one gallon, ten gallons, or a hundred gallons.

For LLM providers, however, the cost of serving an API call grows quadratically with sequence length. Yet providers price their services linearly: the end consumer pays a fixed rate for every input or output token they use.

If a gas station worked this way, buying one gallon of gas at a time would lead to a much higher margin for the station than buying ten gallons of gas at a time. At some point, buying too much gas at a time would lead to a negative margin on the sale for the gas station.

Obviously, most consumables do not work this way. However, while less common, there are situations we deal with every day where the behind-the-scenes cost is quadratic.

Take traffic. Having too many cars on the road eventually leads to congestion – slower speeds. When congestion occurs, every additional car added to the road leads to a quadratic reduction in speed.

Transit authorities face a conundrum similar to that of LLM API providers when pricing tolls. Tolls pay for road repairs plus margin, but they also help gate usage of the roads. Higher tolls theoretically lead to less congestion, as with New York City’s congestion pricing. Set tolls too high and not enough traffic gets through to capture sufficient revenue; set them too low and congestion slows traffic so much that fewer vehicles, and therefore fewer tolls, get through each hour.

LLM providers must do a similar calculation. The linear price they charge customers for token usage must be high enough to make enough margin off of shorter tasks to offset the hit to margin from longer tasks.

Additionally, providers need to set prices to account for constrained resources. First, they need to serve multiple models, from small workhorse models to large models at the frontier of intelligence. Second, they need resources for training. Higher pricing can also incentivize API consumers to use smaller, more affordable model offerings for simpler tasks.

Setting a price per token for LLM API calls involves looking at the types of tasks that customers perform. In simple terms: how many total tokens are used per task, and of those tokens how many are input (prefill) and output (decode). LLM providers perform statistical analyses to set a blended rate that they hope will be profitable.
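Here is a hedged sketch of what such an analysis might look like in miniature, using the old and new Gemini 2.5 Flash prices quoted above but an entirely invented task mix and per-request serving cost.

```python
# Toy blended-rate analysis: does a linear price cover a mixed bag of task shapes?
# Prices are the Gemini 2.5 Flash figures quoted above; the task mix and serving
# costs are invented for illustration.

tasks = [
    # (share of traffic, input_tokens, output_tokens, provider_serving_cost_usd)
    (0.6,  2_000,   500, 0.0010),   # short, balanced chat-style tasks
    (0.3, 20_000, 1_000, 0.0060),   # summarization / extraction
    (0.1, 80_000, 2_000, 0.0250),   # very long-context batch jobs
]

def expected_margin(input_price_per_m: float, output_price_per_m: float) -> float:
    """Expected profit per request at a linear price, over the assumed task mix."""
    total = 0.0
    for share, inp, out, cost in tasks:
        revenue = inp / 1e6 * input_price_per_m + out / 1e6 * output_price_per_m
        total += share * (revenue - cost)
    return total

print(expected_margin(0.15, 0.60))  # old pricing: negative for this (invented) mix
print(expected_margin(0.30, 2.50))  # new pricing: the blend turns positive
```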

Our guess is that Google’s initial assumptions about workload and demand did not pay off.

Reading the Tea Leaves: Why We Think Google Raised Prices

When Google launched Gemini 2.5 Flash, it was positioned as a fast, cost-effective "workhorse" model. There were likely assumptions baked in around:

  1. The types of tasks for which developers would use Flash

  2. How much demand there would be for Flash

Our best guess is that one or both of these factors were off. Workhorse models tend to excel at batch tasks like summarization, classification, and data extraction. Those tasks usually have much higher input-to-output ratios than the tasks for which you would regularly reach for a large, premier model like Gemini 2.5 Pro.

Because users pay a linear price for input and output tokens, a user summarizing a large corpus of documents paid the same per input token as a more balanced application, despite the compute-heavy prefill step. This is a stab in the dark, but Google was likely finding that these high-input, low-output workloads were unprofitable at the original, blended price.

Additionally, higher-than-expected demand for batch tasks affects throughput immensely. A provider can always add TPU or GPU capacity, but catching up with current levels of demand takes time and significant capital expenditure. Assuming AI usage is not going to decrease, a demand-planning shortfall cannot be fixed simply by placing an order for more hardware.

The price hike is most likely a direct correction for the outsized demand for Flash relative to task shape. By significantly increasing the input token price, Google is re-aligning the cost with the actual computational burden. The introduction of "Flash Lite" is a classic market segmentation strategy: if you want the absolute lowest price for your high-input batch jobs, you must now accept a less capable model. If you want the full power of Gemini 2.5 Flash, you have to pay a price that reflects its true operational cost.
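To put rough numbers on the repricing, here is a quick sketch using the prices quoted earlier; the corpus size and summary length are hypothetical.

```python
# Cost of a prefill-heavy job under the old vs. new Gemini 2.5 Flash prices.
# Prices are from the article; the job size (10k docs, ~8k tokens in / ~300 out each)
# is a hypothetical example.

OLD = {"input": 0.15, "output": 0.60}   # $ per million tokens
NEW = {"input": 0.30, "output": 2.50}

def job_cost(prices: dict, input_tokens: int, output_tokens: int) -> float:
    return input_tokens / 1e6 * prices["input"] + output_tokens / 1e6 * prices["output"]

inp, out = 10_000 * 8_000, 10_000 * 300
print(f"old: ${job_cost(OLD, inp, out):.2f}")   # ~$13.80
print(f"new: ${job_cost(NEW, inp, out):.2f}")   # ~$31.50, more than 2x for this shape
```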

The Floor is Not Lava, It's Silicon: Have We Hit A Cost Plateau?

Google’s price hike shatters the illusion of an ever-decreasing cost for intelligence. It reveals that the cost of LLM inference has a soft floor, dictated by the immutable laws of physics and economics.

We are no longer in an era of easy wins where a simple software update or a slightly better model yields massive cost reductions. Here’s why:

  • Hardware is the Bottleneck: The speed of LLMs is fundamentally limited by physical constraints on memory bandwidth. You simply cannot move petabytes of model weights instantly. Additionally, hardware purchases made to solve demand problems would have to outpace the ever-increasing demand for AI models, which is unlikely, at least for a while.

  • Models are Hitting a Performance Wall: For a given model size, capabilities are beginning to asymptote because we are running out of novel data to train on, and training on more data is yielding diminishing returns.

  • Energy Costs are Real: Data centers consume vast amounts of electricity. This is a hard, physical-world cost that doesn’t disappear with a software update. As models get bigger and training runs get longer, their energy appetite grows, putting upward pressure on operational costs.

This new reality has several profound consequences for the industry:

  1. Cost is Becoming a Fixed Constraint: The most significant consequence is a required shift in mindset for developers. For a given tier of intelligence—like the "workhorse" capability of Gemini Flash—the cost has now hit a fundamental floor. Teams can no longer build products assuming that a feature that is too expensive today will become affordable tomorrow at the same price point, simply due to the relentless march of progress. The order-of-magnitude cost for mid-tier intelligence must now be treated as a fixed constraint. This means cost management is no longer just about optimization; it's a core architectural decision that should be baked into product roadmaps from day one.

  2. The End Of Compute Subsidization: Google’s move is likely a leading indicator, not an exception. As other providers gather more granular data on how their models are actually used, we should expect them to make similar adjustments to ensure profitability. The era of stable, predictable pricing may be coming to an end, replaced by more complex, tiered pricing structures aligned with specific use cases.

  3. The Economic Case for Batch & Open Source is Stronger Than Ever: If the cost of real-time inference from proprietary providers has a hard floor, then the relative savings from alternative architectures become much larger. For any task that isn't latency-sensitive, the strategic path forward is clear:

    • Batch Inference: Processing jobs in bulk allows providers like Sutro to maximize GPU utilization, use cheaper spare capacity, and avoid the "always-on" tax of real-time APIs. This translates into massive savings of 50-90% or more.

    • Open-Source Models: As our analysis of workhorse LLMs showed, open-source models like Qwen3 and Llama 3.3 often provide better or equivalent performance for common tasks at a fraction of the cost, without vendor lock-in and with greater control over data privacy.

But What About o3?

At about the same time Google hiked prices for Gemini Flash, OpenAI decreased the price for o3. While this might seem like a counterexample to our analysis on Flash, there are reasons to be skeptical of this conclusion.

First, o3 is a completely different class of model. It sits at the frontier of intelligence, whereas Flash is meant to be a workhorse. Consequently, there is more room for optimization than in Flash’s case: more headroom for pruning, distillation, and similar techniques.

Second, OpenAI has been behind other providers lately in offering affordable foundation models. It is unclear how much of the price drop is due to optimizations or simple sales pressure. OpenAI can afford to take negative margins while playing catch up, whereas Google is a public company that cannot (and does not) play the same compute subsidization games.

Conclusion: Navigating the New Cost Landscape

Google's decision to raise the price of Gemini 2.5 Flash wasn't just a business decision; it was a signal to the entire market. The relentless march toward zero-cost intelligence has hit the wall of economic reality. The cost of running these powerful models is real, and providers can no longer afford to subsidize every type of workload.

This new era demands a smarter approach. Instead of hoping for cheaper models, the path forward lies in better architecture. For the vast majority of AI tasks that don’t require an immediate response, the answer isn’t a more expensive real-time API. It's a more efficient paradigm.

By embracing batch processing and leveraging the power of cost-effective open-source models, you can sidestep the price floor and continue to scale your AI initiatives in ways that are no longer feasible with traditional APIs.

If you’re building batch tasks with LLMs and are looking to navigate this new cost landscape, feel free to reach out to see how Sutro can help.
