The Economics of Speculative Decoding

Original link: https://fergusfinn.com/blog/economics-of-speculative-decoding/

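Speculative decoding was once regarded as a "free" performance gain for dense transformers, but with the shift to modern architectures such as mixture-of-experts (MoE) models and attention compression (for example DeepSeek's MLA), it now faces a new economic reality. Previously, speculative decoding exploited the memory-bound nature of dense models, where verifying tokens cost almost nothing. Two factors have changed that:

1. **The cost of MoE:** Unlike dense layers, MoE routing means that arithmetic intensity depends on the hidden-state inputs. At small batch sizes, new tokens activate "fresh" experts, which incurs a significant weight-transfer cost. Rejected tokens are therefore no longer "free", and the relative value of accepted tokens falls as well.
2. **Attention compression:** Techniques such as multi-head latent attention (MLA) reduce the memory headroom that used to be available for speculated tokens. Verification now tends to be compute-bound rather than memory-bound, so every speculated token carries a non-negligible cost.

**Conclusion:** Speculative decoding is no longer a universal optimisation. Because of rejection penalties and higher verification costs, system performance now depends on dynamic, guided decisions about the optimal speculation length ($\gamma^*$). Engineers must carefully weigh draft-model overhead against the marginal utility of verified tokens to keep the system efficient.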

Original article

Speculative decoding is one of the cleanest performance wins in inference optimisation: it’s lossless, it hits decode latency when not much else does, and in its standard formulation it’s simple and elegant.

It works by looking forwards: speculative decoding takes a position on what tokens will come next. For dense transformers the bet is riskless: accepted tokens pay off, rejected tokens cost nothing, a clean arbitrage on spare memory bandwidth.
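
To put a number on that bet, here is a minimal sketch under a common simplifying assumption: each drafted token is accepted independently with probability `alpha`, and the verifying model always contributes one token of its own (a correction or a bonus token). Both `alpha` and `gamma` (the number of drafted tokens) are hypothetical parameters introduced for illustration; they are not defined in the text above.

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification pass.

    Assumes i.i.d. acceptance with probability `alpha` for each of the
    `gamma` drafted tokens. A prefix of length i survives with probability
    alpha**i, and the verifier always adds one token itself, so the
    expectation is sum_{i=0}^{gamma} alpha**i.
    """
    return sum(alpha ** i for i in range(gamma + 1))


if __name__ == "__main__":
    # Illustrative acceptance rate only; real rates depend on the draft model.
    for gamma in (0, 1, 2, 4, 8):
        print(gamma, round(expected_tokens_per_step(0.8, gamma), 4))
```

The returns diminish as `gamma` grows, which is why the cost of each extra speculated token matters once that cost is no longer zero.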

A burst of research activity has recently pushed the envelope on how far forwards we can take that bet, for example Eagle 3.1, DFlash, SSD.

This post looks at two architectural shifts that have changed the underlying economics of speculation: what mixture-of-experts routing does to the decode roofline, and how compressed attention takes away the slack that used to make speculated tokens free.

Then it works through what they mean for when, and how far ahead, we should speculate.

The expert tax

FFN layers in older, dense transformers (like the venerable Llama

The win for speculative decoding is clear. If you’re on the slope of the roofline you’re memory bound, and speculated tokens increase the amount of compute you’re doing without increasing the memory transfer. So both accepted & rejected tokens are free until they push you over the knee.
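
The "free until the knee" claim can be sketched with a simple roofline estimate for a dense weight stack. This is an assumption-laden toy: it counts only weight traffic (activations and KV cache are ignored), and the hardware figures are hypothetical placeholders rather than any real accelerator's specification.

```python
def dense_step_time(tokens, params, bytes_per_param, peak_flops, mem_bw):
    """Roofline time estimate for one pass over dense weights."""
    flops = 2.0 * tokens * params            # one multiply-add per weight per token
    bytes_moved = params * bytes_per_param   # weights are read once, whatever the token count
    return max(flops / peak_flops, bytes_moved / mem_bw)


def knee_tokens(bytes_per_param, peak_flops, mem_bw):
    """Token count at which compute time catches up with weight-transfer time."""
    return bytes_per_param * peak_flops / (2.0 * mem_bw)


# Hypothetical accelerator and model size, for illustration only.
PEAK_FLOPS = 1.0e15   # FLOP/s
MEM_BW = 2.0e12       # bytes/s
PARAMS = 1.0e9        # weights touched per pass
BYTES_PER_PARAM = 2   # 16-bit weights

if __name__ == "__main__":
    print("knee:", knee_tokens(BYTES_PER_PARAM, PEAK_FLOPS, MEM_BW))
    for t in (1, 8, 64, 1000):
        print(t, dense_step_time(t, PARAMS, BYTES_PER_PARAM, PEAK_FLOPS, MEM_BW))
```

Below the knee the step time does not move as tokens are added, which is the sense in which both accepted and rejected speculated tokens cost nothing in the dense case.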

Modern models almost invariably

This routing means that the arithmetic intensity of the MoE layer can depend on the actual content of the hidden state inputs, not just the shape. In practice, one training objective (for training and large scale inference reasons) is to keep the experts balanced: that is, if $B$ tokens come in, each expert of $E$ total should process a fraction $B/E$ of the total.

From here on, take DeepSeek-V4-Flash as an example: $k=6$

  1. Barely amortising at the bottom. At small batch each new token added to the batch tends to activate fresh experts (at batch 2 the chance the new token’s experts already match is small), so it drags its own weights across the bus and gets little to no amortisation. The intensity leaves the origin at only half its eventual slope, so a token added here, speculated or not, pays close to full freight for its experts.
  2. Shallower slope / distant knee, same ceiling. Once every expert is being triggered, the MoE line climbs more gently, reaching the same ceiling only at a far larger batch. The free-token band is much wider.
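
Both regimes can be illustrated with a small sketch, assuming perfectly balanced, independent routing in which every token picks `k` distinct experts uniformly at random out of `num_experts`. This is my simplification, not the post's exact model: it ignores any shared expert, so it will not reproduce the post's slope factor exactly. The expert count `E = 64` below is a placeholder for illustration, not the real model's value.

```python
def expected_active_experts(batch: int, k: int, num_experts: int) -> float:
    """Expected number of distinct experts touched by `batch` tokens.

    Under uniform, independent top-k routing, a given expert is missed by
    one token with probability (1 - k / num_experts).
    """
    miss = 1.0 - k / num_experts
    return num_experts * (1.0 - miss ** batch)


def tokens_per_loaded_expert(batch: int, k: int, num_experts: int) -> float:
    """Average number of token-slots served per expert whose weights are read."""
    return batch * k / expected_active_experts(batch, k, num_experts)


K = 6    # routed experts per token, as in the example above
E = 64   # placeholder expert count (assumption, for illustration only)

if __name__ == "__main__":
    for b in (1, 2, 8, 64, 512):
        print(b,
              round(expected_active_experts(b, K, E), 2),
              round(tokens_per_loaded_expert(b, K, E), 2))
```

At batch 1 and 2 almost every token-slot loads its own expert, so there is close to nothing to amortise. Once the batch is large enough that all experts are active, the amortisation grows linearly as `batch * k / num_experts`.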

Dense climbs steeply; the MoE is shallower by a factor $(k+1)/(E+1)$

The whole idea of speculative decoding is to amortise the weight transfer in autoregressive decoding between multiple steps. Notably, the chart tells us that at batch size 1 this barely works for the MoE layers. But as batch size grows past this low region, there's a much larger space in which speculative decoding might pay.

The implications for speculative decoding are that:

  1. The win when speculative tokens are accepted is no longer so big.
  2. The penalty when speculative tokens are rejected is no longer zero.
  3. Both the win & the penalty from speculative decoding change nonlinearly with batch size.

The changing face of attention

The ‘expert tax’ at low batch size is part of the story that’s changed. The other part is attention. A recap: the term for the ratio of FLOPs to memory transferred for an operation is arithmetic intensity. You can figure out whether an operation is memory bound or compute bound by comparing its arithmetic intensity to the ratio of available FLOPs to memory bandwidth on the hardware you’ll run the operation on.

Generically, we can write the arithmetic intensity of the attention operation as:

$$AI = \frac{f \cdot TS}{m_c \cdot S + m_q \cdot T}$$
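
As a rough sketch, the expression can be evaluated directly and compared against a hardware ridge point. The symbol meanings here are my assumed reading, since this excerpt does not define them: `T` is the number of query tokens per sequence in the step (one plus any speculated tokens), `S` is the cached context length, `f` is FLOPs per query-cache pair, and `m_c` and `m_q` are bytes moved per cached token and per query token. All numeric values are placeholders.

```python
def attention_intensity(T, S, f, m_c, m_q):
    """Arithmetic intensity AI = f*T*S / (m_c*S + m_q*T), in FLOPs per byte."""
    return (f * T * S) / (m_c * S + m_q * T)


def is_memory_bound(intensity, peak_flops, mem_bw):
    """An operation is memory bound when its intensity is below the
    hardware ratio of peak FLOP/s to bytes/s (the roofline ridge point)."""
    return intensity < peak_flops / mem_bw


if __name__ == "__main__":
    # Placeholder constants, chosen only to show the shape of the formula.
    f, m_c, m_q = 4.0, 2.0, 2.0
    S = 1_000_000
    for T in (1, 2, 4, 8):
        ai = attention_intensity(T, S, f, m_c, m_q)
        print(T, round(ai, 4), is_memory_bound(ai, peak_flops=1.0e15, mem_bw=2.0e12))
```

When `S` is much larger than `T`, the expression approaches `f * T / m_c`, so intensity grows roughly linearly with the number of query tokens and shrinking `m_c` (a more compressed cache) raises it for the same `T`.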