Strengths and limitations of diffusion language models (Sean Goedecke)

原始链接: https://www.seangoedecke.com/limitations-of-text-diffusion-models/

Gemini Diffusion demonstrates the speed potential of diffusion models for text generation. Unlike autoregressive models, which generate text one token at a time, diffusion models produce a complete output at every step, allowing correct tokens to be generated in parallel and letting quality be traded off against fewer denoising steps. While diffusion models excel at quickly generating fixed-length outputs (especially longer texts), they can be slower than autoregressive models for short outputs. Because attention scores cannot be effectively cached during denoising, diffusion models also struggle with long context windows. Whether diffusion models can "reason" the way autoregressive models do (e.g. chain-of-thought) remains unclear, since their block-by-block generation paradigm does not easily support iterative self-correction. Although diffusion models typically use a transformer internally for noise prediction, their overall architecture is fundamentally different from that of autoregressive transformers, which shapes their behavior. Future diffusion models may gain stronger reasoning ability through more compute-intensive denoising.

Here's a short summary of the Hacker News discussion: A user asked why diffusion language models remain prevalent despite flow matching's perceived superiority in image generation. Another user suggested that the established expertise and fine-tuning built up around diffusion models might explain their current dominance. Links to a previous discussion and to research papers were shared, suggesting that diffusion models may exhibit better reasoning by avoiding the early-token bias of autoregressive models. Other users discussed the potential for combining diffusion and transformer architectures, possibly alternating their roles within a single interface depending on context. The author of the linked blog post clarified that current diffusion model implementations require attention score calculations across the entire sequence, limiting cacheability advantages compared to autoregressive models, even when denoising only a portion of the text. Users thanked the author for the post.
Related articles
  • (Comments) 2025-05-22
  • Gemini Diffusion 2025-05-22
  • Diffusion models explained simply 2025-05-19
  • Diffusion models 2024-05-27
  • (Comments) 2024-01-23

  • Original article

    Google recently released Gemini Diffusion, which is impressing everyone with its speed. Supposedly they even had to slow down the demo so people could see what was happening. What’s special about diffusion models that makes text generation so much faster? Should every text model be a diffusion model, going forward?

    I previously wrote a simple explainer of diffusion models here. If you don’t have any intuitions about how diffusion models are different, I suggest starting with that. This post will go into more detail about how those differences affect performance and quality in model outputs.

    Why diffusion models are fast

    The biggest difference between diffusion models and traditional autoregressive models (like 4o, Claude, and all current transformer-based models) is that diffusion models generate the entire output at each step. For an output like “abcd”, an autoregressive architecture will generate token-by-token: “a”, “ab”, “abc”, and finally “abcd”. A diffusion model will generate the whole thing, growing more accurate at each step: “xycz”, “aycd”, then “abcd”. This has two interesting consequences for speed:

    1. Unlike normal autoregressive models, diffusion models can generate correct parts of the final token sequence in parallel (e.g. the start and the end) during the same pass
    2. Unlike autoregressive models, diffusion models can be trained to make fewer passes (in exchange for producing a lower-quality output)

    You can see the first point in the slowed-down Gemini Diffusion demo. The generation in the first frame is at least partially accurate (i.e. it’s generated a bunch of the “correct” tokens all in one go). And the second point is easy to imagine: just stop halfway through the demo and imagine that’s the output you get. Twice as fast, if you’re happy for there to be some errors in the final output.
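    To make the contrast concrete, here is a minimal sketch (in Python) of the two generation loops. `predict_next_token`, `denoise`, and `init_noise` are hypothetical stand-ins for real model calls, not an actual API:

    ```python
    # Toy contrast between autoregressive and diffusion-style generation.
    # `predict_next_token`, `denoise`, and `init_noise` are hypothetical stubs.

    def generate_autoregressive(prompt_tokens, n_new, predict_next_token):
        """One forward pass per generated token: n_new passes in total."""
        tokens = list(prompt_tokens)
        for _ in range(n_new):
            tokens.append(predict_next_token(tokens))  # depends on all prior tokens
        return tokens

    def generate_diffusion(prompt_tokens, block_len, n_passes, denoise, init_noise):
        """A fixed number of passes, each refining the entire block at once."""
        block = init_noise(block_len)              # start from noise, e.g. "xycz"
        for _ in range(n_passes):                  # fewer passes: faster but noisier
            block = denoise(prompt_tokens, block)  # may fix tokens anywhere in parallel
        return list(prompt_tokens) + block
    ```

    Cutting `n_passes` in half is exactly the “stop halfway through the demo” trade: twice the speed, more residual errors.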

    Fixed-length vs arbitrary length responses

    The other main difference between diffusion and autoregressive models is that a diffusion model always generates a fixed-length output (say, 256 tokens). Technically autoregressive models generate fixed-length outputs as well (one token), but in practice they’re designed to generate token sequences of varying lengths. This has implications both for speed and quality.

    Diffusion models are always going to be faster than autoregressive models when the output is at least as long as their fixed block size, for the reasons I laid out in the previous section. If a diffusion model needs to generate 512 tokens, it can do that in two chunks (24 passes) instead of needing 512 passes. However, if you only need to generate a handful of tokens, autoregressive models might be faster. If a diffusion model always makes 12 passes, it’s going to do twice as much work as an autoregressive model in order to generate a six-token response.
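    As a back-of-the-envelope check (assuming, purely for illustration, a 256-token block and 12 denoising passes per block, matching the numbers above), the crossover is easy to see:

    ```python
    import math

    BLOCK_LEN = 256        # fixed diffusion output block (illustrative)
    PASSES_PER_BLOCK = 12  # denoising passes per block (illustrative)

    def diffusion_passes(n_tokens):
        # Every block costs the full set of passes, even for tiny outputs.
        return math.ceil(n_tokens / BLOCK_LEN) * PASSES_PER_BLOCK

    def autoregressive_passes(n_tokens):
        return n_tokens  # one forward pass per generated token

    for n in (6, 256, 512):
        print(n, diffusion_passes(n), autoregressive_passes(n))
    # 6   -> 12 vs 6   (diffusion does twice the passes)
    # 256 -> 12 vs 256
    # 512 -> 24 vs 512 (two chunks of 12 passes)
    ```

    A pass is not the same amount of work in both architectures, but the pass counts capture the basic shape of the trade-off.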

    Performance on long contexts

    Because they generate output in blocks, diffusion models are slower at ingesting long context windows. The reason why is pretty technical. Consider how attention works in an autoregressive language model. Each token is “checked” against all previous tokens in the sequence in order to determine which previous tokens are most relevant. For instance, if the model is about to generate a name, the previous usages of that name in the sequence will all have high attention scores, because they’re being used to determine what name to generate now.

    The reason this isn’t straightforwardly quadratic is the “key-value cache”: because autoregressive models generate token-by-token, attention scores for previously-generated tokens don’t have to be checked again.

    Diffusion models can’t benefit from the key-value cache as easily, because the current block of tokens being generated can all change during each denoising pass, and thus can’t be cached. So diffusion models must re-calculate attention against the entire context window for each token in the block of tokens being generated, every denoising pass. That adds up to many more flops than the equivalent autoregressive model would spend.
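    Here’s a rough pseudocode sketch of the difference. `project_kv`, `attend`, `next_token`, and `refine` are hypothetical stand-ins, and real implementations batch all of this:

    ```python
    def autoregressive_decode(prompt, n_new, project_kv, attend, next_token):
        kv_cache = [project_kv(t) for t in prompt]  # computed once, reused forever
        tokens = list(prompt)
        for _ in range(n_new):
            tok = next_token(attend(tokens[-1], kv_cache))  # only the newest token attends
            kv_cache.append(project_kv(tok))  # finished tokens never change again
            tokens.append(tok)
        return tokens

    def diffusion_denoise(prompt, block, n_passes, project_kv, attend, refine):
        for _ in range(n_passes):
            # Every token in the block may change on this pass, so its keys and
            # values are stale: attention over the whole sequence has to be
            # recomputed from scratch, every pass.
            kv = [project_kv(t) for t in prompt + block]
            block = [refine(attend(t, kv)) for t in block]
        return block
    ```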

    Can diffusion models reason?

    A striking recent development in autoregressive models has been the introduction of the “reasoning model”: an autoregressive model that’s been trained to produce a chain-of-thought internal monologue before producing a user-facing answer. It’s intuitive to understand why autoregressive models can do this: they think about each token they produce, so at any point they can “change their mind” and take a new position, usually by outputting a token like “But” or “Wait” or “Hold on”.

    What about diffusion models? I don’t think it’s clear yet. Maybe we’ll see strong reasoning models built on diffusion. But if we don’t, it’ll be because the “changing your mind” reasoning paradigm doesn’t map nicely onto block-by-block generation. Why would a diffusion model generate a token block with “Wait, I was wrong” in the middle of it? Wouldn’t that get “edited out” in the denoising pass?

    It’s possible that diffusion models could change their minds in a totally different way. When a diffusion model makes multiple passes over an output and updates tokens, is it changing its mind like a reasoning model would? How much reasoning work can be embedded into a denoising pass? There’s at least some current research exploring this direction.

    One reason to be broadly skeptical about the potential of diffusion models to reason is precisely that they do much less work per-token than autoregressive models do. That’s just less space for the model to spend “thinking”. However, that’s not necessarily an integral feature of diffusion. Right now diffusion models are built for speed, but we could imagine a diffusion model built to make hundreds of thousands of passes over each generated block of tokens. A model like that could plausibly do quite a lot of reasoning.

    Yes, text diffusion models sometimes use transformers

    One final technical point: it’s not completely correct to talk about “diffusion models” vs “transformer models”, like I did here. When diffusion models do that pass over the entire input, they use an internal model to predict which parts of the input are noise and should be removed. That internal model is often a transformer model, as in Mercury Coder. Unlike “normal” autoregressive transformer models, the transformer inside a diffusion model doesn’t predict token logits, but instead predicts where the noise is in the input.
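    As a sketch of that distinction (hypothetical interfaces; real systems like Mercury Coder differ in the details), the same transformer backbone can be wrapped two different ways:

    ```python
    # `backbone` is a hypothetical transformer producing one hidden vector
    # per input position; `vocab_proj` and `noise_proj` are hypothetical heads.

    def autoregressive_step(backbone, tokens, vocab_proj):
        # Predicts logits over the vocabulary for the single *next* token.
        hidden = backbone(tokens)
        return vocab_proj(hidden[-1])

    def diffusion_step(backbone, noisy_tokens, noise_proj):
        # Predicts, for *every* position at once, how noisy the current token
        # is and what it should move toward, not a next-token distribution.
        hidden = backbone(noisy_tokens)
        return [noise_proj(h) for h in hidden]
    ```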

    However, from the perspective of an AI developer (instead of someone training models at an AI lab) this is kind of an academic distinction. The behavioral characteristics of a diffusion model are the same whether the underlying noise-predicting model is a transformer or not, because the overall diffusion architecture is different enough to dominate the differences in behavior.

    Summary

    • Diffusion models are fast because they can output multiple tokens “in parallel”, instead of going token-by-token
    • They’re easily tunable to do fewer editing (i.e. denoising) passes if you want more speed at the cost of quality
    • However, if you only want two or three tokens, autoregressive models will likely be faster, because a diffusion model needs to do all of its denoising passes no matter what
    • Diffusion models (at least ones using transformers) will be slower with lots of tokens in context, because outputting many tokens in parallel requires a lot of uncacheable attention work
    • It’s unclear how easy it is to build a reasoning model on top of diffusion. I bet there’s some really interesting private research going on here at AI labs. Intuitively, they won’t do chain-of-thought reasoning as nicely as autoregressive models, but there might be other approaches available to spend test-time-compute here
    • Diffusion models can and do use transformers, but it doesn’t make them operate like autoregressive models at all

    If you liked this post, consider subscribing to email updates about my new posts.
