
Original link: https://news.ycombinator.com/item?id=44060533

Here's a short summary of the Hacker News discussion: A user asked why language models still start from diffusion rather than flow matching, given flow matching's perceived superiority in image generation. Another user suggested that the field's greater experience with diffusion training may explain its current dominance. A link to a previous discussion and to research papers was shared, suggesting that diffusion models may reason better because they avoid the early-token bias of autoregressive models. Other users discussed combining diffusion and transformer approaches, possibly alternating roles behind a single interface depending on the context. The author of the linked blog post clarified that current diffusion implementations require attention scores across the entire sequence even when denoising only a portion of the text, which limits their cacheability compared to autoregressive models. Commenters also praised the post.


Original text
Strengths and limitations of diffusion language models (seangoedecke.com)
70 points by rbanffy 1 day ago | 7 comments

I'm curious: in image generation, flow matching is said to be better than diffusion, so why do these language models still start from diffusion instead of jumping to flow matching directly?


This is just a guess, but I think it's because diffusion training has been more popular, so we've worked out more of the kinks with those models. Flow matching models might follow once their hyperparameters are figured out.
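For context on the question, the two approaches differ mainly in the training target: diffusion models are typically trained to predict the noise added to the data, while flow matching models are trained to predict a velocity along a path from noise to data. A minimal sketch, assuming a DDPM-style noise-prediction loss and a rectified-flow parameterization; `model`, `alphas_cumprod`, and the tensor shapes are illustrative placeholders, not any specific implementation:

```python
import torch

def diffusion_loss(model, x0, t, alphas_cumprod):
    # DDPM-style objective: corrupt x0 with Gaussian noise at timestep t,
    # then train the model to predict that noise.
    noise = torch.randn_like(x0)
    a = alphas_cumprod[t].view(-1, 1)
    x_t = a.sqrt() * x0 + (1 - a).sqrt() * noise
    return ((model(x_t, t) - noise) ** 2).mean()

def flow_matching_loss(model, x0, t):
    # Flow-matching objective (rectified-flow form): interpolate linearly
    # between noise and data, then train the model to predict the constant
    # velocity (x0 - noise) along that path.
    noise = torch.randn_like(x0)
    t = t.view(-1, 1)
    x_t = (1 - t) * noise + t * x0
    return ((model(x_t, t) - (x0 - noise)) ** 2).mean()
```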


A big discussion on this happened here as well https://news.ycombinator.com/item?id=44057820

There is quite a bit of evidence that diffusion models work better at reasoning because they don't suffer from early-token bias.

https://github.com/HKUNLP/diffusion-vs-ar
https://arxiv.org/html/2410.14157v3



Great overview. I wonder if we'll start to see more text diffusion models from other players, or maybe even a mixture of diffusion and transformer models alternating roles behind a single UI, depending on the context and request.


The diffusion models are (or can be) transformer models! They're just not autoregressive.
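A minimal sketch of the point being made here: the backbone can be the same transformer, and what mainly changes is the attention mask (along with the training objective). The function and simplified mask below are illustrative assumptions, not any particular model's implementation:

```python
import torch

def attention_mask(seq_len: int, autoregressive: bool) -> torch.Tensor:
    if autoregressive:
        # Autoregressive LM: causal mask, token i attends only to tokens 0..i,
        # which is what makes left-to-right generation and KV caching possible.
        return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
    # Diffusion LM: full bidirectional attention; the model sees the whole
    # (partially noised) sequence and denoises all positions in parallel.
    return torch.ones(seq_len, seq_len, dtype=torch.bool)
```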


That's a nice explanation. I wonder whether autoregressive and diffusion language models could be combined such that the model only denoises the (most recent) end of a sequence of text, like a paragraph, while the rest is unchangeable and allows for key-value caching.


Hi, I wrote the post. Thank you!

That’s how it does work, but unfortunately denoising the last paragraph requires computing attention scores for every token in that paragraph, which requires checking those tokens against every token in the sequence. So it’s still much less cacheable than the equivalent autoregressive model.
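A rough back-of-the-envelope sketch of the cacheability gap described above; the function names, token counts, and the assumption that the fixed prefix's keys/values can be cached across denoising steps are all illustrative, not from the post:

```python
def autoregressive_scores(prefix_len: int, suffix_len: int) -> int:
    # With a KV cache, each newly generated token computes one query
    # against all previously cached keys, exactly once.
    return sum(prefix_len + i for i in range(1, suffix_len + 1))

def suffix_denoising_scores(prefix_len: int, suffix_len: int, steps: int) -> int:
    # Even if the prefix's keys/values are cached, every denoising step
    # recomputes queries for all suffix tokens against the full sequence.
    return steps * suffix_len * (prefix_len + suffix_len)

# Generating a 200-token paragraph after a 2000-token prefix:
print(autoregressive_scores(2000, 200))        # 420100   (~0.4M attention scores)
print(suffix_denoising_scores(2000, 200, 50))  # 22000000 (~22M with 50 denoising steps)
```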






