(comments)
Original link: https://news.ycombinator.com/item?id=43969442
The Hacker News discussion revolves around "TransMLA: Multi-head latent attention is all you need," a research paper proposing a more efficient attention mechanism for transformers. TransMLA aims to reduce memory requirements, specifically the KV cache size, by using latent representations.
Commenters highlight potential benefits such as enabling local inference and the possibility of converting existing models (like LLaMA and Qwen) with fine-tuning. Some explain the technical aspects, such as the low-rank approximation involved and MLA's improved expressiveness compared to Grouped-Query Attention (GQA): while GQA saves more memory, MLA offers greater flexibility.
There's also meta-discussion about the paper's title, referencing "Attention Is All You Need," with some finding this trope overused. Another user links an explanatory video about MHA, MQA, GQA, and MLA. The general sentiment seems positive, with interest in seeing models implemented using this technique.
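As a rough illustration of the low-rank idea mentioned in the summary (a sketch only; the shapes, the rank r, and the use of SVD here are illustrative assumptions, not TransMLA's actual conversion procedure), a full key/value projection matrix can be approximated by the product of two thin rank-r matrices, which is what makes a small per-token latent cache possible:

    import numpy as np

    # Hypothetical shapes, chosen only to keep the example fast; real models differ.
    d_model = 1024          # model hidden size
    n_h, d_h = 16, 64       # attention heads and per-head dimension
    r = 128                 # latent rank, r << n_h * d_h

    rng = np.random.default_rng(0)
    W_kv = rng.standard_normal((d_model, n_h * d_h))   # stand-in for a full KV projection

    # Truncated SVD gives the best rank-r factorization W_kv ~= A @ B.
    U, S, Vt = np.linalg.svd(W_kv, full_matrices=False)
    A = U[:, :r] * S[:r]    # (d_model, r): "down" projection into the latent space
    B = Vt[:r, :]           # (r, n_h * d_h): "up" projection back to per-head K/V

    err = np.linalg.norm(W_kv - A @ B) / np.linalg.norm(W_kv)
    print(f"relative rank-{r} approximation error: {err:.3f}")
    # A random matrix has a flat spectrum, so this error is large; trained weight
    # matrices typically have faster-decaying spectra, which is what low-rank methods exploit.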
[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: a matrix of size T × r, where r is much smaller than n_h · d_h.
[background] In traditional multi-head attention you must cache the full key and value matrices, each of size T × (n_h · d_h), where T is the token length, n_h is the number of attention heads, and d_h is the dimensionality of each individual head.
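To make that comparison concrete, here is some back-of-the-envelope cache-size arithmetic; the context length, head count, latent dimension, layer count, and fp16 storage below are illustrative assumptions, not numbers from the paper:

    # Illustrative KV-cache sizing; every number here is an assumption for the example.
    T = 32_768             # tokens in the context window
    n_h, d_h = 32, 128     # attention heads and per-head dimension
    r = 512                # latent dimension, r << n_h * d_h
    n_layers = 32
    bytes_per_elem = 2     # fp16

    # Traditional MHA: cache K and V, each T x (n_h * d_h), for every layer.
    mha_bytes = 2 * T * n_h * d_h * bytes_per_elem * n_layers

    # Latent-attention style: cache a single T x r latent matrix per layer.
    mla_bytes = T * r * bytes_per_elem * n_layers

    print(f"MHA KV cache: {mha_bytes / 2**30:.1f} GiB")   # 16.0 GiB with these numbers
    print(f"latent cache: {mla_bytes / 2**30:.1f} GiB")   # 1.0 GiB
    print(f"reduction:    {mha_bytes / mla_bytes:.0f}x")  # 16x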
sounds like a big win for memory-constrained environments like local inference