(comments)
Original link: https://news.ycombinator.com/item?id=43969442
The Hacker News discussion revolves around "TransMLA: Multi-head latent attention is all you need," a research paper proposing a more efficient attention mechanism for transformers. TransMLA aims to reduce memory requirements, specifically the KV cache size, by using latent representations.
Commenters highlight potential benefits such as enabling local inference and the possibility of converting existing models (like LLaMA and Qwen) with fine-tuning. Some explain the technical aspects, such as the low-rank approximation involved and MLA's improved expressiveness compared to Grouped-Query Attention (GQA): while GQA saves more memory, MLA offers greater flexibility.
There's also meta-discussion about the paper's title, referencing "Attention Is All You Need," with some finding this trope overused. Another user links an explanatory video about MHA, MQA, GQA, and MLA. The general sentiment seems positive, with interest in seeing models implemented using this technique.
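As a rough illustration of the low-rank idea mentioned in the summary (a sketch only; the shapes, the rank r, and the use of SVD here are illustrative assumptions, not TransMLA's actual conversion procedure), a full key/value projection matrix can be approximated by the product of two thin rank-r matrices, which is what makes a small per-token latent cache possible:

    import numpy as np

    # Hypothetical shapes, chosen only to keep the example fast; real models differ.
    d_model = 1024          # model hidden size
    n_h, d_h = 16, 64       # attention heads and per-head dimension
    r = 128                 # latent rank, r << n_h * d_h

    rng = np.random.default_rng(0)
    W_kv = rng.standard_normal((d_model, n_h * d_h))   # stand-in for a full KV projection

    # Truncated SVD gives the best rank-r factorization W_kv ~= A @ B.
    U, S, Vt = np.linalg.svd(W_kv, full_matrices=False)
    A = U[:, :r] * S[:r]    # (d_model, r): "down" projection into the latent space
    B = Vt[:r, :]           # (r, n_h * d_h): "up" projection back to per-head K/V

    err = np.linalg.norm(W_kv - A @ B) / np.linalg.norm(W_kv)
    print(f"relative rank-{r} approximation error: {err:.3f}")
    # A random matrix has a flat spectrum, so this error is large; trained weight
    # matrices typically have faster-decaying spectra, which is what low-rank methods exploit.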
[3.3] For saving the KV cache, only the intermediate latent representations need to be stored: a matrix of size T × r, where r is much smaller than n_h · d_h.
[background] In traditional multi-head attention you must cache the full key and value matrices, each of size T × (n_h · d_h), where T is the token length, n_h is the number of attention heads, and d_h is the dimensionality of each individual head.
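To make that comparison concrete, here is some back-of-the-envelope cache-size arithmetic; the context length, head count, latent dimension, layer count, and fp16 storage below are illustrative assumptions, not numbers from the paper:

    # Illustrative KV-cache sizing; every number here is an assumption for the example.
    T = 32_768             # tokens in the context window
    n_h, d_h = 32, 128     # attention heads and per-head dimension
    r = 512                # latent dimension, r << n_h * d_h
    n_layers = 32
    bytes_per_elem = 2     # fp16

    # Traditional MHA: cache K and V, each T x (n_h * d_h), for every layer.
    mha_bytes = 2 * T * n_h * d_h * bytes_per_elem * n_layers

    # Latent-attention style: cache a single T x r latent matrix per layer.
    mla_bytes = T * r * bytes_per_elem * n_layers

    print(f"MHA KV cache: {mha_bytes / 2**30:.1f} GiB")   # 16.0 GiB with these numbers
    print(f"latent cache: {mla_bytes / 2**30:.1f} GiB")   # 1.0 GiB
    print(f"reduction:    {mha_bytes / mla_bytes:.0f}x")  # 16x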
sounds like a big win for memory-constrained environments like local inference