TransMLA: Multi-Head Latent Attention Is All You Need

Original link: https://arxiv.org/abs/2502.07864

This paper introduces TransMLA, a post-training method for converting Group Query Attention (GQA)-based large language models (LLMs), such as LLaMA, Qwen, and Mixtral, into models that use Multi-head Latent Attention (MLA). MLA's key advantage is that it compresses the key-value (KV) cache with low-rank matrices, easing the communication bottleneck and yielding faster inference than traditional multi-head attention or GQA. Although MLA has proven effective in the Deepseek models, it has not yet seen broad adoption. The authors show that GQA can always be represented by MLA with equivalent KV cache overhead, whereas the converse does not hold. TransMLA bridges this gap, letting existing GQA-based models benefit from MLA's efficiency. After conversion, the model can be trained further to increase expressiveness without growing the KV cache. Planned future work includes MLA-specific inference optimizations that preserve low latency and enable efficient distillation from advanced models such as Deepseek R1. Ultimately, TransMLA aims to encourage wider use of MLA and to improve LLM performance and resource utilization.
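To make the GQA-to-MLA equivalence concrete, here is a minimal numpy sketch (not the authors' implementation; all dimensions and variable names are illustrative). It builds a GQA key projection, then reproduces exactly the same per-head keys by caching only the low-rank latent and applying a block-structured up-projection, so the cached width per token is unchanged.

    import numpy as np

    # Illustrative model dimensions (not from any specific checkpoint).
    d_model, n_q_heads, n_kv_heads, head_dim = 512, 8, 2, 64
    group = n_q_heads // n_kv_heads          # query heads sharing one KV head

    rng = np.random.default_rng(0)
    x = rng.standard_normal((3, d_model))    # hidden states for 3 tokens

    # GQA: one small key projection, replicated across each query-head group.
    W_k = rng.standard_normal((d_model, n_kv_heads * head_dim))
    k_small = x @ W_k                                          # what GQA caches
    k_gqa = np.repeat(k_small.reshape(3, n_kv_heads, head_dim), group, axis=1)

    # MLA view of the same weights: cache the low-rank latent (identical to
    # k_small), then recover per-query-head keys with a block up-projection
    # that copies each KV head to its group of query heads.
    W_down = W_k                                               # down-projection
    W_up = np.kron(np.eye(n_kv_heads),                         # block-diagonal
                   np.kron(np.ones((1, group)), np.eye(head_dim)))  # copy pattern
    latent = x @ W_down                                        # cached latent
    k_mla = (latent @ W_up).reshape(3, n_q_heads, head_dim)

    assert np.allclose(k_gqa, k_mla)
    print("identical keys; cached width per token:", latent.shape[-1])

Per the abstract, the converted model can then be trained further to boost expressiveness without increasing the KV cache; in this sketch that would correspond to letting the fixed copy-pattern up-projection become a general learned matrix of the same shape.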


Original text

TransMLA: Multi-Head Latent Attention Is All You Need, by Fanxu Meng and 2 other authors
Abstract: Modern large language models (LLMs) often encounter communication bottlenecks on current hardware, rather than purely computational constraints. Multi-head Latent Attention (MLA) tackles this challenge by using low-rank matrices in the key-value (KV) layers, thereby allowing compressed latent KV states to be cached. This approach significantly reduces the KV cache size relative to traditional multi-head attention, leading to faster inference. Moreover, MLA employs an up-projection matrix to increase expressiveness, trading additional computation for reduced communication overhead. Although MLA has demonstrated efficiency and effectiveness in Deepseek V2/V3/R1, many major model providers still rely on Group Query Attention (GQA) and have not announced any plans to adopt MLA. In this paper, we show that GQA can always be represented by MLA while maintaining the same KV cache overhead, but the converse does not hold. To encourage broader use of MLA, we introduce TransMLA, a post-training method that converts widely used GQA-based pre-trained models (e.g., LLaMA, Qwen, Mixtral) into MLA-based models. After conversion, the model can undergo additional training to boost expressiveness without increasing the KV cache size. Furthermore, we plan to develop MLA-specific inference acceleration techniques to preserve low latency in transformed models, thus enabling more efficient distillation of Deepseek R1.
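To make the cache-size comparison in the abstract concrete, here is a back-of-the-envelope sketch with invented dimensions (none of the numbers below come from the paper or any released model). It only illustrates that the per-token footprint tracks the cached width, and that the equal-overhead MLA form of a GQA model stores no more than GQA does.

    # Illustrative KV-cache arithmetic with made-up example dimensions.
    n_layers, n_q_heads, n_kv_heads, head_dim = 32, 32, 8, 128
    bytes_per_value = 2                                  # fp16

    mha_cache = n_layers * 2 * n_q_heads * head_dim * bytes_per_value   # full K and V
    gqa_cache = n_layers * 2 * n_kv_heads * head_dim * bytes_per_value  # shared KV heads
    # Equal-overhead MLA view of the same GQA model: cache K/V latents whose
    # total width matches GQA's K+V width, up-projected per query head on the fly.
    mla_cache = n_layers * 2 * n_kv_heads * head_dim * bytes_per_value

    for name, size in [("MHA", mha_cache), ("GQA", gqa_cache),
                       ("MLA (equal overhead)", mla_cache)]:
        print(f"{name:20s} {size / 1024:6.0f} KiB per token")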
From: Meng Fanxu
[v1] Tue, 11 Feb 2025 18:20:18 UTC (326 KB)
[v2] Thu, 13 Feb 2025 18:07:04 UTC (327 KB)