Kimi Linear: An Expressive, Efficient Attention Architecture

Original link: https://github.com/MoonshotAI/Kimi-Linear

## Kimi Linear: Efficient Long-Context Attention

Kimi Linear is a novel hybrid linear attention architecture designed for superior performance and efficiency, especially when processing long sequences. It leverages **Kimi Delta Attention (KDA)**, a refined gated delta rule, to optimize memory usage and speed.

Tests show that on shorter contexts (MMLU-Pro), Kimi Linear matches the speed of full attention, while on very long contexts (RULER, 128k tokens) it delivers a **3.98x** speedup. Compared with traditional MLA, it processes 1M tokens **6.3x** faster.

Key advantages include **reducing KV cache size by up to 75%** and boosting decoding throughput by up to **6x**. The model was trained on 5.7T tokens and outperforms full attention across a range of benchmarks, including reinforcement learning tasks.

Kimi Linear is available through Hugging Face Transformers and can be deployed with vLLM to serve an OpenAI-compatible API. The KDA kernels are open-sourced in FLA.

## Kimi Linear: A New Approach to AI Efficiency

Moonshot AI has released Kimi Linear, a new language-model architecture aimed at processing long documents and conversations more efficiently. Its core innovation is a "hybrid linear attention" approach: a faster "linear attention" shortcut handles most of the processing, while some traditional "full attention" is retained for accuracy.

This cuts the required memory (the "KV cache") by 75%, enabling a large 1M-token context window. Benchmarks show Kimi Linear matches the quality of existing models (51.0 on MMLU-Pro) while generating responses up to 6x faster, with markedly better speed on long inputs (84.3 on RULER at 128k tokens).

The model has 48 billion parameters but activates only 3 billion at a time, further improving efficiency. Discussion has centered on whether this approach trades accuracy for speed, and on the practicality of running such models, which requires at least 48GB of VRAM or cloud compute. The release has sparked debate about the future of AI development, data-privacy concerns around Chinese models, and the balance between performance and efficiency.

## Original Article

Figure: (a) On MMLU-Pro (4k context length), Kimi Linear achieves 51.0 performance with similar speed as full attention. On RULER (128k context length), it shows Pareto-optimal performance (84.3) and a 3.98x speedup. (b) Kimi Linear achieves 6.3x faster TPOT compared to MLA, offering significant speedups at long sequence lengths (1M tokens).

Kimi Linear is a hybrid linear attention architecture that outperforms traditional full attention methods across various contexts, including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core is Kimi Delta Attention (KDA), a refined version of Gated DeltaNet that introduces a more efficient gating mechanism to optimize the use of finite-state RNN memory.
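
To make the idea concrete, here is a minimal, didactic sketch of one recurrent step of a gated delta rule with per-channel (fine-grained) gating, the mechanism family KDA refines. It is a toy illustration of the update's shape, not Kimi Linear's actual kernel, and the way `alpha` and `beta` would be produced from the input is assumed:

import torch

def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule over a matrix-valued RNN state.

    S:     (d_k, d_v) associative memory state
    q, k:  (d_k,) query / key vectors
    v:     (d_v,) value vector
    alpha: (d_k,) per-channel forget gate in (0, 1) -- "fine-grained" gating
    beta:  scalar write strength in (0, 1)
    """
    S = alpha[:, None] * S                       # decay each key channel of the memory
    recalled = S.T @ k                           # what the memory currently returns for k
    S = S + beta * torch.outer(k, v - recalled)  # delta rule: correct toward the new value
    return S, S.T @ q                            # updated state and read-out for the query

# Toy usage: process a short sequence token by token with random inputs.
d_k, d_v, T = 8, 8, 5
S = torch.zeros(d_k, d_v)
for _ in range(T):
    q, k, v = torch.randn(d_k), torch.randn(d_k), torch.randn(d_v)
    alpha, beta = torch.rand(d_k), torch.rand(()).item()
    S, o = gated_delta_step(S, q, k, v, alpha, beta)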

Kimi Linear achieves superior performance and hardware efficiency, especially for long-context tasks. It reduces the need for large KV caches by up to 75% and boosts decoding throughput by up to $6\times$ for contexts as long as 1M tokens.
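
The 75% figure follows from the layer ratio described below: with three KDA layers for every global MLA layer, only $1/(3+1) = 25\%$ of layers keep a growing KV cache, i.e. a $75\%$ reduction relative to an all-MLA stack (assuming roughly equal per-layer cache sizes and neglecting KDA's constant-size recurrent state).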

We open-sourced the KDA kernel in FLA and released two versions of model checkpoints trained on 5.7T tokens.

  • Kimi Delta Attention (KDA): A linear attention mechanism that refines the gated delta rule with fine-grained gating.
  • Hybrid Architecture: A 3:1 KDA-to-global MLA ratio reduces memory usage while maintaining or surpassing the quality of full attention (see the layout sketch after this list).
  • Superior Performance: Outperforms full attention on a variety of tasks, including long-context and RL-style benchmarks, in fair comparisons over 1.4T-token training runs.
  • High Throughput: Achieves up to $6\times$ faster decoding and significantly reduces time per output token (TPOT).
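
As a rough illustration of the 3:1 interleaving (the function and layer names here are illustrative, not the repository's actual module names):

def hybrid_layer_types(n_layers: int, kda_per_mla: int = 3) -> list[str]:
    # Every (kda_per_mla + 1)-th layer is a global MLA layer; the rest are KDA.
    period = kda_per_mla + 1
    return ["MLA" if (i + 1) % period == 0 else "KDA" for i in range(n_layers)]

print(hybrid_layer_types(8))
# ['KDA', 'KDA', 'KDA', 'MLA', 'KDA', 'KDA', 'KDA', 'MLA']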

## Inference with Hugging Face Transformers

To use the Kimi Linear model, we recommend the following:

  • Language: python >= 3.10
  • Package: torch >= 2.6
  • Package: fla-core >= 0.4.0
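
One way to satisfy these requirements in a pip-based environment (this command is a sketch, not from the repository):

pip install "torch>=2.6" "fla-core>=0.4.0" transformers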

Example Code:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "moonshotai/Kimi-Linear-48B-A3B-Instruct"
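# Load the model; trust_remote_code runs the repo's custom Kimi Linear modeling code.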
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

messages = [
    {"role": "system", "content": "You are a helpful assistant provided by Moonshot-AI."},
    {"role": "user", "content": "Is 123 a prime?"}
]
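# Render the conversation with the model's chat template and move it to the model's device.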
input_ids = tokenizer.apply_chat_template(
    messages, 
    add_generation_prompt=True, 
    return_tensors="pt"
).to(model.device)
generated_ids = model.generate(inputs=input_ids, max_new_tokens=500)
response = tokenizer.batch_decode(generated_ids)[0]
print(response)
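
Note that batch_decode above returns the prompt together with the reply. If you only want the newly generated text, a small variation (not from the original example) is to slice off the prompt tokens:

response = tokenizer.batch_decode(
    generated_ids[:, input_ids.shape[1]:], skip_special_tokens=True
)[0]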

For deployment, you can use the latest vLLM to create an OpenAI-compatible API endpoint.

vllm serve moonshotai/Kimi-Linear-48B-A3B-Instruct \
  --port 8000 \
  --tensor-parallel-size 4 \
  --max-model-len 1048576 \
  --trust-remote-code
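
Once the server is running, any OpenAI-compatible client can talk to it. A minimal sketch using the official openai Python package (the api_key value is a placeholder; vLLM does not verify it by default):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="moonshotai/Kimi-Linear-48B-A3B-Instruct",
    messages=[{"role": "user", "content": "Is 123 a prime?"}],
)
print(completion.choices[0].message.content)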

If you find our work useful, please cite:

@article{kimi2025kda,
  title  = {Kimi Linear: An Expressive, Efficient Attention Architecture},
  author = {Kimi Team},
  year   = {2025},
  url    = {https://github.com/MoonshotAI/Kimi-Linear/blob/master/tech_report.pdf}
}