LFM2-24B-A2B: Scaling Up the LFM2 Architecture

Original link: https://www.liquid.ai/blog/lfm2-24b-a2b

## LFM2-24B-A2B: A Scalable and Accessible Large Language Model

Liquid AI has released LFM2-24B-A2B, its largest model to date, extending the LFM2 family from 350 million to 24 billion parameters. This sparse Mixture of Experts (MoE) model activates only about 2 billion parameters per token, enabling efficient scaling and performance comparable to larger dense models.

LFM2-24B-A2B is designed for broad accessibility: it fits in 32GB of RAM, allowing deployment on consumer laptops and desktops (including devices with integrated GPUs and NPUs) as well as in cloud environments. It is now available as open weights on Hugging Face, with documentation and a Playground for easy testing and fine-tuning.

The model's architecture combines gated short convolutions with grouped query attention for fast processing and low memory use. Scaling it up involved increasing model depth and expert count while keeping the active parameter count lean (about 2.3 billion). Benchmarks show log-linear quality gains as the model scales, with throughput competitive with Qwen3 and gpt-oss, reaching roughly 26.8K tokens/s with vLLM on a single H100 GPU.

Further improvements will come from continued pre-training, with an LFM2.5-24B-A2B release planned. With the LFM2 family having surpassed 10 million downloads, Liquid AI encourages users to explore and contribute.


Original Article

Today, we release an early checkpoint of LFM2-24B-A2B, our largest LFM2 model. This sparse Mixture of Experts (MoE) model has 24 billion total parameters with 2 billion active per token, showing that the LFM2 architecture scales effectively to larger sizes.

With this release, the LFM2 family spans nearly two orders of magnitude: from LFM2-350M to LFM2-24B-A2B. Each step up in scale has brought consistent quality gains on standard benchmarks. We designed LFM2-24B-A2B to fit in 32GB of RAM, making it deployable across cloud and edge environments, including consumer laptops and desktops with integrated GPUs (iGPU) and dedicated NPUs.

LFM2-24B-A2B is open-weight and available now on Hugging Face. Check out our docs on how to run or fine-tune it locally, or simply test it on our Playground.

Scaling Up LFM2 MoE

LFM2 is a hybrid architecture that pairs efficient gated short convolution blocks with a small number of grouped query attention (GQA) blocks. This design, developed through hardware-in-the-loop architecture search, gives LFM2 models fast prefill and decode at low memory cost. LFM2-24B-A2B applies this backbone in a Mixture of Experts configuration: with 24B total parameters but only 2.3B active per forward pass, it punches far above the cost of a 2B dense model at inference time.

We use a similar recipe to LFM2-8B-A1B. The model keeps the same hidden dimension (2048) and attention configuration as LFM2-8B-A1B, but scales along two axes: depth and expert count. It goes from 24 layers to 40, and from 32 experts to 64 experts per MoE block, while keeping top-4 routing. To stay within a 2B active parameter budget, each expert is slightly narrower (intermediate size 1536 vs. 1792 in the 8B). The first two layers remain dense for training stability, and the attention-to-convolution ratio holds at roughly 1:3 (10 attention layers out of 40), preserving the fast prefill and low memory characteristics of the LFM2 backbone.
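As a back-of-the-envelope check of the configuration described above (hidden size 2048, 40 layers with the first two dense, 64 experts per MoE block with intermediate size 1536, top-4 routing), the sketch below estimates the expert parameter counts these numbers imply. The gated three-matrix (SwiGLU-style) expert layout is an assumption on our part, and embedding, attention, and convolution parameters are not counted.

```python
# Rough expert-parameter estimate for LFM2-24B-A2B from the numbers in the post.
# Assumes a gated (SwiGLU-style) expert with three hidden x intermediate weight
# matrices. Attention, convolution, and embedding parameters are excluded.

hidden = 2048          # hidden dimension (same as LFM2-8B-A1B)
intermediate = 1536    # per-expert intermediate size
n_layers = 40          # total depth
dense_layers = 2       # first two layers stay dense for training stability
n_experts = 64         # experts per MoE block
top_k = 4              # experts activated per token

per_expert = 3 * hidden * intermediate        # ~9.4M params per expert
moe_layers = n_layers - dense_layers          # 38 MoE layers
total_expert = per_expert * n_experts * moe_layers
active_expert = per_expert * top_k * moe_layers

print(f"total expert params:  {total_expert / 1e9:.1f}B")   # ~23.0B
print(f"active expert params: {active_expert / 1e9:.2f}B")  # ~1.43B
```

Under these assumptions, expert weights alone land near 23B of the 24B total, while the active expert path is only ~1.4B; the rest of the ~2.3B active budget would come from the attention, convolution, and embedding weights this sketch omits.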

The scaling recipe is: go deeper, add more experts, keep each expert and the active path lean. More layers let the model build richer representations across both convolution and GQA blocks, while doubling the expert count enables finer-grained routing and more room for specialization. Crucially, none of these changes inflate the per-token compute path; the active parameter count grows only ~1.5x (1.5B → 2.3B) against a 3x increase in total parameters (8.3B → 24B). By concentrating capacity in total parameters rather than active parameters, the model stays edge-friendly: inference latency and energy consumption track the small active path, making it deployable on a range of laptops and desktops.
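The "many experts, lean active path" idea can be illustrated with a toy top-k router. This is a generic sparse-MoE routing sketch, not Liquid's implementation: a linear gate scores all 64 experts for each token, only the top 4 actually run, and their outputs are mixed by renormalized gate weights.

```python
import numpy as np

rng = np.random.default_rng(0)

n_experts, top_k, hidden = 64, 4, 2048
x = rng.standard_normal(hidden)                  # one token's hidden state
w_gate = rng.standard_normal((n_experts, hidden)) * 0.01

# Score every expert, but select only the top-k to execute.
logits = w_gate @ x
top = np.argsort(logits)[-top_k:]                # indices of the 4 best experts
weights = np.exp(logits[top] - logits[top].max())
weights /= weights.sum()                         # renormalized softmax over top-k

def expert(i, x):
    # Stand-in for a gated FFN expert; each expert is a different function.
    return np.tanh(x) * (i + 1) / n_experts

# Only top_k expert FFNs run; the other 60 are skipped entirely,
# so per-token compute tracks the active path, not the total parameter count.
y = sum(w * expert(i, x) for w, i in zip(weights, top))
print(f"ran {top_k}/{n_experts} experts; output shape {y.shape}")
```

Doubling the expert count grows total capacity (and routing granularity) without touching this per-token compute path, which is why latency and energy track the small active path.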

Benchmarks

We took a lightweight post-training approach to ship LFM2-24B-A2B as a traditional instruct model without reasoning traces. We chose this route because it was faster to post-train an instruct version, and instruct models tend to be more popular than thinking variants.

Below we show average benchmark scores across the LFM2 family, from the 350M dense model up to the 24B MoE.

Across benchmarks including GPQA Diamond, MMLU-Pro, IFEval, IFBench, GSM8K, and MATH-500, quality improves log-linearly as we scale from 350M to 24B total parameters. This near-100x parameter range confirms that the LFM2 hybrid architecture follows predictable scaling behavior and does not hit a ceiling at small model sizes.
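"Log-linear" here means average quality rises by a roughly constant amount per 10x increase in total parameters. The sketch below fits that relationship; the scores are placeholder values invented for illustration, not the real benchmark numbers, and the family sizes are likewise assumptions.

```python
import numpy as np

# Hypothetical average scores (NOT the real benchmark results) to illustrate
# log-linear scaling: quality ~ a + b * log10(total parameters).
params = np.array([0.35e9, 0.7e9, 1.2e9, 2.6e9, 8.3e9, 24e9])
scores = np.array([30.0, 35.0, 38.0, 42.0, 48.0, 53.0])  # placeholder values

b, a = np.polyfit(np.log10(params), scores, deg=1)       # slope, intercept
pred = a + b * np.log10(params)
r2 = 1 - np.sum((scores - pred) ** 2) / np.sum((scores - scores.mean()) ** 2)
print(f"slope: {b:.1f} points per 10x params, R^2 = {r2:.3f}")
```

A high R^2 on such a fit across a near-100x parameter range is the signature of predictable scaling behavior.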

Fast Everywhere Inference

LFM2-24B-A2B has day-zero support for inference through llama.cpp, vLLM, and SGLang. You can run it on CPU or GPU out of the box, with multiple quantization options (Q4_0, Q4_K_M, Q5_K_M, Q6_K, Q8_0, F16) available in GGUF format for llama.cpp.
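As a rough illustration of why a 24B-parameter model fits in 32GB of RAM once quantized, the sketch below estimates weight-file sizes for the listed GGUF formats. The bits-per-weight figures are ballpark values for llama.cpp quantization schemes, not measured sizes for this model, and KV-cache and activation memory are not included.

```python
# Approximate weight memory for 24B parameters at common GGUF quantizations.
# Bits-per-weight values are rough ballparks, not exact measurements.
total_params = 24e9
bits_per_weight = {
    "Q4_0": 4.5, "Q4_K_M": 4.8, "Q5_K_M": 5.7,
    "Q6_K": 6.6, "Q8_0": 8.5, "F16": 16.0,
}

for name, bpw in bits_per_weight.items():
    gib = total_params * bpw / 8 / 2**30
    fits = "fits in" if gib < 32 else "exceeds"
    print(f"{name:7s} ~{gib:5.1f} GiB ({fits} 32GB RAM)")
```

Under these assumptions, everything up to Q8_0 fits comfortably in 32GB, while unquantized F16 weights alone would exceed it.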

We compared LFM2-24B-A2B against two popular MoE models of similar size: Qwen3-30B-A3B-Instruct-2507 (30.5B total, 3.3B active parameters) and gpt-oss-20b (21B total, 3.6B active parameters). We measured both prefill and decode throughput with Q4_K_M versions of these models using llama.cpp on an AMD Ryzen AI Max+ 395.

Decode throughput (in tokens/s) when generating 100 tokens across different context sizes (in tokens):

Prefill throughput (in tokens/s) across different context sizes (in tokens):

We also report throughput (total tokens / wall time) achieved with vLLM on a single H100 SXM5 GPU. High-throughput serving is critical for both cost-efficient deployment and rollout generation during RLVR workloads. Our measurements use a realistic interleaved prefill-and-decode setup representative of production-scale serving and RL workloads.

We benchmarked LFM2-24B-A2B against gpt-oss-20b and Qwen3-30B-A3B-Instruct-2507. On a single H100 SXM5 with vLLM, LFM2-24B-A2B reached approximately 26.8K total tokens per second at 1,024 concurrent requests (1,024 max input tokens / 512 max output tokens), surpassing both comparably sized MoE models under continuous batching and demonstrating the favorable throughput scaling of the LFM2 architecture.
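The throughput figure quoted above is total tokens divided by wall-clock time. A minimal sketch of that metric under the stated request shape (1,024 concurrent requests, 1,024 max input / 512 max output tokens); the wall-time value is hypothetical, chosen only to reproduce the reported ~26.8K tokens/s:

```python
# Throughput as reported: total tokens processed / wall-clock time.
# Request shape from the post; the wall time here is a hypothetical value.
n_requests = 1024
input_tokens, output_tokens = 1024, 512      # max per request

total_tokens = n_requests * (input_tokens + output_tokens)
wall_time_s = 58.7                           # hypothetical measurement

throughput = total_tokens / wall_time_s
print(f"{throughput / 1e3:.1f}K tokens/s")
```

Note that under continuous batching both prefill (input) and decode (output) tokens count toward the total, which is why the metric rewards fast prefill as much as fast decode.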

In addition, we are working with hardware partners to bring NPU support for LFM2 models on mobile devices and edge hardware. The MoE design with only 2B active parameters per token makes this model a strong candidate for on-device deployment, even at 24B total parameters.

What's Next

LFM2-24B-A2B has been trained on 17T tokens so far, and pre-training is still running. When pre-training completes, expect an LFM2.5-24B-A2B with additional post-training and reinforcement learning.

In the meantime, download the weights, run it on your laptop or in the cloud, and let us know what you think!

The LFM2 family has crossed over 10 million downloads on Hugging Face! Join the action and download our open weights today to start building. 

Citation

Please cite this article as:

Liquid AI, "LFM2-24B-A2B: Scaling Up the LFM2 Architecture", Liquid AI Blog, Feb 2026.

Or use the BibTeX citation:

@article{liquidAI202624B,
  author = {Liquid AI},
  title = {LFM2-24B-A2B: Scaling Up the LFM2 Architecture},
  journal = {Liquid AI Blog},
  year = {2026},
  note = {https://www.liquid.ai/blog/lfm2-24b-a2b},
}