LLM-D: Kubernetes-Native Distributed Inference at Scale

Original link: https://github.com/llm-d/llm-d

In May 2025, CoreWeave, Google, IBM Research, NVIDIA, and Red Hat jointly launched llm-d, a Kubernetes-native distributed inference serving stack designed to simplify deploying large language models (LLMs) at scale. Built on industry standards such as vLLM, Kubernetes, and the Inference Gateway (IGW), llm-d is a community-driven, Apache-2-licensed project that delivers optimized price/performance across a range of hardware accelerators. Its key features include a vLLM-optimized inference scheduler that uses IGW's Endpoint Picker Protocol (EPP) for smart load balancing; disaggregated serving, which runs prefill and decode on independent vLLM instances; and disaggregated prefix caching via vLLM's KVConnector, supporting both independent and shared caching schemes. Planned features include hardware- and workload-aware autoscaling. llm-d can be deployed as a complete solution or through individual components. The project provides Helm charts for Kubernetes deployment and encourages community participation through GitHub, Slack, weekly standups, and a Google Group. Its modular design and open development model allow customization and integration with existing infrastructure.


Original text

llm-d Logo

Kubernetes-Native Distributed Inference at Scale

Documentation | License | Join Slack

Latest News 🔥

  • [2025-05] CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community. Check out our blog post and press release.

llm-d is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).

Built by leaders in the Kubernetes and vLLM projects, llm-d is a community-driven, Apache-2 licensed project with an open development model.

llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.

llm-d Arch

Key features of llm-d include:

  • vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern for customizable “smart” load balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements filtering and scoring algorithms that make decisions with P/D-, KV-cache-, SLA-, and load-awareness (a conceptual sketch of this kind of scoring follows this list). Advanced teams can implement their own scorers to customize further, while benefiting from other IGW features like flow control and latency-aware balancing. See our Northstar design

  • Disaggregated Serving with vLLM: llm-d leverages vLLM’s support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL. In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using data-center networking. See our Northstar design

  • Disaggregated Prefix Caching with vLLM: llm-d uses vLLM's KVConnector to provide a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache. We plan to support two KV caching schemes. See our Northstar design

    • Independent (N/S) caching with offloading to local memory and disk, providing a zero operational cost mechanism for offloading.
    • Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  • Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency (a toy version of this sizing arithmetic appears after this list). See our Northstar design
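
To make the scheduler's filter-and-score idea concrete, here is a minimal, purely illustrative Python sketch of KV-cache- and load-aware endpoint selection. The Endpoint fields, weights, and function names are hypothetical; they are not the IGW Endpoint Picker Protocol or the llm-d-inference-scheduler API.

```python
# Purely illustrative: the Endpoint fields, weights, and functions below are
# hypothetical and are NOT the IGW Endpoint Picker Protocol or the
# llm-d-inference-scheduler API.
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    queue_depth: int           # outstanding requests, from operational telemetry
    kv_cache_hit_ratio: float  # fraction of the prompt prefix already cached here
    is_decode_instance: bool   # True if this replica serves the decode phase


def score(endpoint: Endpoint, wants_decode: bool) -> float:
    """Higher is better: prefer prefix-cache hits, penalize deep queues,
    and filter out endpoints serving the wrong phase (P/D-awareness)."""
    if endpoint.is_decode_instance != wants_decode:
        return float("-inf")  # fails the filter stage
    return 2.0 * endpoint.kv_cache_hit_ratio - 0.1 * endpoint.queue_depth


def pick(endpoints: list[Endpoint], wants_decode: bool) -> Endpoint:
    return max(endpoints, key=lambda e: score(e, wants_decode))


if __name__ == "__main__":
    pool = [
        Endpoint("decode-0", queue_depth=4, kv_cache_hit_ratio=0.9, is_decode_instance=True),
        Endpoint("decode-1", queue_depth=1, kv_cache_hit_ratio=0.2, is_decode_instance=True),
        Endpoint("prefill-0", queue_depth=0, kv_cache_hit_ratio=0.0, is_decode_instance=False),
    ]
    # The warm KV cache on decode-0 outweighs its deeper queue here.
    print(pick(pool, wants_decode=True).name)  # -> decode-0
```

A real scorer combines more signals (SLA class, prefix-cache index lookups, flow control), but the structure the bullet describes is the same: filter, score, pick the best endpoint.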
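
The autoscaling bullet compresses three steps: measure capacity, derive a load function, and assess the traffic mix. The sketch below shows the kind of sizing arithmetic involved under heavily simplified assumptions; the capacities, token counts, and tokens-per-second model are made-up numbers, not the planned llm-d autoscaler.

```python
# Hypothetical sizing arithmetic with made-up numbers; NOT the planned
# llm-d autoscaler, which also weighs QoS/SLO classes per the Northstar design.
import math

# (a) measured capacity of a single instance of each variant, in tokens/sec
PREFILL_TOKENS_PER_SEC = 40_000  # assumed benchmark result
DECODE_TOKENS_PER_SEC = 4_000    # assumed benchmark result

# (c) recent traffic mix: request rate and average request "shape"
qps = 50.0
avg_prompt_tokens = 1_500  # work handled by prefill instances
avg_output_tokens = 200    # work handled by decode instances

# (b) load function: tokens/sec demanded from each variant
prefill_demand = qps * avg_prompt_tokens
decode_demand = qps * avg_output_tokens

prefill_replicas = math.ceil(prefill_demand / PREFILL_TOKENS_PER_SEC)
decode_replicas = math.ceil(decode_demand / DECODE_TOKENS_PER_SEC)

print(f"prefill replicas: {prefill_replicas}, decode replicas: {decode_replicas}")
# -> prefill replicas: 2, decode replicas: 3
```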

For more see the project proposal.

llm-d can be installed as a full solution, with individual features enabled or disabled as needed, or through its individual components for experimentation.

llm-d's deployer can be used to install all main components using a single Helm chart on Kubernetes.

Experimenting and developing with llm-d

llm-d is a metaproject composed of subcomponent repositories that can be cloned individually.

To clone all main components:

repos="llm-d llm-d-deployer llm-d-inference-scheduler llm-d-kv-cache-manager llm-d-routing-sidecar llm-d-model-service llm-d-benchmark llm-d-inference-sim"; for r in $repos; do git clone https://github.com/llm-d/$r.git; done

Tip

As a customization example, see this template for adding a custom scheduler filter.

Visit our GitHub Releases page and review the release notes to stay updated with the latest releases.

  • See our project overview for more details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack
  • We host a weekly standup for contributors on Wednesdays at 12:30pm ET. Please join: Meeting Details
  • We use Google Groups to share architecture diagrams and other content. Please join: Google Group

This project is licensed under Apache License 2.0. See the LICENSE file for details.
