LLM-D: Kubernetes-Native Distributed Inference at Scale

Original link: https://github.com/llm-d/llm-d

In May 2025, CoreWeave, Google, IBM Research, NVIDIA, and Red Hat jointly launched llm-d, a Kubernetes-native distributed inference serving stack designed to simplify deploying large language models (LLMs) at scale. Built on industry standards such as vLLM, Kubernetes, and the Inference Gateway (IGW), llm-d is a community-driven, Apache-2-licensed project that delivers optimized price/performance across a range of hardware accelerators. Its key features include a vLLM-optimized inference scheduler that uses IGW's Endpoint Picker Protocol (EPP) for smart load balancing; disaggregated serving, which runs prefill and decode on independent vLLM instances; and disaggregated prefix caching via vLLM's KVConnector, supporting both independent and shared caching schemes. Planned features include hardware- and workload-aware autoscaling. llm-d can be deployed as a complete solution or through individual components. The project provides Helm charts for Kubernetes deployment and encourages community participation through GitHub, Slack, weekly standups, and a Google Group. Its modular design and open development model allow customization and integration with existing infrastructure.


Original text

llm-d Logo

Kubernetes-Native Distributed Inference at Scale

Documentation | License | Join Slack

Latest News 🔥

  • [2025-05] CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community. Check out our blog post and press release.

llm-d is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.

With llm-d, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).

Built by leaders in the Kubernetes and vLLM projects, llm-d is a community-driven, Apache-2 licensed project with an open development model.

llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.

llm-d Arch

Key features of llm-d include:

  • vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern for customizable “smart” load balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements filtering and scoring algorithms that make decisions with P/D-, KV-cache-, SLA-, and load-awareness (a conceptual sketch of this kind of scoring follows this list). Advanced teams can implement their own scorers to customize further, while benefiting from other IGW features like flow control and latency-aware balancing. See our Northstar design

  • Disaggregated Serving with vLLM: llm-d leverages vLLM’s support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL. In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using data-center networking. See our Northstar design

  • Disaggregated Prefix Caching with vLLM: llm-d uses vLLM's KVConnector to provide a pluggable KV cache hierarchy, including offloading KVs to host, remote storage, and systems like LMCache. We plan to support two KV caching schemes. See our Northstar design

    • Independent (N/S) caching with offloading to local memory and disk, providing a zero operational cost mechanism for offloading.
    • Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, providing potential for higher performance at the cost of a more operationally complex system.
  • Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency (a toy version of this sizing arithmetic appears after this list). See our Northstar design
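
To make the scheduler's filter-and-score idea concrete, here is a minimal, purely illustrative Python sketch of KV-cache- and load-aware endpoint selection. The Endpoint fields, weights, and function names are hypothetical; they are not the IGW Endpoint Picker Protocol or the llm-d-inference-scheduler API.

```python
# Purely illustrative: the Endpoint fields, weights, and functions below are
# hypothetical and are NOT the IGW Endpoint Picker Protocol or the
# llm-d-inference-scheduler API.
from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str
    queue_depth: int           # outstanding requests, from operational telemetry
    kv_cache_hit_ratio: float  # fraction of the prompt prefix already cached here
    is_decode_instance: bool   # True if this replica serves the decode phase


def score(endpoint: Endpoint, wants_decode: bool) -> float:
    """Higher is better: prefer prefix-cache hits, penalize deep queues,
    and filter out endpoints serving the wrong phase (P/D-awareness)."""
    if endpoint.is_decode_instance != wants_decode:
        return float("-inf")  # fails the filter stage
    return 2.0 * endpoint.kv_cache_hit_ratio - 0.1 * endpoint.queue_depth


def pick(endpoints: list[Endpoint], wants_decode: bool) -> Endpoint:
    return max(endpoints, key=lambda e: score(e, wants_decode))


if __name__ == "__main__":
    pool = [
        Endpoint("decode-0", queue_depth=4, kv_cache_hit_ratio=0.9, is_decode_instance=True),
        Endpoint("decode-1", queue_depth=1, kv_cache_hit_ratio=0.2, is_decode_instance=True),
        Endpoint("prefill-0", queue_depth=0, kv_cache_hit_ratio=0.0, is_decode_instance=False),
    ]
    # The warm KV cache on decode-0 outweighs its deeper queue here.
    print(pick(pool, wants_decode=True).name)  # -> decode-0
```

A real scorer combines more signals (SLA class, prefix-cache index lookups, flow control), but the structure the bullet describes is the same: filter, score, pick the best endpoint.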
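
The autoscaling bullet compresses three steps: measure capacity, derive a load function, and assess the traffic mix. The sketch below shows the kind of sizing arithmetic involved under heavily simplified assumptions; the capacities, token counts, and tokens-per-second model are made-up numbers, not the planned llm-d autoscaler.

```python
# Hypothetical sizing arithmetic with made-up numbers; NOT the planned
# llm-d autoscaler, which also weighs QoS/SLO classes per the Northstar design.
import math

# (a) measured capacity of a single instance of each variant, in tokens/sec
PREFILL_TOKENS_PER_SEC = 40_000  # assumed benchmark result
DECODE_TOKENS_PER_SEC = 4_000    # assumed benchmark result

# (c) recent traffic mix: request rate and average request "shape"
qps = 50.0
avg_prompt_tokens = 1_500  # work handled by prefill instances
avg_output_tokens = 200    # work handled by decode instances

# (b) load function: tokens/sec demanded from each variant
prefill_demand = qps * avg_prompt_tokens
decode_demand = qps * avg_output_tokens

prefill_replicas = math.ceil(prefill_demand / PREFILL_TOKENS_PER_SEC)
decode_replicas = math.ceil(decode_demand / DECODE_TOKENS_PER_SEC)

print(f"prefill replicas: {prefill_replicas}, decode replicas: {decode_replicas}")
# -> prefill replicas: 2, decode replicas: 3
```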

For more see the project proposal.

llm-d can be installed as a full solution, with individual features enabled or disabled as needed, or through its individual components for experimentation.

llm-d's deployer can be used to install all main components using a single Helm chart on Kubernetes.

Experimenting and developing with llm-d

llm-d is a metaproject composed of subcomponent repositories that can be cloned individually.

To clone all main components:

repos="llm-d llm-d-deployer llm-d-inference-scheduler llm-d-kv-cache-manager llm-d-routing-sidecar llm-d-model-service llm-d-benchmark llm-d-inference-sim"; for r in $repos; do git clone https://github.com/llm-d/$r.git; done

Tip

As a customization example, see this template for adding a custom scheduler filter.

Visit our GitHub Releases page and review the release notes to stay updated with the latest releases.

  • See our project overview for more details on our development process and governance.
  • We use Slack to discuss development across organizations. Please join: Slack
  • We host a weekly standup for contributors on Wednesdays at 12:30pm ET. Please join: Meeting Details
  • We use Google Groups to share architecture diagrams and other content. Please join: Google Group

This project is licensed under Apache License 2.0. See the LICENSE file for details.
