Latest News 🔥
- [2025-05] CoreWeave, Google, IBM Research, NVIDIA, and Red Hat launched the llm-d community. Check out our blog post and press release.
llm-d is a Kubernetes-native distributed inference serving stack - a well-lit path for anyone to serve large language models at scale, with the fastest time-to-value and competitive performance per dollar for most models across most hardware accelerators.
With llm-d, users can operationalize GenAI deployments with a modular solution that leverages the latest distributed inference optimizations like KV-cache aware routing and disaggregated serving, co-designed and integrated with the Kubernetes operational tooling in Inference Gateway (IGW).
Built by leaders in the Kubernetes and vLLM projects, llm-d is a community-driven, Apache-2 licensed project with an open development model.
llm-d adopts a layered architecture on top of industry-standard open technologies: vLLM, Kubernetes, and Inference Gateway.
Key features of llm-d include:
- vLLM-Optimized Inference Scheduler: llm-d builds on IGW's pattern for customizable “smart” load-balancing via the Endpoint Picker Protocol (EPP) to define vLLM-optimized scheduling. Leveraging operational telemetry, the Inference Scheduler implements filtering and scoring algorithms that make decisions with P/D-, KV-cache-, SLA-, and load-awareness (see the scoring sketch after this list). Advanced teams can implement their own scorers to customize further, while benefiting from other IGW features like flow control and latency-aware balancing. See our Northstar design.
- Disaggregated Serving with vLLM: llm-d leverages vLLM's support for disaggregated serving to run prefill and decode on independent instances, using high-performance transport libraries like NIXL. In llm-d, we plan to support a latency-optimized implementation using fast interconnects (IB, RDMA, ICI) and a throughput-optimized implementation using data-center networking (see the prefill/decode flow sketch after this list). See our Northstar design.
- Disaggregated Prefix Caching with vLLM: llm-d uses vLLM's KVConnector to provide a pluggable KV-cache hierarchy, including offloading KVs to host memory, remote storage, and systems like LMCache. We plan to support two KV caching schemes (see the tiered-lookup sketch after this list). See our Northstar design.
  - Independent (N/S) caching with offloading to local memory and disk, providing a zero-operational-cost mechanism for offloading.
  - Shared (E/W) caching with KV transfer between instances and shared storage with global indexing, providing the potential for higher performance at the cost of a more operationally complex system.
- Variant Autoscaling over Hardware, Workload, and Traffic (🚧): We plan to implement a traffic- and hardware-aware autoscaler that (a) measures the capacity of each model server instance, (b) derives a load function that takes into account different request shapes and QoS, and (c) assesses the recent traffic mix (QPS, QoS, and shapes) to calculate the optimal mix of instances to handle prefill, decode, and latency-tolerant requests, enabling use of HPA for SLO-level efficiency (see the sizing sketch after this list). See our Northstar design.
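To make the scheduler bullet concrete, below is a minimal, illustrative Python sketch of KV-cache- and load-aware filtering and scoring over candidate endpoints. It is not the EPP or llm-d-inference-scheduler API (the real scheduler plugs into IGW's Endpoint Picker Protocol); the `Endpoint` fields, thresholds, and weights are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Endpoint:
    """Hypothetical view of one vLLM replica's operational telemetry."""
    name: str
    role: str                     # "prefill" or "decode" (P/D-awareness)
    queue_depth: int              # pending requests (load-awareness)
    kv_cache_utilization: float   # 0.0 - 1.0
    prefix_cache_hit_blocks: int  # KV blocks already cached for this prompt

def pick_endpoint(endpoints, want_role, max_queue_depth=64):
    """Filter by role and saturation, then score the survivors."""
    candidates = [
        e for e in endpoints
        if e.role == want_role and e.queue_depth < max_queue_depth
    ]
    if not candidates:
        raise RuntimeError("no viable endpoint; shed or queue the request")

    def score(e):
        # Reward prefix-cache reuse, penalize load; weights are illustrative.
        return (
            1.0 * e.prefix_cache_hit_blocks
            - 0.5 * e.queue_depth
            - 2.0 * e.kv_cache_utilization
        )

    return max(candidates, key=score)

# Example: route a decode-phase request to the least-loaded, cache-warm replica.
pool = [
    Endpoint("decode-0", "decode", queue_depth=3, kv_cache_utilization=0.4, prefix_cache_hit_blocks=12),
    Endpoint("decode-1", "decode", queue_depth=9, kv_cache_utilization=0.8, prefix_cache_hit_blocks=0),
]
print(pick_endpoint(pool, "decode").name)  # decode-0
```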
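The disaggregated-serving bullet describes a two-phase request flow. The prefill/decode flow sketch below shows only the shape of that flow (prefill produces KV state that a decode instance consumes); the `KVHandle` and worker classes are hypothetical stand-ins for what vLLM and NIXL provide, not the actual llm-d wire protocol.

```python
from dataclasses import dataclass
from typing import Iterator

@dataclass
class KVHandle:
    """Hypothetical reference to KV-cache state transferable over a fast interconnect."""
    request_id: str
    block_ids: list[int]

class PrefillWorker:
    def prefill(self, request_id: str, prompt: str) -> KVHandle:
        # Runs the prompt forward pass and keeps the KV blocks ready for
        # transfer; the block ids here are fabricated for the demo.
        return KVHandle(request_id, block_ids=list(range(len(prompt) // 16 + 1)))

class DecodeWorker:
    def decode(self, handle: KVHandle, max_tokens: int) -> Iterator[str]:
        # Pulls the KV blocks referenced by the handle, then generates
        # tokens one at a time without re-running prefill.
        for i in range(max_tokens):
            yield f"<token-{i}>"

def serve(prompt: str, prefill: PrefillWorker, decode: DecodeWorker) -> str:
    handle = prefill.prefill("req-1", prompt)     # phase 1: prefill instance
    tokens = decode.decode(handle, max_tokens=4)  # phase 2: decode instance
    return "".join(tokens)

print(serve("Explain disaggregated serving.", PrefillWorker(), DecodeWorker()))
```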
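The two caching schemes differ mainly in where a prefix is looked up. The tiered-lookup sketch below illustrates that decision order (cheap local offload tier first, then a shared tier with global indexing); the classes and method names are hypothetical, not the vLLM KVConnector or LMCache APIs.

```python
class LocalKVStore:
    """Independent (N/S) tier: host memory / local disk on the same node."""
    def __init__(self):
        self._blocks = {}
    def get(self, prefix_hash):
        return self._blocks.get(prefix_hash)
    def put(self, prefix_hash, blocks):
        self._blocks[prefix_hash] = blocks

class SharedKVIndex:
    """Shared (E/W) tier: global index mapping prefixes to remote holders."""
    def __init__(self):
        self._locations = {}   # prefix_hash -> instance name
    def locate(self, prefix_hash):
        return self._locations.get(prefix_hash)
    def publish(self, prefix_hash, instance):
        self._locations[prefix_hash] = instance

def lookup_kv(prefix_hash, local: LocalKVStore, shared: SharedKVIndex):
    """Prefer the zero-cost local tier; fall back to a cross-instance transfer."""
    blocks = local.get(prefix_hash)
    if blocks is not None:
        return ("local", blocks)
    holder = shared.locate(prefix_hash)
    if holder is not None:
        return ("transfer-from", holder)   # e.g. pull KV blocks over NIXL
    return ("recompute", None)             # cache miss: run prefill

local, shared = LocalKVStore(), SharedKVIndex()
shared.publish("abc123", "decode-7")
print(lookup_kv("abc123", local, shared))  # ('transfer-from', 'decode-7')
```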
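For the autoscaling bullet, the sizing sketch below shows the kind of arithmetic the planned load function implies: measure per-instance capacity for each request shape, then size the prefill and decode pools from the recent traffic mix. The capacity numbers, shapes, and headroom factor are made up; the actual design is described in the Northstar document linked above.

```python
import math

# Hypothetical per-instance capacity (requests/second) by request shape.
prefill_capacity = {"short_prompt": 40.0, "long_prompt": 8.0}
decode_capacity = {"short_output": 25.0, "long_output": 12.0}

# Recent traffic mix: observed QPS per shape.
traffic = {
    "short_prompt": 30.0, "long_prompt": 6.0,   # drives prefill load
    "short_output": 20.0, "long_output": 16.0,  # drives decode load
}

def required_instances(capacity, qps_by_shape, headroom=0.8):
    """Sum fractional instance-equivalents per shape, keeping 20% headroom."""
    load = sum(qps_by_shape[shape] / capacity[shape] for shape in capacity)
    return math.ceil(load / headroom)

prefill_n = required_instances(prefill_capacity, traffic)
decode_n = required_instances(decode_capacity, traffic)
print(f"prefill instances: {prefill_n}, decode instances: {decode_n}")
# With these made-up numbers: prefill load = 30/40 + 6/8 = 1.5 -> 2 instances;
# decode load = 20/25 + 16/12 ≈ 2.13 -> 3 instances.
```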
For more see the project proposal.
llm-d can be installed as a full solution with a customizable set of enabled features, or through its individual components for experimentation.
llm-d's deployer can be used to install all main components using a single Helm chart on Kubernetes.
llm-d is a metaproject composed of subcomponent repositories that can be cloned individually.
To clone all main components:
repos="llm-d llm-d-deployer llm-d-inference-scheduler llm-d-kv-cache-manager llm-d-routing-sidecar llm-d-model-service llm-d-benchmark llm-d-inference-sim"; for r in $repos; do git clone https://github.com/llm-d/$r.git; done
Tip
As a customization example, see this template for adding a custom scheduler filter.
Visit our GitHub Releases page and review the release notes to stay updated with the latest releases.
- See our project overview for more details on our development process and governance.
- We use Slack to discuss development across organizations. Please join: Slack
- We host a weekly standup for contributors on Wednesdays at 12:30pm ET. Please join: Meeting Details
- We use Google Groups to share architecture diagrams and other content. Please join: Google Group
This project is licensed under Apache License 2.0. See the LICENSE file for details.