Original link: https://news.ycombinator.com/item?id=44040883
LLM-d is a Kubernetes-native system designed for distributed inference of Large Language Models (LLMs) in large-scale production environments (5+ H100 hosts). It aims to address the unique serving challenges posed by LLMs.
The system operates as a three-tier architecture: a balancing/scheduling tier for incoming requests, a tier of model server replicas, and a prefix caching hierarchy. It leverages the Kubernetes Inference Gateway extension for model routing, request prioritization, LoRA support, and flow control.
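To make that first tier concrete, here is a minimal sketch of the kind of scheduling decision it implies: send a request to the vLLM replica whose KV cache likely already holds the request's prefix, penalized by current load. All names and the scoring formula below are hypothetical illustrations, not llm-d's actual API.

```python
from dataclasses import dataclass

@dataclass
class Replica:
    name: str
    cached_prefixes: set   # prefix-block hashes this replica likely holds in its KV cache
    queue_depth: int       # outstanding requests, a rough proxy for load

def prefix_hashes(prompt: str, block: int = 16) -> set:
    """Hash growing prompt prefixes in fixed-size blocks so shared prefixes map to shared keys."""
    return {hash(prompt[: i + block]) for i in range(0, len(prompt), block)}

def pick_replica(prompt: str, replicas: list) -> Replica:
    want = prefix_hashes(prompt)
    def score(r: Replica) -> float:
        cache_affinity = len(want & r.cached_prefixes) / max(len(want), 1)
        load_penalty = r.queue_depth / 10.0
        return cache_affinity - load_penalty   # prefer warm KV caches, avoid hot replicas
    return max(replicas, key=score)

# Usage: the replica that already served the shared system prompt wins despite its deeper queue.
replicas = [
    Replica("vllm-0", prefix_hashes("You are a helpful assistant."), queue_depth=2),
    Replica("vllm-1", set(), queue_depth=1),
]
print(pick_replica("You are a helpful assistant. Summarize the attached report.", replicas).name)
```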
The project differs from approaches like the NVIDIA Dynamo SDK by focusing on dynamic runtime efficiency as traffic changes on Kubernetes. While the Dynamo SDK aims to simplify Dynamo adoption on Kubernetes, LLM-d targets users with mature Kubernetes deployments that manage both inference and training workloads, and aims to optimize resource utilization across those diverse workloads. LLM-d builds on vLLM for multi-host support and inherits its model compatibility, concentrating primarily on large generative models.
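As a rough illustration of what "dynamic runtime efficiency based on changing traffic" can mean in practice, the sketch below sizes a replica set from observed demand rather than a statically defined pipeline. This is a hypothetical policy, not llm-d's actual controller logic.

```python
# Hypothetical sketch, not llm-d's actual controller: size the vLLM replica set
# from observed queueing and in-flight requests as traffic changes.
def desired_replicas(queued: int, in_flight: int,
                     per_replica_capacity: int = 8,
                     min_replicas: int = 1, max_replicas: int = 16) -> int:
    demand = queued + in_flight
    target = -(-demand // per_replica_capacity)  # ceiling division
    return max(min_replicas, min(max_replicas, target))

# Example: 40 queued plus 20 in-flight requests at ~8 concurrent requests per
# replica suggests scaling to 8 replicas.
print(desired_replicas(queued=40, in_flight=20))  # -> 8
```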
* The "stack-centric" approach such as vLLM production stack, AIBrix, etc. These set up an entire inference stack for you including KV cache, routing, etc.
* The "pipeline-centric" approach such as NVidia Dynamo, Ray, BentoML. These give you more of an SDK so you can define inference pipelines that you can then deploy on your specific hardware.
It seems like LLM-d is the former. Is that right? What prompted you to go in that direction rather than Dynamo's?
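To make the distinction in the two bullets above concrete, here is a toy pipeline-centric sketch with made-up names (not any vendor's SDK): you describe the inference flow as code and deploy that program, whereas a stack-centric system takes a declarative spec (model, replica count, cache policy) and wires up routing and KV caching for you.

```python
# Toy pipeline-centric sketch; names are invented purely for illustration.
class Pipeline:
    def __init__(self):
        self.stages = []

    def stage(self, fn):
        """Register a stage; stages run in the order they are declared."""
        self.stages.append(fn)
        return fn

    def run(self, request):
        for fn in self.stages:
            request = fn(request)
        return request

pipe = Pipeline()

@pipe.stage
def tokenize(req):
    return {"tokens": req["prompt"].split()}

@pipe.stage
def generate(req):
    # Stand-in for a model call.
    return {"output": " ".join(req["tokens"]) + " ..."}

print(pipe.run({"prompt": "hello world"}))  # {'output': 'hello world ...'}
```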