Original link: https://news.ycombinator.com/item?id=44040883
LLM-D is a Kubernetes-native distributed large language model inference system focused on efficient request scheduling and model serving. It is distinguished from other approaches by its three-layer architecture: load balancing, model server replicas, and prefix caching.

The discussion highlights LLM-D's advantages over alternatives such as vLLM and NVIDIA Dynamo. Dynamo provides an SDK for defining inference pipelines, whereas LLM-D uses the Kubernetes Inference Gateway extension, exposing Kubernetes-native APIs for model routing, priority, and traffic control, which suits large-scale deployments run by teams already familiar with Kubernetes. It favors runtime efficiency adjustments that adapt to changing workloads, unlike approaches where configurations are statically defined up front.

LLM-D targets large LLM deployments (5+ H100 hosts) and builds on vLLM for multi-host support. It is a focused solution rather than a broader platform like KServe; because it concentrates on serving LLMs, it may not support models such as CLIP. The system is designed specifically for the distinct demands of serving large language models at scale.
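As a rough illustration of what that Kubernetes-native API surface looks like in practice, here is a minimal sketch that creates an `InferenceModel` custom resource with the Python Kubernetes client. The API group, version, and field names follow the alpha Gateway API Inference Extension and are assumptions; the model and pool names are hypothetical, and a live cluster with the CRDs installed is assumed.

```python
# Minimal sketch: register a model with the inference gateway by creating
# an InferenceModel custom resource. Group/version and spec fields are
# assumptions based on the Gateway API Inference Extension alpha; check
# the project docs for your release.
from kubernetes import client, config

config.load_kube_config()  # assumes a local kubeconfig with cluster access

# Hypothetical InferenceModel: routes requests carrying this model name to a
# pool of vLLM replicas, with a criticality level the gateway can use for
# priority and flow control.
inference_model = {
    "apiVersion": "inference.networking.x-k8s.io/v1alpha2",
    "kind": "InferenceModel",
    "metadata": {"name": "llama-70b", "namespace": "default"},
    "spec": {
        "modelName": "meta-llama/Llama-3-70B",  # name clients send in requests
        "criticality": "Critical",              # vs. Standard / Sheddable
        "poolRef": {"name": "llama-70b-pool"},  # backing InferencePool of replicas
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="inference.networking.x-k8s.io",
    version="v1alpha2",
    namespace="default",
    plural="inferencemodels",
    body=inference_model,
)
```

The point of this style is that routing, priority, and pooling live in declarative cluster objects rather than in application code, so they can be managed with the same tooling (kubectl, GitOps) as the rest of a Kubernetes deployment.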
* The "stack-centric" approach such as vLLM production stack, AIBrix, etc. These set up an entire inference stack for you including KV cache, routing, etc.
* The "pipeline-centric" approach such as NVidia Dynamo, Ray, BentoML. These give you more of an SDK so you can define inference pipelines that you can then deploy on your specific hardware.
It seems like LLM-d is the former. Is that right? What prompted you to go down that direction, instead of the direction of Dynamo?
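For contrast with the declarative example above, here is a hedged sketch of the "pipeline-centric" style using Ray Serve. The two-stage split, class names, and logic are illustrative only, not LLM-D or Dynamo code; the pipeline is composed and deployed in Python rather than described as cluster resources.

```python
# Minimal Ray Serve sketch of the "pipeline-centric" style: deployments are
# composed in code and deployed as one graph. Stage names and logic are
# illustrative stand-ins, not a real inference pipeline.
from ray import serve


@serve.deployment
class Tokenizer:
    def tokenize(self, text: str) -> list[str]:
        # Stand-in for a real tokenizer.
        return text.split()


@serve.deployment
class Generator:
    def __init__(self, tokenizer):
        # Receives a handle to the Tokenizer deployment when bound below.
        self.tokenizer = tokenizer

    async def __call__(self, request):
        payload = await request.json()
        tokens = await self.tokenizer.tokenize.remote(payload["prompt"])
        # Stand-in for actual generation.
        return {"n_tokens": len(tokens)}


# The pipeline is wired in Python, then deployed as a unit; by default it
# serves HTTP on localhost:8000.
app = Generator.bind(Tokenizer.bind())
serve.run(app)
```

The trade-off the commenter is pointing at: the SDK style gives you programmatic control over pipeline structure per deployment, while the stack-centric style hands you an opinionated serving stack and exposes the knobs through platform APIs instead.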