高性能 EP 内核剖析

高性能 EP 内核剖析
Anatomy of a high-performance EP kernel

原始链接: https://fergusfinn.com/blog/anatomy-of-a-high-performance-ep-kernel/

大型语言模型（LLMs），特别是混合专家（MoE）模型，需要复杂的并行策略才能在多个 GPU 上高效运行。标准并行化采用固定的通信模式，而 MoE 则需要“专家并行”（Expert Parallelism, EP），其数据路由是动态的，且在运行时确定。专家并行的核心挑战在于将 Token 发送至指定的专家（“分发/Dispatch”），并将计算结果返回（“合并/Combine”）。其效率取决于服务阶段： * **高吞吐量（预填充/Prefill）：** 系统在移动数据前，会执行“协调步骤”以获取精确的 Token 计数。这使得系统能够分配紧凑且内存高效的缓冲区，从而将通信开销隐藏在计算密集型的矩阵乘法之后。 * **低延迟（解码/Decode）：** 为了避免网络往返带来的开销，该方法放弃了缓冲区紧凑性。它为每个源-专家对预留最坏情况下的内存缓冲区，从而实现 Token 的即时发送。这种方式通过消除协调延迟，以牺牲内存换取速度。最终，诸如 DeepEP 等现代实现以及不断发展的 UCCL 项目，通过优化内核来处理 MoE 的动态特性，确保 Token 能够高效地在网络架构中穿梭，以匹配相应的专家并返回结果，无论系统侧重于带宽还是延迟。

```Hacker News最新 | 过往 | 评论 | 提问 | 展示 | 招聘 | 提交登录高性能 EP 内核剖析 (fergusfinn.com)8 分，由 kkm 发布于 1 小时前 | 隐藏 | 过往 | 收藏 | 1 条评论帮助 mezark 18 分钟前 [–] 我喜欢这个博客回复指南 | 常见问题 | 列表 | API | 安全 | 法律 | 加入 YC | 联系搜索： ```

Large language models are large. Because they’re large, we need lots of GPUs to run them. It would be nice if LLM inference were ‘embarrassingly parallel’ and we could just always compute independent things on different GPUs. But alas, to use lots of GPUs on LLM inference, we need to get those GPUs talking to one another.

There are lots of different ways to get different GPUs working together: Tensor Parallelism, Pipeline Parallelism, Context Parallelism, Expert Parallelism, etc. All have their place. But for MoE models, in the MoE layers, when you want to serve at large scale, ‘wide Expert Parallelism’ (wideEP) is kingSee vLLM’s original DeepSeek large-scale serving post for a demonstration at production scale: DeepSeek at 2.2k tokens/s per GPU on an H200 cluster, served with wideEP and data parallel attention..

The other kinds of parallelism all require communication between GPUs, but their patterns are fixed by the architecture: who sends, who receives, and how much, are all known before the forward pass begins, and are the same on every step. The comms can run as standard collectives.

Expert parallelism is different. Which tokens need to reach which GPUs is decided by the router, from the data, at runtime, fresh in every MoE layer. And the tokens have somewhere to be reached from: we’ll assume the ‘data parallel attention’ arrangement DeepSeek serves with, where each token lives on exactly one rank (a rank being one GPU somewhere in our cluster). The experts are spread across those same ranks, so a token and the experts it’s routed to will generally not be in the same place. Here’s an example, with 8 GPUs split across 2 nodes, two experts per GPU, 1 token per rank, and 2 routed experts per token:

Hover a rank chip for its token’s round trip, or an expert for everything routed to it. Four of the sixteen experts drew no tokens at all this step: the routing is lumpy.

When it comes time to run our MoE layers, our tokens have to go and meet their experts, wherever they might be in the network fabric. It’s the job of the EP communication kernel to make that happen.

The modern shape of these kernels was set by DeepSeek’s DeepEP library. In this post we’ll build up the anatomy of a DeepEP-style dispatch and combine kernel: the high-throughput shape first, then the low-latency one.

The job we have to do§

Let’s make the setup concrete. We have 8 GPUs, split across 2 nodes, connected with RDMA, and each data parallel rank owns a single GPU. Attention runs on each GPU over a batch of $B_i$

At each rank $r_i$

So at the entrance to the EP layer each rank holds two things: the activation rows it produced, and, after the local routing pass, the top-2 expert assignment for each of those rows.

Not all of the experts live locally. Some live next door, on neighbouring NVLink peers, and some live far away, on nodes reachable only over RDMA. The goal of the expert parallelism kernels is to get the activations where they need to go, run the expert GEMMs when they get there, and then bring them back home.

We’re doing communications here, and with communications it’s handy to specialise on what we care about most: throughput, or latency. The split maps onto the two phases of inference: prefill brings big, compute-bound batches with plenty of other work to hide communication behind, while at decode there is hardly anything else to do, so the transfer itself is what we wait on. We’ll start with the throughput-optimised standard EP shape, then discuss the latency-optimised shape.

High throughput: ask, then send§

Dispatch§

The point of dispatch is to feed a grouped GEMMA grouped GEMM is one kernel launch running a separate matrix multiply per expert, each over a different number of rows. Routing produces exactly this raggedness: each expert’s group is however many tokens it drew.. After routing, every expert has to run a matrix multiply over exactly the tokens assigned to it. Before routing, those tokens are scattered across every rank in the cluster.

So on each rank, dispatch has to gather the tokens bound for the local experts into a single dense buffer that a grouped GEMM can consume in one go.

The difficulty is that we don’t know the shape of that buffer ahead of time. Which token goes to which expert is decided by the router at runtime, and the distribution is lumpy and changes every step. An expert might draw a thousand tokens now and none next time. We can’t know how many our local experts will receive until every other rank has run its router. So neither the size of the local activation buffer nor the slot in that buffer into which any particular token lands is known in advance.

There are two ways to cope with not knowing the size.

We can reserve enough room for the worst case, by allocating a fixed rectangle with a padded slot per expert.
We can find out the real counts first and then allocate exactly what we need.

The fixed rectangle is simpler, but it has to be sized for the worst case. The worst case doesn’t scale with the tokens you actually receive: all the peers might route their whole batches at the same expert, so every padded slot has to be big enough for everyone at once. At prefill batch sizes that means far more HBM holding emptiness than data, and spare HBM is exactly the resource we want back, because it becomes KV cache, which is what keeps sequences in flight. The price of exactness, meanwhile, is affordable here: when batches are large and the GEMMs are compute bound, the extra communication it takes to learn the real counts can hide behind computeServing stacks manufacture room to hide it in: with two-batch overlap, the step is split into two microbatches, so that while one microbatch’s tokens are on the wire, the other’s GEMMs keep the SMs busy. See SGLang’s large-scale EP writeup.. So the throughput path allocates a ragged buffer, $(N_\mathrm{local}, H)$

If we want to allocate only what we need, we have to learn the counts before any activations move. We can do so by running a coordination pass. Every rank already knows from its own routing how many tokens it’s sending to each peer. If everyone trades those numbers, each rank can add up how much it’s about to receiveThe exchange mirrors the fabric: counts cross between nodes over RDMA, then between GPUs within a node over NVLink, gathering as they go, so the coordination costs the same two hops the real dispatch will..

The coordination pass is cheap in bytes, only a handful of integers per peer rather than megabytes of hidden state. Once a rank knows how many tokens are coming from each source, a write-safe layout of the buffer comes naturally as a prefix sum: the first source’s tokens start at zero, the next source’s start where those end, and so on. The counts hand us both of the things we were missing: how big to make the buffer, and where every block sits inside it.

With the layout fixed, we can actually send the activations. The sender never writes to the final buffer directly. It couldn’t if it wanted to: RDMA writes can only land in memory that was registered with the NIC ahead of time, and the compact buffer is allocated fresh each step, at a size we only just learned. So the sender streams its tokens into a small fixed-size queue on the destination, carved out of pre-registered memory, and the receiver, which owns the compact buffer, drains that queue and copies each token into the slot the prefix sum assigned it. The queue also lets the two sides run at their own pace, with its depth bounding how far ahead the sender can get before it has to wait for the receiver to catch up. The queue is fixed-size too, but fixed at a constant.

For the queues that cross nodes there’s one more hop hiding in the picture. A token never travels point-to-point to an arbitrary remote GPU: it goes over RDMA to the GPU with the same index on the destination node, and that GPU forwards it over NVLink to its final host. Each GPU then only ever talks to its own counterparts across nodes, which keeps every RDMA flow on its own rail of the fabric and caps the number of connections each NIC has to keep fed.

What lands in the compact buffer is grouped by where it came from, not by which expert it’s for: the transfer is coarser than the routingA token is sent to a peer once if any of that peer’s experts want it, even if two do, with the per-token expert assignment carried alongside. That is what makes the transfer coarser than the routing.. The grouped GEMM wants contiguous per-expert blocks, so the last step of dispatch is a local permute, from by-source order into by-expert order. In DeepEP this last step is the caller’s: dispatch hands back the by-source buffer along with per-expert counts, and the serving framework does the reordering, or feeds the indices straight to a GEMM that can consume them.

Combine§

The point of combine is to un-run the dispatch kernel, and add up the contributions for each token.

The GEMM left its outputs grouped by expert, so the first thing we have to do is to undo the permutation we did on the way in. The inverse permutation puts the outputs back into the by-source-rank layout that dispatch delivered tokens in.

From there the transport runs in reverse. The rank that hosted the expert is now the sender, streaming its outputs back through the home rank’s per-peer queues into the positions the tokens came from.

We don’t need to do the coordination pass, since combine is handed the same routing information dispatch produced, so it already knows where everything needs to return to.

Each token was routed to $K$

So combine mirrors dispatch at every step. The permute becomes an unpermute, the coordination pass is replaced by reusing dispatch’s routing, and where dispatch compacted arriving tokens into contiguous positions, combine sums them into slots.

Low latency: send without asking§

The reason to optimise the kernel for latency is the decode regime. Each rank holds only a handful of tokens, often one per sequence in the batch. The coordination pass was cheap in bytes, but it’s a full network round trip with barriers, and it has to finish before any activations move. At decode there is little to overlap it with, and it becomes a large fraction of the layerA second important penalty: dynamic shapes like the ones in the high-throughput kernels are tough to push into CUDA graphs, which are more important during decode..

So we want to figure out how to skip it. The coordination pass only ever existed to turn counts into write offsets, and that was only needed because the compact buffer made each sender’s offset depend on what every other sender did. If we’re willing to give up compactness, we can prearrange space for each peer rank to write into. Instead of one packed buffer we pre-reserve a fixed, private region for every (source rank, expert) pair. This is the fixed rectangle from the dispatch fork, with one refinement: the padding is per (source rank, expert) rather than per expert, so no two senders ever write into the same region.

The address a sender writes to is now a formula, the region for its (source, expert) plus a local slot, and every rank can compute it alone:

\mathrm{addr} = \mathrm{base} + (e \cdot R + r) \cdot \mathrm{chunk} + \mathrm{slot}