CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs

Original link: https://arxiv.org/abs/2605.19269

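This paper introduces **CODA**, a GPU kernel abstraction that speeds up Transformer training by targeting its memory-bound operators. Transformer training is built around high-performance dense linear algebra (GEMM), yet a nontrivial share of execution time goes to moving intermediate tensors for memory-bound work such as normalization, activation functions, and residual updates. CODA addresses this by reparameterizing these operators as "GEMM-plus-epilogue" programs. Running them while the GEMM output tile is still on chip, instead of repeatedly reading and writing global memory, reduces data movement. The abstraction provides a small set of composable primitives for common operations, so developers can write kernels that keep the efficiency of expert-written GEMMs while retaining the flexibility Transformer blocks need. The evaluation reports that both human-written and LLM-generated CODA kernels achieve high performance on representative workloads. CODA thus offers a practical way to narrow the gap between framework-level productivity and hardware-level efficiency.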
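To make the data-movement argument concrete, here is a rough back-of-the-envelope sketch. It is not taken from the paper: the function name `epilogue_traffic_bytes`, the tensor sizes, and the traffic model are all assumptions chosen for illustration. The model counts only global-memory traffic on the M x N GEMM output and ignores caches and any extra operands (for example a residual input) that both schedules would have to read.

```python
def epilogue_traffic_bytes(m, n, bytes_per_elem=2, num_elementwise_ops=1):
    """Estimate global-memory bytes moved on the M x N GEMM output.

    Simplified model (an assumption, not a measurement): GEMM operand reads
    are the same in both schedules, so they are left out.
    """
    out_bytes = m * n * bytes_per_elem
    # Unfused: the GEMM writes its output once, then each separate
    # elementwise kernel reads the full tensor and writes a new one.
    unfused = out_bytes + num_elementwise_ops * 2 * out_bytes
    # Fused: the epilogue transforms each tile on chip, so only the
    # final result is written to global memory.
    fused = out_bytes
    return unfused, fused


# Hypothetical sizes: 8192 x 4096 output in 16-bit precision, followed by
# two elementwise operators (say, an activation and a scaling).
unfused, fused = epilogue_traffic_bytes(m=8192, n=4096, num_elementwise_ops=2)
print(unfused // 2**20, fused // 2**20)  # MiB: 320 64
```

Under these assumptions the unfused schedule moves five times as many bytes on the output tensor as the fused one, while performing the same arithmetic.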


[Submitted on 19 May 2026 (v1), last revised 20 May 2026 (this version, v2)]

By Han Guo and 6 other authors


Abstract: Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through global memory while performing little arithmetic, making data movement an increasingly important bottleneck in otherwise highly optimized training stacks. We introduce CODA, a GPU kernel abstraction that expresses these computations as GEMM-plus-epilogue programs. CODA is based on the observation that many Transformer operators exposed as separate framework kernels can be algebraically reparameterized to execute while a GEMM output tile remains on chip, before it is written to memory. The abstraction fixes the GEMM mainloop and exposes a small set of composable epilogue primitives for scaling, reductions, pairwise transformations, and accumulation. This constrained interface preserves the performance structure of expert-written GEMMs while remaining expressive enough to cover nearly all non-attention computation in the forward and backward pass of a standard Transformer block. Across representative Transformer workloads, both human- and LLM-authored CODA kernels achieve high performance, suggesting that GEMM-plus-epilogue programming offers a practical path toward combining framework-level productivity with hardware-level efficiency.
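The abstract does not spell out which reparameterizations CODA uses, so the following is only one plausible illustration of the idea, written in plain Python rather than as a GPU kernel. It relies on a standard identity: for RMSNorm with gain g followed by a linear layer W, each output row equals (1 / rms(x_row)) * (x_row @ diag(g) W), so the gain can be folded into the weights and the per-row scale applied in the epilogue. All function names here (`rmsnorm_then_matmul`, `gemm_with_epilogue`, `fused_rmsnorm_matmul`) are hypothetical and not part of any CODA API.

```python
import math


def rmsnorm_then_matmul(x, g, w, eps=1e-6):
    """Reference path: normalize each row of x, then multiply by w."""
    out = []
    for row in x:
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps)
        # This normalized row is the intermediate a separate kernel
        # would write to and re-read from global memory.
        normed = [v * inv_rms * gi for v, gi in zip(row, g)]
        out.append([sum(a * w[k][j] for k, a in enumerate(normed))
                    for j in range(len(w[0]))])
    return out


def gemm_with_epilogue(a, b, epilogue, tile=2):
    """Tiled GEMM; epilogue(i, j, acc) runs on each accumulator before the store."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for i0 in range(0, m, tile):
        for j0 in range(0, n, tile):
            for i in range(i0, min(i0 + tile, m)):
                for j in range(j0, min(j0 + tile, n)):
                    acc = sum(a[i][p] * b[p][j] for p in range(k))  # fixed mainloop
                    c[i][j] = epilogue(i, j, acc)  # tile is still "on chip" here
    return c


def fused_rmsnorm_matmul(x, g, w, eps=1e-6):
    """Same result as rmsnorm_then_matmul, expressed as GEMM plus a scaling epilogue."""
    # Fold the gain into the weights once: diag(g) @ w.
    w_folded = [[g[k] * w[k][j] for j in range(len(w[0]))] for k in range(len(w))]
    # Per-row scale, computed up front in this sketch.
    inv_rms = [1.0 / math.sqrt(sum(v * v for v in row) / len(row) + eps) for row in x]
    return gemm_with_epilogue(x, w_folded, lambda i, j, acc: acc * inv_rms[i])


# Small made-up inputs to check that both paths agree.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [0.5, -1.0, 2.0]]
g = [1.0, 0.5, 2.0]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref = rmsnorm_then_matmul(x, g, w)
fused = fused_rmsnorm_matmul(x, g, w)
assert all(math.isclose(r, f, rel_tol=1e-9, abs_tol=1e-12)
           for ref_row, fused_row in zip(ref, fused)
           for r, f in zip(ref_row, fused_row))
```

The point of the sketch is structural: the mainloop in `gemm_with_epilogue` never changes, and the operator-specific work lives entirely in the small callback, which mirrors the "fixed mainloop, composable epilogue" split the abstract describes.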

From: Han Guo
[v1] Tue, 19 May 2026 02:30:43 UTC (1,121 KB)
[v2] Wed, 20 May 2026 17:38:24 UTC (493 KB)
