HipKittens: Fast and furious AMD kernels

Original link: https://hazyresearch.stanford.edu/blog/2025-11-09-hk

## HipKittens: Bridging the AMD AI software gap

AI development today is gated by hardware, and broadening the compute landscape is critical. While AMD GPUs offer competitive performance, their software ecosystem lags behind NVIDIA's, keeping that capability out of reach. This post introduces **HipKittens**, a collection of optimized AMD kernels and programming primitives designed to simplify and accelerate AMD GPU development.

The work shows that existing AMD software (AITER, PyTorch, and compilers such as Triton, Mojo, and TileLang) often fails to reach peak performance and instead relies on extensive hand-tuning in low-level assembly, a major obstacle to scaling. HipKittens addresses this by taking the **tile-based abstractions** that proved successful for NVIDIA development (ThunderKittens) and adapting them to AMD's distinct architecture.

The key findings are that these tile-based primitives generalize well across architectures, enabling readable, maintainable code that **outperforms AMD's existing baselines** on attention and GEMM workloads, including hand-optimized assembly kernels. HipKittens aims to make AMD GPUs broadly usable for AI, moving past tedious assembly programming and paving the way for a more diverse and open AI hardware ecosystem. The code and paper are publicly available.

## AMD and AI progress: a summary

A recent Hacker News discussion highlighted AMD's notable progress on AI, moving past the software problems that have long held it back. George Hotz (georgehotz) disclosed a contract with AMD to train Llama on MI350X hardware, and noted that PyTorch/ROCm support has improved substantially: once nearly unusable, it can now handle tasks such as running nanochat. AMD still trails NVIDIA, but it is no longer considered hopeless, and software investment has been the key.

The conversation centered on optimizing AMD GPUs for LLMs, with projects like HipKittens providing valuable insight into where improvements are needed. Tinygrad is actively implementing these optimizations. Users reported positive experiences running LLMs locally on AMD hardware (even consumer-grade parts), though dedicated GPUs remain faster.

A key takeaway is AMD's growing competitiveness, driven by hardware improvements and a renewed focus on software. NVIDIA retains a strong position, especially in the data center, but the odds of AMD (and others) catching up are improving, particularly as the AI field grows beyond NVIDIA's current dominance. Concerns remain about AMD's internal development practices and the additional investment needed to truly compete.

Original article

Team: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Fu, Ryan Swann, Muhammad Osama, Christopher Ré, Simran Arora
Links: Arxiv | Code

AI is gated by hardware. We think that opening up AI’s compute landscape is one of the most important problems to be working on right now. Building towards this goal, we present HipKittens: SoTA AMD kernels and a collection of opinionated programming primitives to make AMD kernel dev easier!

Named after AMD's CUDA equivalent, called HIP.

Building towards multi-silicon AI systems

While AI has largely used a single hardware vendor to get to its current stage, AMD GPU hardware now offers state-of-the-art peak compute and memory bandwidth. However, this performance is locked away from AI workflows due to the lack of mature AMD software.

| Spec | NVIDIA B200 SXM5 | AMD MI355X OAM |
| --- | --- | --- |
| BF16 matrix / tensor | 2.2 PFLOPs | 2.5 PFLOPs |
| MXFP8 matrix / tensor | 4.5 PFLOPs | 5.0 PFLOPs |
| MXFP6 matrix / tensor | 4.5 PFLOPs | 10.1 PFLOPs |
| MXFP4 matrix / tensor | 9.0 PFLOPs | 10.1 PFLOPs |
| Memory capacity | 180 GB | 288 GB |
| Memory bandwidth | 8.0 TB/s | 8.0 TB/s |

Table 1: Hardware overview. Peak memory and compute speeds for the latest generation GPU platforms.
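One way to read these numbers (a back-of-the-envelope observation, not a claim from the table itself): dividing the MI355X's 2.5 PFLOPs of peak BF16 compute by its 8.0 TB/s of memory bandwidth gives 2.5e15 / 8.0e12 ≈ 312 FLOPs per byte moved, so a kernel must sustain roughly that arithmetic intensity before it is compute-bound rather than bandwidth-bound. Approaching either peak therefore takes careful, architecture-aware kernel scheduling, for dense GEMM/attention kernels and memory-bound kernels alike.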

The AMD software ecosystem includes AITER, a high performance AI kernel library; PyTorch and a few compilers (Triton, Mojo, TileLang); and Composable Kernel (CK), AMD's C++ based programming model for writing kernels. However, despite gigawatt-scale AMD deployments, the software remains brittle.

  1. The existing software offerings fail to consistently achieve peak performance. CK kernels frequently underperform (see our evaluations below). AITER and PyTorch are volatile; for instance, the AITER and PyTorch SDPA Llama GQA backwards kernels achieve just 30% and 24% of SoTA performance, respectively, on AMD MI355X GPUs. The compilers currently sacrifice significant performance and have not yet demonstrated reusable programming primitives for AMD. Further, we find that some critical aspects of hardware functionality around bank conflict avoidance are undocumented in the CDNA ISA, as discussed in our technical deep dive blog.

We expand on a few observations about current compilers:

While the Pythonic interface and portability of such compilers is helpful, our goal is to identify the principles that lead to peak performance as well.

  2. As a result, AMD's most performant AI kernels need to be hand-optimized by experts in raw assembly. This approach is very difficult to scale to the breadth of AI workloads, so many widely used AI workloads remain unsupported or under-optimized on AMD (e.g., we see this on some attention problem shapes, the non-causal GQA backwards pass, and memory-bound kernels).

Figure: what is raw assembly? can't understand it? that's the point!

With all of this, it remains up in the air what the best path forwards is for multi-silicon kernel development!

As a result, the AI community says that there’s a CUDA moat in AI software: tweet #1, tweet #2, tweet #3 and many more.

But that being said, developing performant NVIDIA kernels was also painstakingly tedious a few years ago. Using low level CUDA/CUTLASS, it took two years between the H100 GPU’s release and the release of peak performance open-source attention kernels. Compilers and LLMs-for-kernel-development on NVIDIA have so far sacrificed performance for simplicity and struggled to quickly support new hardware features.

Opinionated primitives are simplifying the process of writing performant NVIDIA kernels! Amazingly through the community’s effort on all these DSLs and AI assistants, NVIDIA kernel development is starting to get easier! Last year, we shared our opinionated take on kernel DSLs — keep the familiar PyTorch feel, but make the primitives C++ embedded to get peak performance, simplicity and extensibility to hardware platforms and AI workloads. In May 2024, we shared ThunderKittens (TK), and have been excited to see its ideas used in a wave of frameworks this year like CuTe DSL in Sept 2025, Tiny Grad “tinykittens”, TileLang in April 2025 and Gluon in June 2025. It’s been fun to see TK used in companies like Together AI, Jump Trading, and Cursor and in academic research.

So then we were curious whether entirely new programming primitives are needed to simplify AMD kernel development, or whether existing primitives suffice. It wasn't obvious to us where this exploration would end up; most modern kernels are designed around NVIDIA-specific hardware features. AMD hardware differs meaningfully (no wgmma/tcgen05 pipelined async matmuls, no tma, no mbarriers, no register reallocation, smaller shared memory, chiplet instead of monolithic, etc.) and we weren't sure where performance would end up nor how different the primitives might look compared to NVIDIA frameworks.

Our exploration resulted in HipKittens, a minimal, opinionated collection of C++ embedded programming primitives for fast AMD kernels. We find:

  1. The tile abstraction generalizes across architectures. The core tile-based primitives we identified as effective on NVIDIA GPUs—including tile types, PyTorch-like bulk compute operators over tiles, and composable load/store interfaces—translate naturally to AMD.
  2. Backend implementations are architecture-specific. The underlying memory access patterns (e.g., swizzling schemes, register scheduling) that realize the tile interface differ between AMD and NVIDIA due to hardware differences.
  3. Scheduling strategies adapt to hardware constraints. The scheduling patterns both within a processor and across processors differ on AMD compared to NVIDIA, reflecting fundamental architectural distinctions. Wave specialization underperforms on CDNA3 and CDNA4. However, reasoning about schedules at the tile granularity—rather than at the level of individual registers or memory transactions—continues to simplify development, maintain code readability, and enable peak performance.

Ultimately, we see that tile-based abstractions remain general across architectures, providing evidence that a unified, performant programming model for AI accelerators is achievable. The key insight is separating the interface (tiles and operations on tiles) from the implementation (how tiles map to hardware), allowing the same high-level programming model to target diverse GPU architectures.
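To make that interface/implementation split concrete, here is a minimal toy sketch in plain host-side C++. The names are hypothetical stand-ins, not the actual HipKittens API (see the linked code for that): the point is only the shape of a tile program, where the programmer declares tile types and then expresses the kernel as bulk operations over tiles (zero, load, mma, store). In a real kernel the tile is distributed across a wave's registers and the operators lower to architecture-specific swizzles and MFMA matrix-core instructions; here everything runs in one thread purely to illustrate the programming model.

```cpp
// Toy sketch of a tile programming model (hypothetical names, not the HK API).
#include <array>
#include <cstdio>

// "Interface": a 2-D tile type the programmer reasons about.
template <typename T, int M, int N>
struct tile {
    std::array<T, M * N> data{};
    T&       at(int r, int c)       { return data[r * N + c]; }
    const T& at(int r, int c) const { return data[r * N + c]; }
};

// Bulk operators over tiles. In a real library, these are the pieces whose
// implementations (layouts, swizzles, matrix instructions) differ per architecture.
template <typename T, int M, int N>
void zero(tile<T, M, N>& t) { t.data.fill(T{0}); }

template <typename T, int M, int N>
void load(tile<T, M, N>& t, const T* src, int ld, int row0, int col0) {
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) t.at(r, c) = src[(row0 + r) * ld + (col0 + c)];
}

template <typename T, int M, int N>
void store(T* dst, int ld, const tile<T, M, N>& t, int row0, int col0) {
    for (int r = 0; r < M; ++r)
        for (int c = 0; c < N; ++c) dst[(row0 + r) * ld + (col0 + c)] = t.at(r, c);
}

// c += a @ b: the tile-granularity matmul step (an MFMA instruction in a real backend).
template <int M, int K, int N>
void mma_AB(tile<float, M, N>& c, const tile<float, M, K>& a, const tile<float, K, N>& b) {
    for (int r = 0; r < M; ++r)
        for (int col = 0; col < N; ++col)
            for (int k = 0; k < K; ++k) c.at(r, col) += a.at(r, k) * b.at(k, col);
}

int main() {
    constexpr int D = 16;
    std::array<float, D * D> A{}, B{}, C{};
    for (int i = 0; i < D; ++i) { A[i * D + i] = 2.0f; B[i * D + i] = 3.0f; }  // A = 2I, B = 3I

    // The "kernel body": declare tiles, then bulk ops over them.
    tile<float, D, D> a, b, c;
    zero(c);
    load(a, A.data(), D, 0, 0);
    load(b, B.data(), D, 0, 0);
    mma_AB(c, a, b);
    store(C.data(), D, c, 0, 0);

    std::printf("C[0][0] = %.1f\n", C[0]);  // 2 * 3 = 6.0
    return 0;
}
```

The kernel body in a sketch like this would not change if the backend swapped in a different swizzle scheme or matrix instruction; only the operator implementations would, which is exactly the separation described above.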

Climbing out of the CUDA moat: Introducing HipKittens

We first explored ThunderKittens for NVIDIA, then ThunderMittens on Apple Silicon and now we’re excited to share HipKittens (HK) for AMD!

Figure: The Kittens cinematic universe! Towards multi-silicon AI!

HK kernels are performant, while remaining easy to read and modify! We might not need raw assembly for peak performance AMD kernels any more! Life is good!

Figure: HipKittens riding the ~wave~ (not warp).

Let's go through the results:

  1. Our attention forwards kernels are written in ~500 lines of code and outperform all of AMD's baselines on average, including the AITER kernels, which are written in hand-optimized assembly! We show different head dimensions (D) and sequence lengths (N), for both causal and non-causal settings.
  2. Our GEMM kernel features a hot loop that's <100 lines of code.
  3. We also get speedy attention backwards pass, rotary, and fused dropout-residual-layernorm kernels compared to the strongest available baselines! These results use head dimension 128, and we vary the sequence length.

Multi-silicon AI is coming!

Realizing AI's full potential requires diverse, open hardware. Today, that means making AMD GPUs truly accessible.

We want more AI in the world. AI has relied on and innovated on a single hardware provider, but we need to be able to use and experiment with all the kinds of compute we can. We need to be able to use the fastest hardware out there. We’re happy to help address these problems with HipKittens!

Check out part two for a technical deep dive on HK.

Links: Arxiv | Code
