使用SIMD CUDA指令进行更快的排序 (2024)

使用SIMD CUDA指令进行更快的排序 (2024)
Faster sorting with SIMD CUDA intrinsics (2024)

原始链接: https://winwang.blog/posts/bitonic-sort/

这篇博文探讨了位排序算法，这是一种适合GPU加速的并行排序算法，尤其适用于CUDA。位排序是一种排序网络，具有O(log²(n))的并行时间复杂度，在并行环境下优于顺序排序算法。作者强调了位排序算法在改进大型序列的基本情况排序方面的潜力，尤其是在使用SIMD（单指令多数据）原理进行硬件加速时。 CUDA的warp内联函数，特别是`__shfl_sync`，被用来在一个warp（32个线程的一组）内执行数据混洗，从而实现高效的warp内数据交换。使用`__shfl_sync`对位排序的基本实现进行基准测试表明，与使用共享内存进行线程间通信相比，性能提高了30%。作者建议利用这种优化的32路排序来加速归并排序算法中的成对合并过程，暗示未来将探索32路合并技术。文章还参考了之前关于GPU哈希图的工作，该工作实现了比Rust实现显著的性能提升。

Hacker News 的一个帖子讨论了一篇关于使用 SIMD CUDA 内联函数进行更快排序的博客文章。一位评论者 ashvardanian 澄清说，这篇文章关注的是 warp 级别的同步/交换原语，而不是更典型的“CUDA SIMD”，后者处理打包的 16 位或 8 位数据。他还提到了他使用 DPX 指令进行字符串处理的实验，但发现收益有限。作者 winwang 承认了 ashvardanian 的观点，并澄清说“SIMD”是宽泛地用来描述 warp 范围内的函数。他还称赞了 ashvardanian 的项目 StringZilla。讨论涉及到在 GPU 上使用 SWAR（寄存器内 SIMD）的潜力，ashvardanian 正在寻找更好的用例。其他评论者讨论了 GPU 排序的应用。DennisL123 询问了 GPU 对 16/32 位整数排序的性能与 CPU 基数排序相比如何。Winwang 建议说，对于 CPU 上的小型数据集，GPU 排序竞争力较弱，但对于大型数据集则更有效。Maeln 指出，当数据已驻留在 GPU 内存中时，例如在 3D 渲染器中对粒子进行 z 排序，它非常有用。ExDM69 设想使用 32 路 GPU 排序对可见性缓冲区渲染器中的三角形 ID 进行排序。

Full code on Github: https://github.com/wiwa/blog-code/

Hi

// load data into register
u32 datum = ...;
// Loop:
// calculate the lane to compare against
u32 other_lane = ...;
// shuffle!
u32 other_datum = __shfl_sync(gpu::all_lanes, datum, other_lane);
// swap if necessary between this lane and other_lane
// such that the smaller datum is in the lower lane index
if (lane < other_lane) {
    datum = min(datum, other_datum);
} else {
    datum = max(datum, other_datum);
}

u32 other_datum = shared_data[other_lane];
if (lane < other_lane) {
    datum = min(datum, other_datum);
} else {
    datum = max(datum, other_datum);
}
shared_data[lane] = datum;
// Big Green says implicit warp synchronization is an antipattern
__syncwarp();

使用SIMD CUDA指令进行更快的排序 (2024) Faster sorting with SIMD CUDA intrinsics (2024)

Hi Link to heading

What is a Bitonic Sort? Link to heading

Sorting and SIMD Link to heading

CUDA and Code Link to heading

What now? Link to heading

System Details Link to heading

使用SIMD CUDA指令进行更快的排序 (2024)
Faster sorting with SIMD CUDA intrinsics (2024)

Hi