Don't pass structs bigger than 16 bytes on AMD64

Original link: https://gist.github.com/FeepingCreature/5dff669aad380a123b15659e195fb96c


Or "How I sped up Neat by a factor of 2x".


If you check the related_post_gen benchmark, you will find that Neat, my language, has moved up a few spots. How did I achieve this? Did I implement new high-level optimizer passes that use language details to expose hidden optimization potential?

I changed arrays to be passed as three pointer parameters instead of one parameter consisting of a three-pointer struct. That's it.

This problem had been vexing me for a long time. Neat seemed weirdly slower than it should have been, particularly compared to D, and whenever I looked at a profiler I would see a lot of odd stack moves. The compiler seemed to be spending most of its time rearranging large chunks of the stack for function calls.

Why are Neat arrays three pointers? As opposed to D, Neat employs a refcounter. That means that arrays, in addition to start and end, also need a pointer to the base of the array object, where the reference count is stored. It turns out the reason that D arrays are fast and Neat arrays are so slow is just because having 24 bytes instead of 16 puts them into a different regime of parameter passing.
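For concreteness, here is a sketch of the two layouts (assumed field names, not Neat's exact codegen):

#include <stdio.h>
#include <stddef.h>

/* Neat-style array: slice bounds plus a pointer to the allocation
   that holds the reference count. Three pointers, 24 bytes on AMD64. */
struct NeatArray { double *start; double *end; void *base; };

/* D-style array: just length and pointer, 16 bytes. */
struct DArray { size_t length; double *ptr; };

int main(void)
{
    printf("Neat-style: %zu bytes, D-style: %zu bytes\n",
           sizeof(struct NeatArray), sizeof(struct DArray));
    return 0;
}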

If we check the SystemV AMD64 ABI specification (PDF), it tells us that any struct greater than 16 bytes is passed by pointer. ("If the size of the aggregate exceeds two eightbytes and the first eightbyte isn’t SSE or any other eightbyte isn’t SSEUP, the whole argument is passed in memory.") To pass a struct by memory, we allocate a struct-sized spot on the stack, fill it with the values we pass, then pass the pointer to the function.
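A minimal illustration of the two regimes (classification per the rule above; easy to check with clang -O2 -S):

struct Small { double x, y; };    /* two eightbytes, both class SSE:
                                     arrives in xmm0 and xmm1 */
struct Big   { double x, y, z; }; /* exceeds two eightbytes:
                                     passed in memory */

double take_small(struct Small s) { return s.x + s.y; }
double take_big(struct Big b)     { return b.x + b.y + b.z; }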

Now, LLVM is a very good optimizer, but this does not leave it much room. The value has to go on the stack, which means space must be reserved for it there; it must be copied out of the register it is probably living in; and LLVM has to track which parts of the stack are in use and which ones can be reused by another call, something it turns out to be pretty poor at.
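Spelled out in C, every call site is forced into something like this (a sketch of the calling sequence, not literal codegen; consume is a stand-in callee):

#include <string.h>

struct Vector3 { double x, y, z; };

double consume(const struct Vector3 *arg) /* stand-in for the real callee */
{
    return arg->x + arg->y + arg->z;
}

double caller(struct Vector3 v)
{
    struct Vector3 slot;            /* reserve a struct-sized stack spot */
    memcpy(&slot, &v, sizeof slot); /* fill it with the value we pass    */
    return consume(&slot);          /* hand its address to the function  */
}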

We can demonstrate the issue with this benchmark:

==========
harness.h:
==========

#define TYPE double

struct Vector { TYPE x, y, z; };

struct Vector vector_add_struct(struct Vector left, struct Vector right);

struct Vector vector_add_fields(
    TYPE left_x, TYPE left_y, TYPE left_z,
    TYPE right_x, TYPE right_y, TYPE right_z);

==========
harness.c:
==========

#include <stdio.h>
#include <stdlib.h>
#include "harness.h"

int main(int argc, const char *argv[])
{
    int mode = atoi(argv[1]);
    int length = atoi(argv[2]);
    struct Vector result = {0};
    if (mode == 0)
    {
        /* Pass the accumulator and increment as whole structs. */
        for (int i = 0; i < length; i++)
            result = vector_add_struct(result, (struct Vector) {1, 2, 3});
    }
    else
    {
        /* Pass the same values as six scalar parameters. */
        for (int i = 0; i < length; i++)
            result = vector_add_fields(
                result.x, result.y, result.z, 1, 2, 3);
    }
    printf("%f %f %f\n", result.x, result.y, result.z);
}

=======
impl.c:
=======

#include "harness.h"

struct Vector vector_add_struct(struct Vector left, struct Vector right)
{
    return (struct Vector) {
        left.x + right.x,
        left.y + right.y,
        left.z + right.z,
    };
}

struct Vector vector_add_fields(
    TYPE left_x, TYPE left_y, TYPE left_z,
    TYPE right_x, TYPE right_y, TYPE right_z)
{
    return (struct Vector) {
        left_x + right_x,
        left_y + right_y,
        left_z + right_z,
    };
}

As you can see, depending on the mode, this passes the values either as separate scalar parameters or as a single large struct. The mode and run length are passed on the command line to prevent the optimizer from constant-folding everything.

We must compile impl.c separately to avoid inlining:

clang -O3 impl.c -c -o impl.o
clang -O3 harness.c impl.o -o benchmark
time ./benchmark 0 1000000000
time ./benchmark 1 1000000000

This is hardly subtle: with just the change of passing three separate fields instead of a vector struct, I go from 12.3 seconds to 5.3 seconds!

If we check the assembly, we can indeed see that a lot of instructions are occupied with stack shuffles. In fact, a major benefit of the field version is that the parameters already enter the function in SSE registers, rather than having to be loaded from the stack every time. This was the whole point of the SystemV ABI and its focus on passing values in registers as much as possible, so it's kind of sad to see it fail here. I believe that, given the number of registers available on AMD64, a benchmark would have shown passing in registers to be valuable even for types above 16 bytes.

In fact, if you think about what the code does, by writing the fields on the stack and then (explicitly rather than implicitly) passing a pointer, the (new, performant) AMD64 System V ABI has effectively regressed to the old x86 cdecl ABI, where everything was passed on the stack! Cdecl, famously, was so notorious for its slowness that it spawned multiple calling conventions aimed just at making it fast.

Of course, in any real situation this code would be all inlined. In fact, turning on LTO with gcc (though interestingly not clang!) erases any performance difference between the two versions. But still, not every function can or should be inlined.
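An LTO build along these lines (assumed invocation, mirroring the flags above) is enough to reproduce that:

gcc -O3 -flto impl.c harness.c -o benchmark-lto
time ./benchmark-lto 0 1000000000
time ./benchmark-lto 1 1000000000

With -flto, the separate compilation of impl.c no longer prevents inlining, and both modes should run in the same time.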

Now, if you are calling a C API, you have to use the C ABI. But lots of high-level types internal to non-C languages, though they may be presented to the compiler's backend as a struct (such as Neat's three-pointer arrays), don't strictly speaking need to be expressed as one. You are the language writer, and it's up to you to decide how arrays, tuples, sumtypes etc. are passed. That's why I chose to pass all of those (if above 16 bytes) as individual fields instead, and the benchmark shows the benefit.
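At the signature level, the transformation looks roughly like this (illustrative names, not Neat's actual emitted code):

struct Array { double *start; double *end; void *base; }; /* 24 bytes */

/* By-struct ABI: the argument is spilled to the stack at every call. */
double sum_struct(struct Array arr);

/* Exploded ABI: three pointer-sized parameters, passed in the integer
   registers rdi, rsi and rdx instead. */
double sum_fields(double *start, double *end, void *base);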

So if you are on AMD64, and you're either working on languages or microoptimizing an API, I advise you to take the free speedup. You should at least benchmark to see if you can benefit from splitting structs above 16 bytes manually. Especially in inner loops, the benefit can be surprisingly large.

Addendum:

  • Q: Sure, maybe this is true for structs of pointers. But double is in class SSE, according to the spec. Why isn't the struct passed in SSE registers anyways?
  • A: Man I don't know. All I can tell you is that it doesn't. (By the rule quoted above, the likely mechanics: all three eightbytes of the struct classify as SSE, and because the eightbytes after the first are SSE rather than SSEUP, the whole aggregate is sent to memory. SSEUP only appears in the upper parts of wide vector types like __m256.)