Quantifying pass-by-value overhead

Original link: https://owen.cafe/posts/struct-sizes/

Driven by perfectionism, the author dove into benchmarking the performance of passing data by value versus by reference. The investigation unexpectedly led to the creation of a graphing library and a great deal of assembly analysis. The core finding: passing a struct by reference carries roughly the same overhead as passing a pointer-sized struct by value. For smaller structs, the difference is negligible. At around 257 bytes, however, a marked performance drop appears, tied to the `rep movs` instruction used for larger copies. That instruction seems to have a performance problem on AMD Zen* CPUs, showing up as periodic slowdowns; specifically, struct sizes of 4046-4080 and 8161-8176 bytes should be avoided. Structs up to 256 bytes can be copied efficiently with SIMD registers. The author suggests possible compiler tricks to mitigate the `rep movs` issue and notes the impressive throughput: 730 million 16-byte structs can be passed by value per second, versus 26 million for 2048-byte structs.

## Hacker News discussion: pass-by-value overhead and CPU anomalies

An article quantifying pass-by-value overhead sparked a Hacker News discussion that surfaced a strange performance anomaly on AMD Ryzen 3900X CPUs. Users noted significant slowdowns when passing data of particular sizes (around 4KB and 8KB) by value, which may be related to a CPU-level 4K-aliasing issue: loading data from a memory location offset by 4KB from the original location can cause problems. The discussion dug into cache line sizes (typically 64 bytes, but 128 bytes on Apple's M2), page sizes, and whether the problem stems from the hardware or from virtual address resolution. Some suggested that larger page sizes might mitigate the issue. A few commenters questioned the benchmark's real-world applicability, arguing that complex program interactions shape performance, not just isolated function calls. Others debated the semantics of "pass by value" versus "call by value" and the role of compiler optimizations. Ultimately, the discussion highlighted the intricate relationship among hardware, compilers, and calling conventions in determining performance.

Original article

Should you pass data by value, or by reference?

When you’re a certain kind of perfectionist, the kind that prevents you from being productive, this is a question that might plague you every time you write a function.

I’ve heard a few opinions floating around on this. I’ve heard “stack-to-stack copies are super cheap”, but… how cheap are they?


Fine, I’ll write a benchmark. Can’t take too long.

3… 2… (several months pass) 1… 🪄

And that’s how I ended up writing a graphing library, and a benchmark.


If you want to run these benchmarks on your own machine, or dump the assembly, you can check out the benchmark code.


This graph shows the overhead of passing structs of different sizes by value, on different machines.

In general, passing any struct by reference incurs the same overhead as passing a pointer-sized struct by value.

Okay, on a macro level, passing a function parameter by value takes time proportional to the size of the data.

Makes sense.

For smaller struct sizes, it doesn’t look like there’s a huge difference between passing 8 bytes, or passing 32.

In the tabs above, you can inspect the assembly for the benchmark loops of 1-byte and 32-byte structs.

The assembly says there’s a difference, but seemingly not one you’d notice. Technically the 32-byte value is being passed on the stack, using vector registers, whereas the 1-byte value is being passed in a register, but the speed difference is negligible.

╭─➤ sub    rsp,0x100
   movdqa xmm0,XMMWORD PTR [rsp+0x100]
   movups XMMWORD PTR [rsp],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x110]
   movups XMMWORD PTR [rsp+0x10],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x120]
   movups XMMWORD PTR [rsp+0x20],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x130]
   movups XMMWORD PTR [rsp+0x30],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x140]
   movups XMMWORD PTR [rsp+0x40],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x150]
   movups XMMWORD PTR [rsp+0x50],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x160]
   movups XMMWORD PTR [rsp+0x60],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x170]
   movups XMMWORD PTR [rsp+0x70],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x180]
   movups XMMWORD PTR [rsp+0x80],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x190]
   movups XMMWORD PTR [rsp+0x90],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1a0]
   movups XMMWORD PTR [rsp+0xa0],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1b0]
   movups XMMWORD PTR [rsp+0xb0],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1c0]
   movups XMMWORD PTR [rsp+0xc0],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1d0]
   movups XMMWORD PTR [rsp+0xd0],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1e0]
   movups XMMWORD PTR [rsp+0xe0],xmm0
   movdqa xmm0,XMMWORD PTR [rsp+0x1f0]
   movups XMMWORD PTR [rsp+0xf0],xmm0
   call   fefe80 <pass_256_fields_by_value>
   add    rsp,0x100
   sub    ebx,0x1
╰── jne    4927c0 <bench_256+0x100>
╭─➤ sub    rsp,0x110
   mov    ecx,0x20
   mov    rsi,rbp
   mov    rdi,rsp
   rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
   movzx  eax,BYTE PTR [rsi]
   mov    BYTE PTR [rdi],al
   call   fefe90 <pass_257_fields_by_value>
   add    rsp,0x110
   sub    ebx,0x1
╰── jne    492a00 <bench_257+0x110>

There’s a beautiful clean cliff here at 257 bytes. This seems to represent the difference between an unrolled, vectorized memcpy routine and one using rep movs (i.e. a microcoded for loop).

In the tabs above, you can inspect the assembly for 256 and 257 bytes.

╭─➤ sub    rsp,0x690
   mov    ecx,0xd1
   mov    rsi,rbp
   mov    rdi,rsp
   rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
   call   <pass_1672_fields_by_value>
   add    rsp,0x690
   sub    ebx,0x1
╰── jne    <bench_1672+0x120>
╭─➤ sub    rsp,0x6a0
   mov    ecx,0xd4
   mov    rsi,rbp
   mov    rdi,rsp
   rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
   call   <pass_1696_fields_by_value>
   add    rsp,0x6a0
   sub    ebx,0x1
╰── jne    <bench_1696+0x100>

Well this is interesting.

The period is 32 bytes, with 8 fast and 24 slow struct widths per period.

Now, you may think that this would be something to do with having to copy other leftover memory after rep movs, but that doesn’t seem to be the case. I’ve dumped the assembly of two functions above, one from the valley, and one from the hill, which have near-identical assembly.

Instead, it looks like the performance of rep movs itself is periodic.

╭─➤ sub    rsp,0xfe0
   mov    ecx,0x1fc
   mov    rsi,rbp
   mov    rdi,rsp
   rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
   call   <pass_4064_fields_by_value>
   add    rsp,0xfe0
   sub    ebx,0x1
╰── jne    <bench_4064+0x100>
╭─➤ sub    rsp,0xff0
   mov    ecx,0x1fd
   mov    rsi,rbp
   mov    rdi,rsp
   rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
   call   <pass_4072_fields_by_value>
   add    rsp,0xff0
   sub    ebx,0x1
╰── jne    <bench_4072+0x120>

Alright, what’s going on here?

Passing a struct of 4064 bytes takes 53ns, but passing one of 4065 bytes takes 222ns, roughly four times longer.

And yes, these results are reproducible. I can run the benchmark as many times as I want, and always get the same spikes.

Again, I’ve dumped what is essentially matching assembly from the non-hill and from the hill, in the tabs above.

Since the spike doesn’t persist for greater struct widths, this probably isn’t a case of hitting a CPU cache limit.

It would appear that the rep movs instruction has a serious performance bug for these very specific ranges, as implemented in AMD Zen* CPU microcode. If any AMD engineers know what’s going on, please shoot me an email :)

If you work on GCC, Clang, or another compiler, and want to add an absolutely disgusting hack that removes 1-2 rep movs iterations, and adds a few extra manual movs, on my CPU (and probably other AMD CPUs), I’d also love to hear about it.

  • Passing structs up to size 256 is very cheap, and uses SIMD registers.
  • Passing structs larger than 256 uses rep movs.
  • Unrolled vectorized moves seem to be faster than rep movs.
  • The performance of rep movs is probably periodic.
  • You can pass 730 million structs of size 16 by value per second.
  • You can pass 26 million structs of size 2048 by value per second.
  • Don’t pass data of size 4046-4080 bytes or 8161-8176 bytes by value (at least not on an AMD Ryzen 3900X).