Should you pass data by value, or by reference?
When you’re a certain kind of perfectionist, the kind that prevents you from being productive, this is a question that might plague you every time you write a function.
I’ve heard a few opinions floating around on this. One is that “stack-to-stack copies are super cheap”, but… how cheap are they, really?
Fine, I’ll write a benchmark. Can’t take too long.
3… 2… (several months pass) 1… 🪄
And that’s how I ended up writing a graphing library, and a benchmark.
If you want to run these benchmarks on your own machine, or dump the assembly, you can check out the benchmark code.
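The gist of what each benchmark measures looks roughly like this. It’s only a sketch of the idea, not the code linked above; the struct and function names are made up to mirror the pass_N_fields_by_value symbols you’ll see in the assembly dumps.

#include <stdio.h>
#include <time.h>

/* A struct with 32 one-byte fields; the real benchmark sweeps this size. */
struct fields_32 { char f[32]; };

/* noinline plus an asm barrier, so the by-value copy can't be optimized away. */
__attribute__((noinline)) char pass_32_fields_by_value(struct fields_32 s) {
    asm volatile("" : : "m"(s) : "memory");
    return s.f[0];
}

int main(void) {
    struct fields_32 s = {0};
    const long iterations = 100 * 1000 * 1000;

    struct timespec start, end;
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (long i = 0; i < iterations; i++)
        pass_32_fields_by_value(s);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.2f ns per call\n", ns / iterations);
    return 0;
}

Compile with optimizations (e.g. gcc -O2), or the loop overhead drowns out the call you’re trying to measure.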
This graph shows the overhead of passing structs of different sizes by value, on different machines.
In general, passing any struct by reference incurs the same overhead as passing a pointer-sized struct by value.
Okay, on a macro level, passing a function parameter by value takes time proportional to the size of the data.
Makes sense.
For smaller struct sizes, it doesn’t look like there’s a huge difference between passing 8 bytes and passing 32.
In the tabs above, you can inspect the assembly for the benchmark loops of 1-byte and 32-byte structs.
The assembly shows a difference, but seemingly not one you’d notice: technically, the 32-byte value is passed on the stack, using vector registers, whereas the 1-byte value is passed in a register, yet the speed difference is negligible.
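The ABI explains why: under the SysV x86-64 calling convention, a struct of up to 16 bytes can travel in registers, while anything larger is classified MEMORY and the caller copies it onto the stack. A tiny illustration (not the benchmark’s actual definitions):

struct bytes_1  { char f[1];  };
struct bytes_32 { char f[32]; };

/* The 1-byte struct arrives in a general-purpose register (dil);
 * the 32-byte struct is copied onto the stack by the caller,
 * here using 16-byte vector loads/stores. */
void pass_1_by_value(struct bytes_1 v);
void pass_32_by_value(struct bytes_32 v);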
╭─➤ sub rsp,0x100
│ movdqa xmm0,XMMWORD PTR [rsp+0x100]
│ movups XMMWORD PTR [rsp],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x110]
│ movups XMMWORD PTR [rsp+0x10],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x120]
│ movups XMMWORD PTR [rsp+0x20],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x130]
│ movups XMMWORD PTR [rsp+0x30],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x140]
│ movups XMMWORD PTR [rsp+0x40],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x150]
│ movups XMMWORD PTR [rsp+0x50],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x160]
│ movups XMMWORD PTR [rsp+0x60],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x170]
│ movups XMMWORD PTR [rsp+0x70],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x180]
│ movups XMMWORD PTR [rsp+0x80],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x190]
│ movups XMMWORD PTR [rsp+0x90],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1a0]
│ movups XMMWORD PTR [rsp+0xa0],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1b0]
│ movups XMMWORD PTR [rsp+0xb0],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1c0]
│ movups XMMWORD PTR [rsp+0xc0],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1d0]
│ movups XMMWORD PTR [rsp+0xd0],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1e0]
│ movups XMMWORD PTR [rsp+0xe0],xmm0
│ movdqa xmm0,XMMWORD PTR [rsp+0x1f0]
│ movups XMMWORD PTR [rsp+0xf0],xmm0
│ call fefe80 <pass_256_fields_by_value>
│ add rsp,0x100
│ sub ebx,0x1
╰── jne 4927c0 <bench_256+0x100>
╭─➤ sub rsp,0x110
│ mov ecx,0x20
│ mov rsi,rbp
│ mov rdi,rsp
│ rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
│ movzx eax,BYTE PTR [rsi]
│ mov BYTE PTR [rdi],al
│ call fefe90 <pass_257_fields_by_value>
│ add rsp,0x110
│ sub ebx,0x1
╰── jne 492a00 <bench_257+0x110>
There’s a beautiful, clean cliff here at 257 bytes. This seems to represent the difference between an unrolled, vectorized memcpy routine and one using rep movs (i.e. a microcoded for loop).
In the tabs above, you can inspect the assembly for 256 and 257 bytes.
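For reference, rep movs QWORD PTR is conceptually nothing more than the following loop, executed as microcode: copy rcx 8-byte words from [rsi] to [rdi], advancing both pointers as it goes. In C it would look something like this (illustrative only):

#include <stddef.h>
#include <stdint.h>

/* Roughly what rep movs QWORD PTR es:[rdi], ds:[rsi] does: count (rcx)
 * iterations of an 8-byte copy, advancing src (rsi) and dst (rdi) each step. */
static void rep_movsq_equivalent(uint64_t *dst, const uint64_t *src, size_t count) {
    while (count--)
        *dst++ = *src++;
}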
╭─➤ sub rsp,0x690
│ mov ecx,0xd1
│ mov rsi,rbp
│ mov rdi,rsp
│ rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
│ call <pass_1672_fields_by_value>
│ add rsp,0x690
│ sub ebx,0x1
╰── jne <bench_1672+0x120>
╭─➤ sub rsp,0x6a0
│ mov ecx,0xd4
│ mov rsi,rbp
│ mov rdi,rsp
│ rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
│ call <pass_1696_fields_by_value>
│ add rsp,0x6a0
│ sub ebx,0x1
╰── jne <bench_1696+0x100>
Well, this is interesting.
The period is 32 bytes, with 8 fast and 24 slow struct widths per period.
Now, you might think this has something to do with having to copy leftover memory after the rep movs, but that doesn’t seem to be the case. I’ve dumped two functions above, one from a valley and one from a hill, and their assembly is near-identical.
Instead, it looks like the performance of rep movs itself is periodic.
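One way to test that, independent of calling-convention codegen, would be to time rep movsq directly with a bit of inline assembly. This isn’t part of the linked benchmark, just a sketch of how I’d isolate the instruction (GCC/Clang on x86-64 assumed):

#include <stddef.h>

/* Copy a given number of 8-byte words from src to dst using only rep movsq,
 * so the instruction's own timing can be measured without any surrounding
 * copy code. rdi/rsi/rcx are read-write because the instruction consumes them. */
static inline void copy_with_rep_movsq(void *dst, const void *src, size_t qwords) {
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)
                 :
                 : "memory");
}

Sweeping the qword count over the same range as the graph should show whether the hills follow the count itself rather than the surrounding code.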
╭─➤ sub rsp,0xfe0
│ mov ecx,0x1fc
│ mov rsi,rbp
│ mov rdi,rsp
│ rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
│ call <pass_4064_fields_by_value>
│ add rsp,0xfe0
│ sub ebx,0x1
╰── jne <bench_4064+0x100>
╭─➤ sub rsp,0xff0
│ mov ecx,0x1fd
│ mov rsi,rbp
│ mov rdi,rsp
│ rep movs QWORD PTR es:[rdi],QWORD PTR ds:[rsi]
│ call <pass_4072_fields_by_value>
│ add rsp,0xff0
│ sub ebx,0x1
╰── jne <bench_4072+0x120>
Alright, what’s going on here?
Passing a struct of 4064 bytes takes 53ns, but passing one of 4065 bytes takes 222ns, which is roughly 4x as long.
And yes, these results are reproducible. I can run the benchmark as many times as I want, and always get the same spikes.
Again, I’ve dumped essentially matching assembly in the tabs above, one function from off the hill and one from on it.
Since the spike doesn’t persist for greater struct widths, this probably isn’t a case of hitting a CPU cache limit.
It would appear that the rep movs instruction has a serious performance bug
for these very specific ranges, as implemented in AMD Zen* CPU microcode.
If any AMD engineers know what’s going on, please shoot me an email :)
If you work on GCC, Clang, or another compiler, and want to add an absolutely disgusting hack that removes 1-2 rep movs iterations and adds a few extra manual movs on my CPU (and probably other AMD CPUs), I’d also love to hear about it.
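To make the idea concrete, here’s the shape of such a hack sketched in C rather than as a codegen patch. The slow-band predicate below is invented from the two bands I measured; a real patch would need the actual ranges per microarchitecture.

#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same rep movsq wrapper as in the earlier sketch. */
static inline void copy_with_rep_movsq(void *dst, const void *src, size_t qwords) {
    asm volatile("rep movsq"
                 : "+D"(dst), "+S"(src), "+c"(qwords)
                 :
                 : "memory");
}

/* Invented predicate: both measured slow bands (4065-4080 and 8161-8176 bytes)
 * happen to fall on 4065-4080 modulo 4096, but that's a guess from two points. */
static int in_slow_band(size_t qwords) {
    size_t bytes = (qwords * 8) % 4096;
    return bytes >= 4065 && bytes <= 4080;
}

/* The hack: shave just enough iterations off the rep movsq to exit the slow
 * band (1-2 in practice), then finish the tail with memcpy, which for a
 * handful of bytes compiles down to a few plain movs. */
static void copy_avoiding_slow_band(void *dst, const void *src, size_t qwords) {
    size_t shaved = 0;
    while (shaved < qwords && in_slow_band(qwords - shaved))
        shaved++;

    copy_with_rep_movsq(dst, src, qwords - shaved);
    memcpy((uint8_t *)dst + (qwords - shaved) * 8,
           (const uint8_t *)src + (qwords - shaved) * 8,
           shaved * 8);
}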
To sum up:

- Passing structs of up to 256 bytes by value is very cheap, and uses SIMD registers.
- Passing structs larger than 256 bytes uses rep movs.
- Unrolled vectorized moves seem to be faster than rep movs.
- The performance of rep movs is probably periodic.
- You can pass 730 million structs of size 16 by value per second.
- You can pass 26 million structs of size 2048 by value per second.
- Don’t pass around data of size 4065-4080 bytes or 8161-8176 bytes by value (at least not on an AMD Ryzen 3900X).