When O3 is 2x slower than O2

Original link: https://cat-solstice.github.io/test-pqueue/

## A Performance Trap in a Custom Priority Queue

This article details the investigation of a surprising performance regression encountered while optimizing a custom bounded priority queue in Rust. Benchmarks showed performance **dropping by 123%** when compiling with `opt-level=3` compared to `opt-level=2`, despite the expectation of an improvement. The queue uses a sorted `Vec` with binary-search insertion, a design forced by a uniqueness constraint that prevents the effective use of a heap. The problem stems from the comparison function used in the binary search, and specifically from how it handles floating-point comparisons. Analysis with flamegraphs and Compiler Explorer (godbolt) shows that `opt-level=3` generates code using **conditional moves instead of conditional jumps** in the search loop. While this looks like an optimization, it actually leads to noticeably *worse* performance. uiCA simulations suggest the cause is increased data dependencies. Further experiments with different comparison implementations (including `f32::total_cmp`) and subtle code tweaks gave mixed results, highlighting the complexity of modern CPU behavior and compiler optimization. The author concludes that benchmarking is hard; although the root cause remains somewhat unclear, the experience shows that seemingly beneficial optimizations can unexpectedly degrade performance. The code and benchmarks are available [here](link to repository).

## Hacker News Discussion Summary: O3 vs O2 Performance in Rust

A Hacker News discussion centers on a blog post describing a surprising performance regression: Rust's O3 optimization level can sometimes produce code slower than O2. The core issue appears to lie in how LLVM optimizes branches, particularly around floating-point comparisons. The discussion highlights that in some situations a branchless approach (used at O3) can be slower due to the extra comparisons, whereas O2's branches are predicted very effectively. Users suggested compiler hints (`__builtin_expect_with_probability`) or restructuring the code to guide the optimizer. Beyond the specific case, the thread sparked broader debates about Rust's syntax (which some find overly complex), the importance of benchmarking with optimized builds, and the trade-off between code clarity and performance. Many commenters also shared experiences with unexpected optimization behavior in C++, underscoring the challenges of modern compiler optimization. The discussion also touched on the balance between code aesthetics and real-world performance, with some defending Rust's design choices and others advocating simpler, more readable code.

Original article

While trying to optimize a custom bounded priority queue, I ran into a pathological case and dug deeper and deeper to understand the issue. At some point I decided to collect the data and write this article to share the journey. You can find the repo here.


Here are the results of two benchmark runs using criterion, to justify the click-bait-y title: running the benchmark with opt-level=3 introduces a performance penalty of +123% vs opt-level=2. You should probably not override opt-level like this with criterion, but it is an easy way to make the point.

RUSTFLAGS="-C target-cpu=haswell -C opt-level=2" cargo bench
[...]
pqueue-insert/Priority Queue Insert
  time:   [963.15 ns 963.36 ns 963.56 ns]
Found 253 outliers among 10000 measurements (2.53%)
  7 (0.07%) low mild
  229 (2.29%) high mild
  17 (0.17%) high severe
RUSTFLAGS="-C target-cpu=haswell -C opt-level=3" cargo bench
[...]
pqueue-insert/Priority Queue Insert
  time:   [2.1536 µs 2.1540 µs 2.1545 µs]
  change: [+123.53% +123.60% +123.66%] (p = 0.00 < 0.05)
  Performance has regressed.
Found 203 outliers among 10000 measurements (2.03%)
  11 (0.11%) low mild
  185 (1.85%) high mild
  7 (0.07%) high severe

I’m targeting the Haswell architecture because it is widely available and supports AVX2, FMA3 and BMI2. I ran this benchmark on an AMD Zen 3 5800X and an Intel 4790 (non-K) - an actual Haswell CPU - with the same behavior on both.

The example and the bench build a top-k of distinct pair-like (id, dist) elements from a collection of random elements. The uniqueness constraint on id makes a binary heap inefficient here, so instead I maintain a sorted Vec, inserting each new element at the correct position.

#[derive(Debug, Clone, Copy)]
pub struct Neighbor {
  pub id: u32,
  pub dist: f32,
}

pub struct Queue {
  neighbors: Vec<Neighbor>,
  capacity: NonZeroUsize,
}

The queue uses a binary search on the vec to find the position at which to insert a new element. Elements should be ordered by dist then by id, and as you may know, floats are a bit tricky to compare. Here is the full code of the queue's insert method:

pub fn insert( &mut self, neighbor: Neighbor ) {
  // this compare function triggers the pathological behavior; we will come back to it later
  let cmp = |other: &Neighbor| -> Ordering {
    if other.dist < neighbor.dist { Ordering::Less }
    else if other.dist == neighbor.dist { other.id.cmp(&neighbor.id) }
    else { Ordering::Greater }
  };

  if let Err( pos ) = self.neighbors.binary_search_by( cmp ) && pos < self.capacity.get() {
    if self.neighbors.len() == self.capacity.get() {
      _ = self.neighbors.pop();
    }
    unsafe { std::hint::assert_unchecked( self.neighbors.len() < self.neighbors.capacity() ) };
    self.neighbors.insert( pos, neighbor );
  }
}
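To make the insert semantics concrete, here is a minimal, self-contained sketch of the queue. The `with_capacity` constructor body and the `as_slice` accessor are my additions (the post only shows the struct and `insert`); the let-chain is desugared into nested `if`s and the `assert_unchecked` hint is dropped for brevity:

```rust
use std::cmp::Ordering;
use std::num::NonZeroUsize;

#[derive(Debug, Clone, Copy)]
pub struct Neighbor {
    pub id: u32,
    pub dist: f32,
}

pub struct Queue {
    neighbors: Vec<Neighbor>,
    capacity: NonZeroUsize,
}

impl Queue {
    // Assumed constructor: the post's benchmark calls it but doesn't show it.
    pub fn with_capacity(capacity: NonZeroUsize) -> Self {
        Self { neighbors: Vec::with_capacity(capacity.get()), capacity }
    }

    // Same logic as the post's insert, with the let-chain desugared.
    pub fn insert(&mut self, neighbor: Neighbor) {
        let cmp = |other: &Neighbor| -> Ordering {
            if other.dist < neighbor.dist { Ordering::Less }
            else if other.dist == neighbor.dist { other.id.cmp(&neighbor.id) }
            else { Ordering::Greater }
        };
        // Err(pos) means "not already present"; Ok means duplicate id+dist.
        if let Err(pos) = self.neighbors.binary_search_by(cmp) {
            if pos < self.capacity.get() {
                if self.neighbors.len() == self.capacity.get() {
                    let _ = self.neighbors.pop(); // evict the current worst
                }
                self.neighbors.insert(pos, neighbor);
            }
        }
    }

    pub fn as_slice(&self) -> &[Neighbor] {
        &self.neighbors
    }
}

fn main() {
    let mut q = Queue::with_capacity(NonZeroUsize::new(2).unwrap());
    q.insert(Neighbor { id: 1, dist: 0.5 });
    q.insert(Neighbor { id: 2, dist: 0.1 });
    q.insert(Neighbor { id: 3, dist: 0.9 }); // rejected: full, dist too large
    q.insert(Neighbor { id: 1, dist: 0.5 }); // rejected: duplicate
    let dists: Vec<f32> = q.as_slice().iter().map(|n| n.dist).collect();
    assert_eq!(dists, vec![0.1, 0.5]); // sorted ascending, best-k retained
    println!("{:?}", q.as_slice());
}
```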

This section is a bit of a disclaimer. It is hard to be sure what you actually measure with a synthetic benchmark, and even harder to draw clear conclusions from it. Maybe I’m wrong here and there, maybe I’m wrong from start to finish.

Here is the benchmark loop:

  let neighbors = generate_random_neighbors();
  let mut queue = Queue::with_capacity( NonZeroUsize::new(64).unwrap() );
  bencher.iter( || {
    queue.clear();
    for neighbor in neighbors.iter() {
      queue.insert(black_box( *neighbor ));
    }
    black_box( &queue );
  });

The data to be inserted is generated from a seeded uniform random distribution for dist and a randomized range for the ids. We can change the size of the dataset and/or the capacity of the queue to test different workloads.
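The post doesn't show `generate_random_neighbors`; here is a deterministic stand-in of my own (a xorshift PRNG, no external crates, returning `(id, dist)` tuples to stay self-contained) that captures the seeded-and-reproducible property described above:

```rust
// Hypothetical stand-in for the post's data generator: a seeded
// xorshift64 PRNG producing (id, dist) pairs with dist in [0, 1).
fn generate_random_neighbors(n: usize, seed: u64) -> Vec<(u32, f32)> {
    let mut state = seed.max(1); // xorshift state must be non-zero
    let mut next = move || {
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        state
    };
    (0..n)
        .map(|_| {
            let id = next() as u32;
            // map 24 random bits to a uniform float in [0, 1)
            let dist = (next() >> 40) as f32 / (1u64 << 24) as f32;
            (id, dist)
        })
        .collect()
}

fn main() {
    let a = generate_random_neighbors(1000, 42);
    assert_eq!(a.len(), 1000);
    // Seeded: the same seed reproduces the same dataset.
    assert_eq!(a, generate_random_neighbors(1000, 42));
    assert!(a.iter().all(|&(_, d)| (0.0..1.0).contains(&d)));
}
```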

We will start going deeper using flamegraph to try to spot the issue. For this, we will use the ex-pqueue example, and we should set debug = 1 in the [profile.release] section to help the profiler match assembly code to Rust code.

[profile.release]
debug = 1
codegen-units = 1
lto = "thin"

You may have to install linux-tools and hint where to find perf to be able to run cargo flamegraph like this:

PERF=/usr/lib/linux-tools-5.15.0-160/perf \
RUSTFLAGS="-C target-cpu=haswell -C opt-level=2" \
cargo flamegraph -o ex-pqueue-opt-level-2.svg --example ex-pqueue
[...]
done in 2921ms
[ perf record: Woken up 714 times to write data ]
[ perf record: Captured and wrote 178.494 MB perf.data (2913 samples) ]
Running perf script [0s]: writing flamegraph to "ex-pqueue-opt-level-2.svg"

flamegraph, opt-level=2

PERF=/usr/lib/linux-tools-5.15.0-160/perf \
RUSTFLAGS="-C target-cpu=haswell -C opt-level=3" \
cargo flamegraph -o ex-pqueue-opt-level-3.svg --example ex-pqueue
[...]
done in 6521ms
[ perf record: Woken up 1594 times to write data ]
[ perf record: Captured and wrote 398.344 MB perf.data (6501 samples) ]
Running perf script [0s]: writing flamegraph to "ex-pqueue-opt-level-3.svg"

flamegraph, opt-level=3

Here we can see that the percentage of samples in binary_search_by goes from 44.15% to 79.62% - that’s +80%. The compare function goes from 25.88% to 63.57%, which is +145%. Compiler optimizations may (and will) merge or split assembly fragments, so it’s not always obvious which fragment comes from which function in the Rust source code, even with the help of debug information.

Having a look at the binary_search_by method from core::slice:

  // This loop intentionally doesn't have an early exit if the comparison
  // returns Equal. We want the number of loop iterations to depend *only*
  // on the size of the input slice so that the CPU can reliably predict
  // the loop count.
  while size > 1 {
    let half = size / 2;
    let mid = base + half;

    // SAFETY: the call is made safe by the following invariants:
    // - `mid >= 0`: by definition
    // - `mid < size`: `mid = size / 2 + size / 4 + size / 8 ...`
    let cmp = f(unsafe { self.get_unchecked(mid) });

    // Binary search interacts poorly with branch prediction, so force
    // the compiler to use conditional moves if supported by the target
    // architecture.
    base = hint::select_unpredictable(cmp == Greater, base, mid);

    // This is imprecise in the case where `size` is odd and the
    // comparison returns Greater: the mid element still gets included
    // by `size` even though it's known to be larger than the element
    // being searched for.
    //
    // This is fine though: we gain more performance by keeping the
    // loop iteration count invariant (and thus predictable) than we
    // lose from considering one additional element.
    size -= half;
  }
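For reference, `hint::select_unpredictable(cond, a, b)` evaluates to `a` when `cond` is true and `b` otherwise, exactly like an if-expression; the hint only steers LLVM toward a conditional move. A hand-rolled branchless equivalent for `usize` shows the idea (an illustration of the technique, not the std implementation):

```rust
// Branchless select: mask is all-ones when cond is true, all-zeros
// otherwise, so the result picks `a` or `b` without any jump.
fn select_branchless(cond: bool, a: usize, b: usize) -> usize {
    let mask = (cond as usize).wrapping_neg();
    (a & mask) | (b & !mask)
}

fn main() {
    assert_eq!(select_branchless(true, 10, 20), 10);
    assert_eq!(select_branchless(false, 10, 20), 20);
    println!("select ok");
}
```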

The comment about branch prediction and conditional moves should ring a bell. Time for disassembly!

We will now use Compiler Explorer (godbolt) to look at the assembly emitted by the compiler. Here is a link to a curated version of the queue and the assembly for the Queue::insert() method (search for insert: in the assembly). The first pane is compiled with opt-level=2 and the second with opt-level=3.

We are looking for the generated code of the search loop. I annotated some operations to help, but it helps to know some assembly to follow along. Here is the relevant snippet using opt-level=2:

.LBB3_11:
    mov       rax, r8
.LBB3_12:
    sub       rdx, r9                            ; size -= half
    mov       r8, rax
    cmp       rdx, 1                             ; while size > 1
    jbe       .LBB3_3                            ; exit loop
.LBB3_8:
    mov       r9, rdx
    shr       r9                                 ; half = size / 2
    lea       rax, [r9 + r8]                     ; mid = half + base
    vmovss    xmm1, dword ptr [rcx + 8*rax + 4]
    vucomiss  xmm0, xmm1                         ; cmp neighbor.dist other.dist
    ja        .LBB3_12
    vucomiss  xmm1, xmm0                         ; cmp other.dist neighbor.dist
    jne       .LBB3_11
    jp        .LBB3_11
    cmp       dword ptr [rcx + 8*rax], esi       ; cmp other.id neighbor.id
    ja        .LBB3_11
    jmp       .LBB3_12

A straightforward implementation with two comparisons on the float and a total of 5 conditional jumps, one of them to exit the loop. Nothing fancy here. Note the use of jp (parity flag) to check for NaNs.
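The parity check exists because NaN comparisons are unordered: both `<` and `==` return false, so the post's closure falls through to `Ordering::Greater` and NaN entries sort last. A quick check of that behavior (the id tie-break is elided here to keep the function standalone):

```rust
use std::cmp::Ordering;

// Same shape as the post's compare closure, without the id tie-break.
// With a NaN argument, both `<` and `==` are false, so we fall
// through to Greater.
fn cmp_dist(other: f32, needle: f32) -> Ordering {
    if other < needle { Ordering::Less }
    else if other == needle { Ordering::Equal }
    else { Ordering::Greater }
}

fn main() {
    // All ordered comparisons involving NaN are false.
    assert!(!(f32::NAN < 1.0) && !(f32::NAN == 1.0) && !(f32::NAN > 1.0));
    // Hence NaN compares as Greater from either side.
    assert_eq!(cmp_dist(f32::NAN, 1.0), Ordering::Greater);
    assert_eq!(cmp_dist(1.0, f32::NAN), Ordering::Greater);
    println!("NaN comparisons are unordered");
}
```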

Now the snippet for opt-level=3:

.LBB3_8:
    mov       r9, rdx
    shr       r9                                 ; half = size / 2
    lea       r10, [r9 + r8]                     ; mid = half + base
    cmp       dword ptr [rcx + 8*r10], esi       ; cmp other.id neighbor.id
    mov       rax, r10
    cmova     rax, r8
    vucomiss  xmm0, dword ptr [rcx + 8*r10 + 4]  ; cmp neighbor.dist other.dist
    cmovne    rax, r8
    cmovp     rax, r8
    cmova     rax, r10
    sub       rdx, r9                            ; size -= half
    mov       r8, rax
    cmp       rdx, 1                             ; while size > 1
    ja        .LBB3_8                            ; exit loop

As you can see, the code is rearranged from one version to the other. This version is shorter and there is only one conditional jump, to exit the loop. The 4 conditional jumps are replaced by 4 conditional moves. That’s pretty cool, to be honest.

But oh the irony: when the compiler successfully emits a compare function that actually produces conditional moves instead of conditional jumps, performance plummets.

Modern CPUs are very complex beasts and they try their best to run our code as fast as possible. They may reorder operations, introduce instruction-level parallelism, and perform branch prediction and speculative execution to squeeze the maximum work out of each CPU cycle.

Each instruction has a latency (the number of cycles needed to perform it) and can be executed only by a selected set of execution units reached through dedicated ports, which ultimately dictates its (maximum) throughput. At the same time, our code’s logic introduces dependencies between operations.
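Dependencies matter because a chain where each result feeds the next cannot overlap, while independent operations can issue in parallel. A toy illustration of the same computation expressed both ways (the results are identical; only the dependency structure differs):

```rust
// One long dependency chain: every add waits on the previous result.
fn sum_serial(v: &[u64]) -> u64 {
    v.iter().fold(0u64, |acc, &x| acc.wrapping_add(x))
}

// Four independent accumulators: the adds in one iteration don't
// depend on each other, so the CPU can execute them in parallel.
fn sum_4way(v: &[u64]) -> u64 {
    let mut acc = [0u64; 4];
    for chunk in v.chunks(4) {
        for (i, &x) in chunk.iter().enumerate() {
            acc[i] = acc[i].wrapping_add(x);
        }
    }
    acc.iter().fold(0u64, |a, &x| a.wrapping_add(x))
}

fn main() {
    let v: Vec<u64> = (1..=1000).collect();
    assert_eq!(sum_serial(&v), 500500);
    assert_eq!(sum_4way(&v), sum_serial(&v));
}
```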

We can try to see how our assembly code may perform at the CPU level using uiCA, which lets us simulate the execution of a snippet of assembly, iterating over it. We are doing a theoretical nano-benchmark here!

You can then Run! the simulation and uiCA will produce a predicted throughput along with a detailed output. Beware: the throughput is given in cycles per iteration, so higher means slower!

Throughput (in cycles per iteration): 5.00
Bottleneck: Ports

The following throughputs could be achieved if the given property were the only bottleneck:

  - LSD: 4.00
  - Issue: 4.00
  - Ports: 5.00
  - Dependencies: 2.00

There is detailed information for each assembly operation in the table (not reported here). You can also see the execution trace over several iterations by following the Open Trace link. They may be a bit overwhelming to include in this article, but I encourage you to check them out!

Throughput (in cycles per iteration): 13.81
Bottleneck: Dependencies

The following throughputs could be achieved if the given property were the only bottleneck:

  - LSD: 5.00
  - Issue: 5.25
  - Ports: 4.13
  - Dependencies: 14.00

The assembly using conditional moves has a predicted throughput 2.7x worse! uiCA hints at dependencies as the potential bottleneck. This should not come as a surprise: conditional moves have long been known to be a double-edged sword.

At this point, you may think: “So many cmovs! Your compare function is odd!” While one could argue about this implementation, I used it here because the optimization level changes the emitted assembly for this particular one. I tested other implementations, and they consistently produce either jumps (fast) or moves (slow), regardless of opt-level.

We can try with this one which uses f32::total_cmp:

let cmp = |other: &Neighbor| -> Ordering {
  match other.dist.total_cmp( &neighbor.dist ) {
    Ordering::Equal => other.id.cmp( &neighbor.id ),
    ordering => ordering,
  }
};
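Unlike the first closure, total_cmp implements the IEEE 754 totalOrder predicate, so NaNs get a definite position (positive NaN above +∞, negative NaN below −∞) and -0.0 sorts below 0.0:

```rust
use std::cmp::Ordering;

fn main() {
    // totalOrder: -NaN < -inf < ... < -0.0 < 0.0 < ... < inf < NaN
    assert_eq!(f32::NAN.total_cmp(&f32::INFINITY), Ordering::Greater);
    assert_eq!((-f32::NAN).total_cmp(&f32::NEG_INFINITY), Ordering::Less);
    // total_cmp distinguishes -0.0 from 0.0, unlike `==`.
    assert_eq!((-0.0f32).total_cmp(&0.0), Ordering::Less);
    assert!(-0.0f32 == 0.0f32);
    println!("total_cmp is a total order");
}
```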

total_cmp works with the bit representation of the f32, which adds some bit-twiddling operations but removes the special case for NaNs. Using opt-level=3, here is the emitted assembly:

.LBB3_14:
    cmp       dword ptr [rcx + 8*rax], esi       ; cmp other.id neighbor.id
    seta      r11b
.LBB3_16:
    test      r11b, r11b
    cmovne    rax, r9
    sub       r8, r10                            ; size -= half
    mov       r9, rax
    cmp       r8, 1                              ; while size > 1
    jbe       .LBB3_4                            ; exit loop
.LBB3_13:
    mov       r10, r8
    shr       r10                                ; half = size / 2
    lea       rax, [r10 + r9]                    ; mid = half + base
    mov       r11d, dword ptr [rcx + 8*rax + 4]
    mov       ebp, r11d
    sar       ebp, 31                            ; bit twiddling
    shr       ebp                                ; bit twiddling
    xor       ebp, r11d                          ; bit twiddling
    cmp       ebp, edx                           ; cmp other.dist neighbor.dist
    je        .LBB3_14
    setg      r11b
    jmp       .LBB3_16

Nice code again, only one conditional jump and one conditional move. Surely a single cmovne is faster, right?

RUSTFLAGS="-C target-cpu=haswell -C opt-level=3" cargo bench
[...]
pqueue-insert/Priority Queue Insert
  time:   [2.0744 µs 2.0750 µs 2.0756 µs]
  change: [−3.7051% −3.6691% −3.6344%] (p = 0.00 < 0.05)
  Performance has improved.
Found 69 outliers among 10000 measurements (0.69%)
  18 (0.18%) low mild
  51 (0.51%) high mild

Actually, it is barely faster than with 4 conditional moves and still 2x slower than with conditional jumps (flamegraph for completeness).

Let’s check what uiCA predicts for this piece of assembly: uiCA, total_cmp

Throughput (in cycles per iteration): 15.50
Bottleneck: unknown

The following throughputs could be achieved if the given property were the only bottleneck:

  - LSD: 5.00
  - Issue: 5.25
  - Ports: 5.00
  - Dependencies: 15.00

Same story. If we trust uiCA, the dependencies are killing it.

You may wonder what we could do about this. I copy-pasted the code of binary_search_by and tried some tweaks:

  • add a hint::black_box around the call to the compare function

Using the first compare function, it degrades the O2 perf by +10% while improving O3 by -50%, so they are now neck and neck. The generated assembly is different in O3, though: there is a mix of one conditional move and 3 conditional jumps. Meanwhile, with the compare function that uses total_cmp, the performance plummets even more and is now 3x slower. That’s wild!

  • replace the hint::select_unpredictable with the worst match statement possible:
match f( unsafe { slice.get_unchecked(mid) } ) {
  Ordering::Less | Ordering::Equal => base = mid,
  Ordering::Greater => {},
}

The performance of all tested cases is now similar (and good), and there are no more conditional moves in the generated assembly.
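Putting that tweak into a standalone function makes it easy to experiment with: the sketch below is modeled on the std loop, with the select_unpredictable line replaced by the match (my own reconstruction, not the exact std source):

```rust
use std::cmp::Ordering;

// Branch-based variant of the std binary-search loop: the `match`
// replaces `hint::select_unpredictable`, so LLVM emits conditional
// jumps rather than conditional moves.
fn binary_search_branchy<T, F>(slice: &[T], mut f: F) -> Result<usize, usize>
where
    F: FnMut(&T) -> Ordering,
{
    let mut size = slice.len();
    let mut base = 0usize;
    while size > 1 {
        let half = size / 2;
        let mid = base + half;
        match f(&slice[mid]) {
            Ordering::Less | Ordering::Equal => base = mid,
            Ordering::Greater => {}
        }
        size -= half;
    }
    // Final probe of the single remaining candidate.
    if size == 1 {
        match f(&slice[base]) {
            Ordering::Equal => return Ok(base),
            Ordering::Less => base += 1,
            Ordering::Greater => {}
        }
    }
    Err(base)
}

fn main() {
    let v = [1, 3, 5, 7, 9];
    for (i, &x) in v.iter().enumerate() {
        assert_eq!(binary_search_branchy(&v, |&p| p.cmp(&x)), Ok(i));
    }
    assert_eq!(binary_search_branchy(&v, |&p| p.cmp(&4)), Err(2));
    assert_eq!(binary_search_branchy(&v, |&p| p.cmp(&0)), Err(0));
    assert_eq!(binary_search_branchy(&v, |&p| p.cmp(&10)), Err(5));
    println!("branchy binary search ok");
}
```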


I dug into the history of the implementation of binary_search_by and here are the most relevant links I found:

  • Improve SliceExt::binary_search performance #45333
  • Binary search performance regressed in 1.25 #53823
  • Add select_unpredictable to force LLVM to use CMOV #128250

The performance regression issue tells us there are benchmarks where conditional moves are faster than conditional jumps, and I bet they were conducted by people who know better than I do. In my tests, the results varied between -10% and +200%. According to uiCA, the bottleneck appears to be dependency-related, but I’m not entirely sure the simulation is accurate, and it may represent a worst-case scenario.

Benchmarking is hard, and I might be stretching a bit too much here. I hope you enjoyed the ride and learned something along the way!
