Why Is Performance Optimization So Hard
Performance optimization is hard because it's fundamentally a brute-force task

Original link: https://purplesyringa.moe/blog/why-performance-optimization-is-hard-work/

In a Reddit post from April 29, 2025, a developer vented their frustrations with performance optimization. It is a laborious brute-force task, and optimizations that interact in complex ways can even make code slower. The author stresses the importance of trial and error, noting that "dumb" algorithms sometimes beat "smart" ones due to factors such as vectorization or branch misprediction. They emphasize benchmarking and profiling despite their limitations, and caution against blindly trusting the compiler, which is better at zero-cost abstractions than at genuine optimization. The compiler's shortcomings make it necessary to inspect disassembly and sometimes hand-tune code to nudge the compiler along. The author laments the lack of adequate tooling and the difficulty of optimizing for poorly documented platforms such as Apple Silicon. They argue for achieving the "best possible" result even when it falls short of the theoretical optimum, and stress that even small optimizations compound into a better user experience.

This Hacker News thread explores the challenges of performance optimization, prompted by an article claiming it is fundamentally a brute-force task. Many commenters argue that optimization requires deep understanding and methodical experimentation that go beyond intuition, relying on careful measurement and profiling. Some find optimization enjoyable, even meditative, thanks to its tight feedback loop. Others stress the importance of theoretical calculation and high-level thinking to avoid getting stuck in local minima, noting that architectural optimizations can yield significant gains. The discussion also touches on the limitations of tooling and the need to better understand CPU behavior. Some mention practical approaches such as simplifying code, caching, and algorithmic improvements. While brute-force testing has its place, the thread as a whole emphasizes that a balanced approach combining theoretical understanding with data-driven analysis is needed to achieve the best performance.
Related articles
  • Four Kinds of Optimization 2025-04-16
  • Ask HN: How to learn about performance optimization? 2024-03-04
  • (comments) 2023-11-30
  • (comments) 2024-06-11
  • (comments) 2025-04-29

  • Original article
    Reddit

    I’m not talking about skill, knowledge, or convincing a world focused on radical acceleration that optimization is necessary. Performance optimization is hard because it’s fundamentally a brute-force task, and there’s nothing you can do about it.

    This post is a bit of a rant on my frustrations with code optimization. I’ll also try to give actionable advice, which I hope enhances your experience.

    Certain optimizations can only work together, while others lead to pessimizations when combined. To be an expert means to know what optimization avenues exist; to be a master means to know which ones to choose.

    I have a post on integer formatting in the works, covering a very particular algorithm design – and I still haven’t finished it because there’s like five different choices to make, I have no idea how they impact each other, and I need to analyze 25 variants to claim in good conscience which one’s the best. Several of my projects are similarly stuck because I don’t have the willpower to implement a dozen combinations.

    Pruning “obviously” suboptimal approaches is nothing but a heuristic. I like to think I’m more in tune with an x86-64 CPU than most people, and it still manages to surprise me from time to time. Dumb algorithms can become more applicable due to vectorization, and smart code can fail due to branch misprediction or store-to-load forwarding gone wrong.

    Optimization takes a lot of trial and error. I dislike the “intuition doesn’t work, profile your code” mantra because it seemingly says profiling is a viable replacement for theoretical calculations, which it isn’t. But I can’t argue that profiling is avoidable. I often joke that perf report is my go-to disassembler.

    Worse yet, you can’t trust “obviously” good code either. In a previous post, I optimized a single linear pass by replacing it with a superlinear sort. This is by no means a unique experience: just yesterday, I saw someone optimize best-in-class Barrett reduction by dividing the numbers as doubles, rounding them, and computing the remainder from the quotient. It’s so stupid that it can’t possibly work, yet it does.

    The good news is that the work here can be split among multiple people trying different approaches. Open-source projects in particular benefit from this, because contributors typically have different strengths and focus on different ideas. Try to reuse work by consulting your teammates or reading about others’ experiences in solving similar tasks.

    A variation on this is algorithms where a cut-off boundary is present. You no longer just choose whether to apply an optimization: you also need to select its parameters via more trial and error. For example:

    • Hybrid sorting algorithms can switch between different implementations due to high big-O constants (a sketch follows this list).
    • FFT can switch between recursive and iterative approaches to better utilize processor cache.
    • Depending on data density, the optimal set structure might be bitsets, hash sets, or complementary hash sets.
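
    As a sketch of the first case, consider a hybrid merge sort with a hypothetical cut-off of 32. The constant is made up; only benchmarking on the target CPU and data reveals the real optimum, and it has to be re-measured whenever either branch changes.

    // Hypothetical cut-off below which the O(n^2) algorithm with a tiny
    // constant factor wins; re-benchmark whenever either branch changes.
    const CUTOFF: usize = 32;

    fn insertion_sort<T: Ord + Copy>(data: &mut [T]) {
        for i in 1..data.len() {
            let mut j = i;
            while j > 0 && data[j - 1] > data[j] {
                data.swap(j - 1, j);
                j -= 1;
            }
        }
    }

    fn hybrid_sort<T: Ord + Copy>(data: &mut [T]) {
        if data.len() <= CUTOFF {
            insertion_sort(data);
            return;
        }
        let mid = data.len() / 2;
        hybrid_sort(&mut data[..mid]);
        hybrid_sort(&mut data[mid..]);
        // Merge the two sorted halves through a temporary buffer.
        let mut merged = Vec::with_capacity(data.len());
        let (left, right) = data.split_at(mid);
        let (mut i, mut j) = (0, 0);
        while i < left.len() && j < right.len() {
            if left[i] <= right[j] {
                merged.push(left[i]);
                i += 1;
            } else {
                merged.push(right[j]);
                j += 1;
            }
        }
        merged.extend_from_slice(&left[i..]);
        merged.extend_from_slice(&right[j..]);
        data.copy_from_slice(&merged);
    }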

    Modifying either of the alternative algorithms requires rebenchmarking to update the optimal boundary. Small modifications here can lead to drastic end performance changes due to interactions with CPU cache, branch and memory access prediction, the discrete nature of recursive cut-offs, and floating-point precision (for big integer multiplication via FFT). Forgetting to rebenchmark and abandoning a prospective approach can easily leave 2× performance on the table.

    For another example, consider a program that executes n times either action A or B depending on probability p. If p is far from 1/2, branch prediction means it’s better to implement the switch with an if; if p is close to 1/2, branch prediction will fail and a branchless approach will work better. Not only does the relative performance of A and B matter here, but the cost of branch misprediction matters as well, and that might depend not only on the CPU but on the precise code executed.
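
    Here is a sketch of the two shapes of that loop with made-up actions A and B; which one wins depends on how predictable the condition is on real data and on how expensive a misprediction is on the target CPU.

    fn branchy(flags: &[bool], mut acc: u64) -> u64 {
        for &take_a in flags {
            // Cheap when the branch predictor guesses right (p far from 1/2).
            if take_a {
                acc = acc.wrapping_add(3); // action A (made up)
            } else {
                acc = acc.wrapping_mul(2); // action B (made up)
            }
        }
        acc
    }

    fn branchless(flags: &[bool], mut acc: u64) -> u64 {
        for &take_a in flags {
            // Compute both results and select; the compiler will likely lower
            // this to a conditional move, which pays off when p is close to
            // 1/2 and mispredictions would otherwise dominate.
            let a = acc.wrapping_add(3);
            let b = acc.wrapping_mul(2);
            acc = if take_a { a } else { b };
        }
        acc
    }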

    Ideally, you’d have a test bench that plots graphs and finds optimal parameter values automatically, even though getting this working can be draining. This way, running checks all the time becomes cheap and emotionally easier. Even if it takes half an hour, you can still work on something else in parallel.
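
    Even a bare-bones sweep helps. The sketch below times a stand-in workload (chunked summation) across a few chunk sizes; a real bench would repeat runs, plot the curve, and pick the optimum automatically, but this already makes re-measuring after every change cheap.

    use std::time::Instant;

    // Stand-in tunable workload: the parameter being swept is the chunk size.
    fn chunked_sum(data: &[u64], chunk: usize) -> u64 {
        data.chunks(chunk).map(|c| c.iter().sum::<u64>()).sum()
    }

    fn main() {
        let data: Vec<u64> = (0..10_000_000).collect();
        for chunk in [1 << 10, 1 << 14, 1 << 18, 1 << 22] {
            let start = Instant::now();
            let total = chunked_sum(&data, chunk);
            println!("chunk = {chunk:>8}: {:?} (checksum {total})", start.elapsed());
        }
    }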

    The worst examples of incompatible optimizations are those that fail due to external constraints.

    One example is when two LUTs don’t fit in cache together, but do individually. You can sometimes fix this by splitting the computation into multiple passes, where each pass only needs to access helper data that does fit into cache. This does not necessarily mean two passes over all data, consuming 2× memory bandwidth – you can chunk the data and apply two passes on a chunk, which increases performance if the chunk fits into, say, L3. But sometimes that doesn’t work, and then I bash my head against the wall.
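
    A rough sketch of that chunked two-pass shape, with hypothetical LUTs and a made-up chunk size; each pass only needs one table hot in cache, and the intermediate values stay within a chunk small enough to plausibly live in L3.

    // Hypothetical 64Ki-entry LUTs (256 KiB each): either fits in cache
    // alone, but not both together.
    fn two_pass_chunked(input: &[u16], lut_a: &[u32; 65536], lut_b: &[u32; 65536]) -> u64 {
        const CHUNK: usize = 1 << 18; // made-up size; tune so chunk + scratch stay in L3
        let mut scratch = vec![0u32; CHUNK];
        let mut total = 0u64;
        for chunk in input.chunks(CHUNK) {
            // Pass 1: only lut_a needs to stay resident in cache.
            for (dst, &x) in scratch.iter_mut().zip(chunk) {
                *dst = lut_a[x as usize];
            }
            // Pass 2: only lut_b needs to stay resident in cache.
            for (&mid, &x) in scratch.iter().zip(chunk) {
                total = total.wrapping_add((mid ^ lut_b[x as usize]) as u64);
            }
        }
        total
    }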

    Register pressure is even worse, because it is a problem caused purely by the ISA, not the microarchitecture. The hardware has enough registers; they just aren’t exposed to user code. You can try to split data between general-purpose registers and vector registers, and that works as long as you seldom cross the GPR-SIMD boundary, but at that point, you might as well change your profession.

    It doesn’t have to be that way. FPGAs enable you to design your own hardware (kind of, anyway), and alternative approaches like interaction nets have a chance to make software-specified operations as optimal as operations that are usually implemented in hardware. But that’s not the world we live in, no, we live in the world where Intel keeps introducing useful instructions to AVX-512 only to abandon them later, so I need to choose between a CPU with vp2intersect and one with FP16. So not only do you have to benchmark different code, you also have to test it on different CPUs to decide which EC2 instance to deploy it on.

    The only advice I have for this is to try to achieve the best possible result, even if it’s worse than the theoretical optimum. Reduce the size of one of the LUTs by moving some calculations to runtime, rewrite a chunk of code in assembly to manage registers better, and when all else fails, accept that you have to make a choice.

    “Compilers are smarter than humans” is a common mantra. It couldn’t be further from the truth. Any developer can see that the following two snippets are (supposed to be) equivalent:

    let condition1 = HashSet::from([a, b]).contains(&c);
    let condition2 = a == c || b == c;
    

    But compilers aren’t going to optimize the former into the latter (JVM’s JIT, in some cases, excluded). They don’t reason in abstractions, and they certainly don’t reason in your auxiliary abstractions. This doesn’t just apply to high-level code: LLVM does not even understand that bitwise AND is an intersection.

    No, compilers excel at something different from optimization: they turn higher-level languages into zero-cost abstractions, but there’s no ingenuity. Compilers are optimal transpilers – barring a few exceptions, they codegen exactly what you wrote in the source. They allow you to write assembly with the syntax and capabilities of Rust or C++, but don’t you dare forget that the arr.map(|x| x / c) you wrote will invoke idiv without performing obvious libdivide-style precalculations.
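
    To make that concrete, compare a literal divisor with a runtime one. This is just a sketch – exact codegen depends on the compiler version and target – but the shape is consistent: the literal gets strength-reduced, while the loop-invariant runtime divisor does not.

    fn divide_by_three(arr: &[u32]) -> Vec<u32> {
        // Literal divisor: typically lowered to a multiply by a magic constant
        // plus shifts, with no division instruction at all.
        arr.iter().map(|&x| x / 3).collect()
    }

    fn divide_by(arr: &[u32], c: u32) -> Vec<u32> {
        // Runtime divisor: typically one div per element; the compiler does not
        // precompute a reciprocal outside the loop the way libdivide would.
        arr.iter().map(|&x| x / c).collect()
    }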

    Sometimes I wonder if -O2 should be renamed to -fzero-cost-abstractions.

    This might make it sound like I’m arguing that compilers are only good at plumbing, but they aren’t even good at that. For example, they can be terrible at register allocation, of all things. If a rarely executed chunk of code needs many registers, GCC satisfies that need by spilling variables accessed by the hot path on every iteration, not only on entry to the cold path. Clang handles this simple example better but fails in more complicated cases.

    The lesson is never to trust the compiler blindly. Always check the disassembly, consult an instruction-level profiler like perf, and don’t be afraid to use this information to nudge the compiler to do the right thing if it leads to tangible improvements.

    Despite obvious shortcomings, compilers don’t allow you to correct them on things they get wrong. There is no way to provide both optimized assembly and equivalent C code and let the compiler use the former in the general case and the latter in special cases. Custom calling conventions are mostly unsupported, and so is choosing between branchless and branchy code and any other assembly tricks. There are intrinsics, but LLVM and rustc still try to be smart and rewrite them, which sometimes causes pessimizations, leaving no alternative but to add an optimization barrier.
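
    In Rust, the usual barrier is std::hint::black_box, which asks the compiler to treat a value as opaque. It is only a hint, not a guarantee, but it is often enough to keep the optimizer from rewriting the code you are trying to measure or preserve:

    use std::hint::black_box;

    fn workload(data: &[u64]) -> u64 {
        // black_box hides the result from the optimizer, so the computation
        // can't be folded away or "cleverly" rewritten around a measurement.
        black_box(data.iter().copied().map(|x| x.wrapping_mul(3)).sum::<u64>())
    }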

    e-graphs, as popularized by Cranelift, try to tackle this problem, but to my knowledge, there hasn’t been much success in this field. I’m still hopeful, though.

    For x86 processors, uops.info provides timing and port information for each instruction on many Intel and AMD CPUs. Agner Fog wrote a manual on optimization for x86 processors and publishes his own instruction tables. The Intel Software Developer’s Manual contains more than 5000 pages documenting not only the instruction set but many internal workings of their CPUs as well.

    Apple Silicon has nothing like that. I have no goddamn idea how to work with M1+ processors. There’s the Apple Silicon CPU Optimization Guide, which is only 169 pages long and reads like a tutorial you might find on HN, written for novices rather than experts – not something I would be interested in. It contains estimates of latencies and throughputs for some categories of instructions, but there’s frustratingly little tabular data, and it doesn’t mention uop fusion or provide port information. Dougall Johnson’s research is immensely valuable but only covers the M1, not newer CPUs, and it still doesn’t answer many questions.

    Even Apple’s LLVM fork lacks scheduling annotations for Apple Silicon. How am I supposed to write efficient code when Apple doesn’t bother to tune their own compiler? Optimizing code for such a platform is 90% reverse engineering and 10% writing meaningful code – and writing meaningful code is already hard.

    The right fix for this is to commit intellectual property theft, but I’m not allowed to say that, so I won’t. Oops.

    Performance optimization is hard because you have to:

    • Explore dozens of cases manually without losing your mind.
    • Iterate with inadequate tooling. (Profilers and MCA are useful, but they’re still toys that can’t match the underlying complexity.)
    • Jam squares into round holes until they fit. Merge incompatible optimizations.
    • Deal with both corporate greed and cultural apathy.

    It’s not easy by any means, but it’s still something I enjoy doing, even though people often consider anything short of radical improvements a waste of time. To me, a 10% optimization is a form of art, but it’s not just that. Small improvements compound and help form a better user experience, even if no single optimization seems valuable on its own – much like improving data transfer rates has led to structural changes in how we process and utilize information.

    Optimizations save time, and time is the one resource people don’t get enough of.
