Rust – Faster compilation with the parallel front-end in nightly

Original link: https://blog.rust-lang.org/2023/11/09/parallel-rustc.html

Rust recently introduced a new feature called the parallel front-end, which shortens compile times while making better use of hardware resources. Still experimental, it aims to cut compile times for large programs by up to 50% in multi-threaded mode by exploiting fine-grained parallelism, rather than relying only on the interprocess and intraprocess parallelism available in earlier versions of the compiler. However, there is a slight performance drop compared with single-threaded mode. In terms of correctness, both modes are generally reliable, though some known bugs remain due to the added complexity of the multi-threaded approach. Feedback, suggestions, and reports of any bugs encountered are welcome. The team plans to continue developing and refining the parallel front-end to better serve its user community. To keep up with Rust, visit the documentation pages online and follow the project's updates, social media accounts, and RSS feed.

That said, proc macros are an important part of Rust's macro system and need to support a wide range of language features such as loops, conditionals, functions, and control-flow constructs. Removing them would require major changes to the language and ecosystem, affecting a broad range of applications and use cases. It is therefore unlikely that Rust will drastically simplify what its compiler executes, especially given the key role proc macros play in providing a robust and reliable low-level procedural macro system. On the surface, yes, static PIE ELF executables for Rust sound appealing. However, the existing Rust compilation pipeline has proven quite efficient and effective at compiling large, complex codebases on contemporary systems, and addressing the limitations of its current infrastructure is not as simple or straightforward a prospect as it appears at face value. In practice, creating a separate, simple machine-code generator would require developing specialized tools and APIs to accommodate every aspect of the new build process and to manage the interdependencies between inputs and generated outputs. Furthermore, maintaining and updating a custom build of the Rust toolchain for a specific application or use case brings additional complexity, overhead, and dependencies, potentially increasing maintenance costs and complicating upgrade paths, especially with respect to security vulnerabilities and performance bottlenecks. Moreover, it remains unclear whether a significantly simpler toolchain variant could adequately support a sufficiently broad and comprehensive set of standard-library components while still meeting basic requirements in line with good engineering practice. Overall, the trade-offs involved in pursuing this path seem less compelling than investing more effort, time, and resources into optimizing and streamlining the existing Rust build and toolchain ecosystem.

Original text

The Rust compiler's front-end can now use parallel execution to significantly reduce compile times. To try it, run the nightly compiler with the -Z threads=8 option. This feature is currently experimental, and we aim to ship it in the stable compiler in 2024.

Keep reading to learn why a parallel front-end is needed and how it works, or just skip ahead to the How to use it section.

The Compiler Performance Working Group has continually improved compiler performance for several years. For example, in the first 10 months of 2023, there were mean reductions in compile time of 13%, in peak memory use of 15%, and in binary size of 7%, as measured by our performance suite.

However, at this point the compiler has been heavily optimized and new improvements are hard to find. There is no low-hanging fruit remaining.

But there is one piece of large but high-hanging fruit: parallelism. Current Rust compiler users benefit from two kinds of parallelism, and the newly parallel front-end adds a third kind.

Interprocess parallelism can be visualized with Cargo's --timings flag, which produces a chart showing how the crates are compiled. The following image shows the timeline when building ripgrep on a machine with 28 virtual cores.

cargo build --timings output when compiling ripgrep

There are 60 horizontal lines, each one representing a distinct process. Their durations range from a fraction of a second to multiple seconds. Most of them are rustc, and the few orange ones are build scripts. The first twenty processes all start at the same time. This is possible because there are no dependencies between the relevant crates. But further down the graph, parallelism reduces as crate dependencies increase. Although the compiler can overlap compilation of dependent crates somewhat thanks to a feature called pipelined compilation, there is much less parallel execution happening towards the end of compilation, and this is typical for large Rust programs. Interprocess parallelism is not enough to take full advantage of many cores. For more speed, we need parallelism within each process.
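To make the dependency effect concrete, here is a small sketch (hypothetical crate names; a deliberately simplified scheduler — cargo's real one also pipelines dependent crates) that groups crates into "waves" whose members have no dependencies on each other and so can be compiled by independent rustc processes in parallel:

```rust
use std::collections::HashMap;

/// Group crates into "waves": every crate in a wave has all of its
/// dependencies in earlier waves, so a whole wave can be compiled by
/// independent rustc processes in parallel. Assumes an acyclic graph.
fn waves(deps: &[(&str, &[&str])]) -> Vec<Vec<String>> {
    let mut level: HashMap<&str, usize> = HashMap::new();
    while level.len() < deps.len() {
        for &(krate, ds) in deps {
            if level.contains_key(krate) {
                continue;
            }
            // A crate can be scheduled once all its dependencies have levels.
            if ds.iter().all(|d| level.contains_key(d)) {
                let lvl = ds.iter().map(|d| level[d] + 1).max().unwrap_or(0);
                level.insert(krate, lvl);
            }
        }
    }
    let max = level.values().copied().max().unwrap_or(0);
    let mut out = vec![Vec::new(); max + 1];
    for &(krate, _) in deps {
        out[level[krate]].push(krate.to_string());
    }
    out
}

fn main() {
    // A made-up slice of a crate graph, roughly ripgrep-shaped.
    let deps: &[(&str, &[&str])] = &[
        ("libc", &[]),
        ("memchr", &[]),
        ("regex", &["memchr"]),
        ("grep", &["regex", "memchr"]),
        ("ripgrep", &["grep", "libc"]),
    ];
    for (i, wave) in waves(deps).iter().enumerate() {
        println!("wave {}: {:?}", i, wave);
    }
}
```

The leaf crates form a wide first wave, mirroring the twenty simultaneous processes at the start of the chart; later waves narrow as dependencies accumulate.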

The following image shows the output of the Samply profiler measuring rustc as it does a release build of the final crate in Cargo. The image is superimposed with markers that indicate front-end and back-end execution.

Samply output when compiling Cargo, serial

Each horizontal line represents a thread. The main thread is labelled "rustc" and is shown at the bottom. It is busy for most of the execution. The other 16 threads are LLVM threads, labelled "opt cgu.00" through to "opt cgu.15". There are 16 threads because 16 is the default number of codegen units for a release build.

There are several things worth noting.

  • Front-end execution takes 10.2 seconds.
  • Back-end execution takes 6.2 seconds, and the LLVM threads are running for 5.9 seconds of that.
  • The parallel code generation is highly effective. Imagine if all those LLVM threads executed one after another!
  • Even though there are 16 LLVM threads, at no point are all 16 executing at the same time, despite this being run on a machine with 28 cores. (The peak is 14 or 15.) This is because the main thread translates its internal code representation (MIR) to LLVM's code representation (LLVM IR) in serial. This takes a brief period for each codegen unit, and explains the staircase shape on the left-hand side of the code generation threads. There is some room for improvement here.
  • The front-end is entirely serial. There is a lot of room for improvement here.
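The staircase shape can be reproduced with a toy model: one thread serially "translates" each codegen unit and only then hands it to a worker thread that "optimizes" it in parallel, so each worker's start is delayed by all the serial translations before it. The durations below are invented purely for illustration:

```rust
use std::thread;
use std::time::{Duration, Instant};

// Toy model of rustc's serial MIR -> LLVM IR step feeding parallel LLVM
// work: the main thread "translates" each unit in turn, then spawns a
// worker to "optimize" it. Worker start times form a staircase.
fn staircase(units: usize) -> Vec<Duration> {
    let t0 = Instant::now();
    let mut handles = Vec::new();
    let mut starts = Vec::new();
    for _ in 0..units {
        thread::sleep(Duration::from_millis(10)); // serial "translation"
        starts.push(t0.elapsed()); // when this unit's worker can begin
        handles.push(thread::spawn(|| {
            thread::sleep(Duration::from_millis(50)); // parallel "optimization"
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    starts
}

fn main() {
    for (i, s) in staircase(4).iter().enumerate() {
        println!("unit {} worker started at {:?}", i, s);
    }
}
```

Each start time is strictly later than the previous one, which is exactly the left-hand edge of the code generation threads in the profile.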

The parallel front-end uses Rayon to perform compilation tasks using fine-grained parallelism. Many data structures are synchronized by mutexes and read-write locks, atomic types are used where appropriate, and many front-end operations are made parallel. The addition of parallelism was done by modifying a relatively small number of key points in the code. The vast majority of the front-end code did not need to be changed.
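As a rough sketch of that synchronization style (using std::thread in place of Rayon, and a toy "query" standing in for real front-end work), several threads sharing a mutex-protected table plus an atomic counter might look like:

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Illustrative only: a shared, mutex-protected cache of "query" results
// and an atomic counter of how many were actually computed. The real
// front-end shares far richer structures, via Rayon.
fn parallel_queries(inputs: Vec<u64>) -> (usize, u64) {
    let cache = Arc::new(Mutex::new(HashMap::new()));
    let computed = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();
    for n in inputs {
        let cache = Arc::clone(&cache);
        let computed = Arc::clone(&computed);
        handles.push(thread::spawn(move || {
            let mut cache = cache.lock().unwrap();
            // Compute each distinct input at most once, under the lock.
            *cache.entry(n).or_insert_with(|| {
                computed.fetch_add(1, Ordering::Relaxed);
                n * n // stand-in for an expensive analysis
            })
        }));
    }
    let sum = handles.into_iter().map(|h| h.join().unwrap()).sum();
    (computed.load(Ordering::Relaxed), sum)
}

fn main() {
    let (computed, sum) = parallel_queries(vec![2, 3, 2, 3, 4]);
    println!("computed {} unique results, sum {}", computed, sum);
}
```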

When the parallel front-end is enabled and configured to use eight threads, we get the following Samply profile when compiling the same example as before.

Samply output when compiling Cargo, parallel

Again, there are several things worth noting.

  • Front-end execution takes 5.9 seconds (down from 10.2 seconds).
  • Back-end execution takes 5.3 seconds (down from 6.2 seconds), and the LLVM threads are running for 4.9 seconds of that (down from 5.9 seconds).
  • There are seven additional threads labelled "rustc" operating in the front-end. The reduced front-end time shows they are reasonably effective, but the thread utilization is patchy, with the eight threads all having periods of inactivity. There is room for significant improvement here.
  • Eight of the LLVM threads start at the same time. This is because the eight "rustc" threads create the LLVM IR for eight codegen units in parallel. (For seven of those threads that is the only work they do in the back-end.) After that, the staircase effect returns because only one "rustc" thread does LLVM IR generation while seven or more LLVM threads are active. If the number of threads used by the front-end was changed to 16 the staircase shape would disappear entirely, though in this case the final execution time would barely change.

The compiler uses the jobserver protocol to limit the number of threads it creates. If a lot of interprocess parallelism is occurring, intraprocess parallelism will be limited appropriately, and the number of threads will not exceed the number of cores.
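An in-process sketch of this token-based limiting (the real jobserver protocol is cross-process, using pipes or semaphores; everything below is illustrative): a job must take a token from a shared pool before running and return it afterwards, so no more than `tokens` jobs ever run at once.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

// Toy jobserver: a channel holds `tokens` permits; each worker must
// receive one before "working" and sends it back when done. Returns the
// peak number of jobs that were ever running simultaneously.
fn run_limited(jobs: usize, tokens: usize) -> usize {
    let (tx, rx) = mpsc::channel();
    for _ in 0..tokens {
        tx.send(()).unwrap(); // fill the token pool
    }
    let rx = Arc::new(Mutex::new(rx));
    let running = Arc::new(AtomicUsize::new(0));
    let peak = Arc::new(AtomicUsize::new(0));
    let mut handles = Vec::new();
    for _ in 0..jobs {
        let (tx, rx) = (tx.clone(), Arc::clone(&rx));
        let (running, peak) = (Arc::clone(&running), Arc::clone(&peak));
        handles.push(thread::spawn(move || {
            rx.lock().unwrap().recv().unwrap(); // acquire a token
            let now = running.fetch_add(1, Ordering::SeqCst) + 1;
            peak.fetch_max(now, Ordering::SeqCst); // track peak concurrency
            thread::sleep(Duration::from_millis(5)); // the "work"
            running.fetch_sub(1, Ordering::SeqCst);
            tx.send(()).unwrap(); // return the token
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
    peak.load(Ordering::SeqCst)
}

fn main() {
    println!("peak concurrency: {}", run_limited(10, 3));
}
```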

The nightly compiler is now shipping with the parallel front-end enabled. However, by default it runs in single-threaded mode and won't reduce compile times.

Keen users can opt into multi-threaded mode with the -Z threads option. For example:

$ RUSTFLAGS="-Z threads=8" cargo build --release

Alternatively, to opt in from a config.toml file (for one or more projects), add these lines:

[build]
rustflags = ["-Z", "threads=8"]

It may be surprising that single-threaded mode is the default. Why parallelize the front-end and then run it in single-threaded mode? The answer is simple: caution. This is a big change! The parallel front-end has a lot of new code. Single-threaded mode exercises most of the new code, but excludes the possibility of threading bugs such as deadlocks that can affect multi-threaded mode. Even in Rust, parallel programs are harder to write correctly than serial programs. For this reason the parallel front-end also won't be shipped in beta or stable releases for some time.

Our measurements on real-world code show that compile times can be reduced by up to 50%, though the effects vary widely and depend on the characteristics of the code and its build configuration. For example, dev builds are likely to see bigger improvements than release builds because release builds usually spend more time doing optimizations in the back-end. A small number of cases compile more slowly in multi-threaded mode than single-threaded mode. These are mostly tiny programs that already compile quickly.

We recommend eight threads because this is the configuration we have tested the most and it is known to give good results. Values lower than eight will see smaller benefits, but are appropriate if your hardware has fewer than eight cores. Values greater than eight will give diminishing returns and may even give worse performance.

If a 50% improvement seems low when going from one to eight threads, recall from the explanation above that the front-end only accounts for part of compile times, and the back-end is already parallel. You can't beat Amdahl's Law.
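Amdahl's Law can be made concrete with a short calculation. The parallelizable fraction p below is a loose illustration taken from the serial profile above (front-end 10.2s of a 16.4s compile), treating the already-parallel back-end as a fixed cost:

```rust
// Amdahl's Law: with a parallelizable fraction `p` of the work spread
// over `n` threads, the overall speedup is 1 / ((1 - p) + p / n).
fn amdahl_speedup(p: f64, n: f64) -> f64 {
    1.0 / ((1.0 - p) + p / n)
}

fn main() {
    // Illustrative numbers only: front-end was ~10.2s of a ~16.4s build.
    let p = 10.2 / 16.4;
    println!("ideal speedup with 8 threads: {:.2}x", amdahl_speedup(p, 8.0));
    // Even with unlimited threads, the remaining serial work bounds it:
    println!("upper limit: {:.2}x", amdahl_speedup(p, f64::INFINITY));
}
```

With these numbers the ideal 8-thread speedup is a little over 2x for the whole compile, which is why a 50% wall-clock reduction is already close to the ceiling.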

Memory usage can increase significantly in multi-threaded mode. We have seen increases of up to 35%. This is unsurprising given that various parts of compilation, each of which requires a certain amount of memory, are now executing in parallel.

If you encounter problems with the parallel front-end, check the issues marked with the "WG-compiler-parallel" label. If your problem does not match any of the existing issues, please file a new issue.

For more general feedback, please start a discussion on the wg-parallel-rustc Zulip channel. We are particularly interested to hear the performance effects on the code you care about.

We are working to improve the performance of the parallel front-end. As the graphs above showed, there is room to improve the utilization of the threads in the front-end. We are also ironing out the remaining bugs in multi-threaded mode.

We aim to stabilize the -Z threads option and ship the parallel front-end running by default in multi-threaded mode on stable releases in 2024.

The parallel front-end has been under development for a long time. It was started by @Zoxc, who also did most of the work for several years. After a period of inactivity, the project was revived this year by @SparrowLii, who led the effort to get it shipped. Other members of the Parallel Rustc Working Group have also been involved with reviews and other activities. Many thanks to everyone involved.
