Clang Bytecode Interpreter Update

Original link: https://developers.redhat.com/articles/2025/10/15/clang-bytecode-interpreter-update

## Clang Bytecode Interpreter: A Year in Review

Over the past year, the Clang bytecode interpreter project, which evaluates constant expressions at compile time, has made significant progress. While there was no single breakthrough, roughly 500 commits solidified the implementation and improved performance. Test suite failures dropped from 155 to 90, and a working implementation of `__builtin_constant_p` is now available.

Key improvements focused on optimization and efficiency. The developers learned to be cautious with aggressive optimizations that could compromise diagnostic accuracy, but successfully implemented optimized reads of known values, reducing overhead by accessing variable values directly rather than through pointers.

Benchmarks show that the bytecode interpreter roughly halves compile time on complex tasks such as processing the sqlite3 sources. It also significantly outperforms both the current Clang interpreter and GCC on benchmarks involving dynamic memory allocation, especially when the allocated memory is never used.

Current work addresses the remaining features and bugs, tracked via the `clang:bytecode` label on GitHub. The project welcomes community contributions and bug reports, and there is an active discussion on the LLVM Discourse.

The Hacker News discussion centers on Red Hat's update about the Clang bytecode interpreter. The core of the conversation explores the advantages of interpreting C++ at the abstract syntax tree (AST) level versus using LLVM's intermediate representation (IR). Users asked how much AST support a language needs to effectively precompute `constexpr` values, a key ingredient for C++ reflection. One commenter suggested that a full AST interpreter could also benefit static analysis tools such as Clang's static analyzer. There was a wish for broader availability of C++ as a language, mimicking the approach of compiled, managed languages that ship both a compiler *and* an interpreter. A minor usability issue was also noted: the linked page breaks the browser's back button, requiring several clicks to leave.

Original article

It’s October again, so let me tell you what happened with the clang bytecode interpreter this year. In case this is the first time you've encountered this topic: this is a project for a bytecode interpreter in clang to evaluate constant expressions at compile time. This work is already available in all recent clang versions when -fexperimental-new-constant-interpreter is passed.
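For example, to compile a file with the bytecode interpreter enabled (the file name here is just a placeholder):

$ clang -std=c++20 -fexperimental-new-constant-interpreter -c example.cpp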

Previous articles in this series:

According to Git, there were roughly 500 commits since my last blog post. While there was no huge new feature or breakthrough that I can remember now, the implementation got a lot more solid. I’ve also done a good chunk of performance work this year. 

While we still had approximately 155 failures in the clang test suite last year, we are now down to 90. You can track the progress here. We now have an actual working implementation of __builtin_constant_p. And while it does disagree with the current interpreter in a few edge cases, it should be enough for all real-world use cases (see the section about libc++ testing).
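As a quick illustration of what __builtin_constant_p reports (a hypothetical example of mine, not one from the test suite):

static_assert(__builtin_constant_p(3 * 4));   // arithmetic on literals folds to a constant

int global;
static_assert(!__builtin_constant_p(global)); // a mutable global's value is not a compile-time constant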

In the remainder of this blog post, let's have a more detailed look at a few of the changes that happened in the last year.

Optimizing reads from known sources

When looking at constant expressions (and especially functions), one might have the urge to optimize things away. Let's say we profile the interpreter and figure out that integer additions are particularly inefficient. We might want to check whether the right-hand side of a += compound assignment is 0 and, if so, not emit any bytecode for the addition.

constexpr int test() {
  int a;
  a += 0;
  a += 1;
  return a;
}
static_assert(test() == 1);

Not doing the first addition here would be faster of course, but if you look closely, you notice that a is left uninitialized, so the first addition already fails. Not emitting it causes the diagnostic to appear for the second addition, which is wrong. You can imagine similar cases for divisions or multiplications by 1, a unary +, or all discarded statements. If they can fail, we need to evaluate them and diagnose failure.
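To make the discarded-statement case concrete, here is a hypothetical example of mine: the cast to void discards the result, but the division still has to be evaluated so the undefined behavior can be diagnosed.

constexpr int divide(int a, int b) {
  (void)(a / b); // discarded, but dividing by zero is still UB
  return a;
}
static_assert(divide(4, 0) == 4); // must be rejected: the division by zero has to be diagnosed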

This is only one example, but in general we need to be cautious with optimizations at this level, because we need to diagnose all undefined behavior, and the quality of the diagnostic output is important.

Let’s instead look at something we can optimize.

constexpr int G = 12;
constexpr int test(int P) {
  int L = 8;
  return G + L + P;
}
static_assert(test(5) == 25);

This code is pretty straightforward and should work: it adds up the values of a global variable, a local variable, and a parameter. For each of the three values, the abstract syntax tree (AST) that clang generates consists of one node that gets a pointer to the variable and one that loads from that variable. In detail, it looks something like this:

      `-ImplicitCastExpr 0x7d5284257a50 <col:18> 'int' <LValueToRValue>
        `-DeclRefExpr 0x7d5284257a28 <col:18> 'int' lvalue ParmVar 0x7d5284257660 'P' 'int'

Here we see the lvalue for the parameter P and the lvalue-to-rvalue cast that loads the value from the parameter. We used to emit bytecode just like this:

[..]
48     GetPtrGlobal      0
64     LoadPopSint32
72     GetPtrLocal       40
88     LoadPopSint32
96     AddSint32
104    GetPtrParam       0
120    LoadPopSint32
128    AddSint32
136    Destroy           0
152    RetSint32
[..]

For brevity, I've removed some less interesting opcodes at the beginning and at the end of the function. As you can see from the bytecode, we get a pointer (GetPtrGlobal, GetPtrLocal, GetPtrParam) for all three values, followed by a load of a 32-bit signed integer (LoadPopSint32). Those values are added up and returned from the function. This works just fine, but we can do better.

Even before this optimization, we had opcodes to get the value of a variable without the need to first get a pointer to it. Instead of a GetPtrGlobal + LoadPop pair, we can emit just a GetGlobal opcode, which immediately pushes the value of the global variable onto the stack. This reduces traffic on our stack data structure and reduces the number of pointers we create. The latter is particularly important because we always track what memory a pointer points to, just in case that memory goes out of scope.

Because the relevant opcodes already existed, we only needed to do some type checks when generating bytecode and the result looks like this:

[...]
48     GetGlobalSint32    0
64     GetLocalSint32     40
80     AddSint32
88     GetParamSint32     0
104    AddSint32
112    Destroy            0
128    RetSint32
[...]

This is nicer to read, and also more efficient. When evaluating a LoadPop opcode, we don’t know whether the pointer we load from points to a global variable, a local variable, a temporary, a parameter, and so on. We need to be conservative and do all possible checks. If we know where we load from, we can skip unnecessary work. For example, loading a value from an extern variable doesn’t generally work, so we need to check for this. But local variables cannot be extern, so we can skip this check for the GetLocal opcode.
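As a user-level illustration of that extern check (again, a hypothetical example):

extern int E; // defined in some other translation unit

constexpr int read_extern() { return E; }

// read_extern() is a valid function, but calling it in a constant expression
// must be diagnosed: the value of E is unknown at compile time.
// static_assert(read_extern() == 0); // error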

libc++ testing

In January 2025, I started running the libc++ test suite in addition to the clang test suite. The libc++ tests provide more real-world use cases, whereas the clang test suite mostly stress-tests edge cases in the C and C++ language specifications. As such, the initial run of the libc++ test suite resulted in a large number of failures (Figure 1).

Figure 1: clang and libc++ test suite failures with the bytecode interpreter over time.

We initially had over 750 test failures. A large chunk of those was caused by missing support for some string builtins; the libc++ test suite extensively tests std::string, after all. Through those and other tests, such as the ones for std::vector, it also makes heavy use of dynamic memory allocation at compile time.
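For a flavor of what these tests exercise, here is a small hypothetical example: since C++20, std::string can be used in constant expressions, which means operator new and operator delete run during compile-time evaluation.

#include <string>

constexpr bool greet() {
  std::string s = "hello";
  s += ", world";        // may reallocate: a compile-time heap allocation
  return s.size() == 12; // "hello, world" has 12 characters
}
static_assert(greet());  // all allocated memory must be freed before evaluation ends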

Unfortunately, the libc++ test cases are usually exceedingly complex. Fixing them requires some variation of the following steps (see the sketch after the list):

  • Letting clang preprocess the test so the result is independent of include files and most command-line options. This usually results in a file that is far too large; the longest one I saw was around 300KLOC.
  • Using a mixture of manual editing and cvise runs to reduce the test case. If you’ve never had to do something like this: good! Otherwise, cvise is great and you should know how to use it.
  • Debugging the reduced test case to figure out the root cause of the problem.
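For illustration, such a reduction session might look roughly like this (the file names and the check.sh interestingness script are placeholders; the script must exit with status 0 exactly when the file still reproduces the failure):

$ clang -E -P -std=c++23 failing-test.cpp > reduced.cpp
$ cvise check.sh reduced.cpp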

I often managed to reduce the failing test case to a self-contained test of just a few dozen lines. After three months, we reached the milestone of zero test failures.

Because libc++ is in active development, it changes constantly. This means we sometimes see regressions in the number of test failures, but that's okay.

Let's look at an example benchmark using #embed:

constexpr char str[] = {
#embed "sqlite3.c" suffix(,0)
};
consteval unsigned checksum(const char *s) {
  unsigned result = 0;
  for (const char *p = s; *p != '\0'; ++p) {
    result += *p;
  }
  return result;
}
constexpr unsigned C = checksum(str);

This test case simply reads the sqlite3 amalgamation (which is almost 9 MB) into a global char array, then sums up all the character values into the variable C.

Let's time this test with the current interpreter and then with the bytecode interpreter. All runs are using a regular release build of clang with assertions disabled but no other compiler flags set.

$ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000
  Time (mean ± σ):     36.490 s ±  0.361 s    [User: 36.161 s, System: 0.299 s]
  Range (min … max):   35.711 s … 37.611 s    30 runs
$ hyperfine -r 30 -w 3 'bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c embed.cpp -std=c++20 -Wno-c23-extensions -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):     14.799 s ±  0.274 s    [User: 14.484 s, System: 0.303 s]
  Range (min … max):   14.332 s … 15.345 s    30 runs

So the bytecode interpreter only needs roughly 50% of the time that the current interpreter takes. That's good, but let's look at some of the details (Figure 2).

Figure 2: The runtimes of the embed benchmark. The bytecode interpreter takes less than half the time.

Our test case consists of two parts. The first is the declaration of str, which initializes a large char array. That’s not a function call, so the bytecode interpreter doesn’t interpret bytecode here at all; instead, it evaluates the expressions as they come. This is similar to what the current interpreter does, but the bytecode interpreter has some additional overhead. If we look at the runtime for only the initialization, the bytecode interpreter is slower (Figure 3).

Figure 3: The runtimes of the embed benchmark without the function call. The bytecode interpreter is slower.

The difference used to be much worse, but I've optimized the bytecode interpreter to be faster when checking large arrays for initialization.

Heap allocation benchmark

Here's another benchmark that I've been looking at in the past year:

constexpr int x = []() {
  for (unsigned I = 0; I != 10'000; ++I) {
    char *buffer = new char[1024];
    for (unsigned c = 0; c != 1024; ++c)
      buffer[c] = 97 + (c % 26);
    delete[] buffer;
  }
  return 1;
}();

This is a pretty simple but useless test benchmarking dynamic memory allocation. We allocate some memory, fill it with some characters, and then we free the memory again. This is all done in a lambda that gets called immediately. The final output is just a variable with the value 1.

This is a benchmark we can even time against GCC (I was using 14.3.1), so let's do that:

$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
  Time (mean ± σ):     27.792 s ±  0.263 s    [User: 27.732 s, System: 0.036 s]
  Range (min … max):   27.329 s … 28.287 s    30 runs
$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):      4.241 s ±  0.103 s    [User: 4.218 s, System: 0.019 s]
  Range (min … max):    4.070 s …  4.504 s    30 runs
$ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
  Time (mean ± σ):     43.829 s ±  0.308 s    [User: 42.170 s, System: 1.619 s]
  Range (min … max):   43.201 s … 44.556 s    30 runs

Figure 4 illustrates the comparison.

Figure 4: The runtimes of clang with the current interpreter, clang with the bytecode interpreter, and GCC. GCC is the slowest, followed by clang with the current interpreter. The bytecode interpreter is much faster than both of the others.

GCC has an interesting optimization that seems to eliminate the heap allocation altogether if it is unused. If we remove the loop initializing the allocated memory (and increase the allocation size from 1024 to 1024 * 1024), GCC beats both the new and the old clang interpreter.
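Based on that description, the modified benchmark looks roughly like this (my reconstruction, not necessarily the exact file used):

constexpr int x = []() {
  for (unsigned I = 0; I != 10'000; ++I) {
    char *buffer = new char[1024 * 1024]; // bigger, and never written to
    delete[] buffer;
  }
  return 1;
}();

Let's time all three again: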

$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000
  Time (mean ± σ):     1048.113 s ± 19.781 s    [User: 783.382 s, System: 263.909 s]
  Range (min … max):   1008.151 s … 1082.708 s    30 runs
$ hyperfine -r 30 -w 3 'bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter'
Benchmark 1: bin/clang -c new-delete.cpp -std=c++23 -fconstexpr-steps=1000000000 -fexperimental-new-constant-interpreter
  Time (mean ± σ):     439.9 ms ±  19.2 ms    [User: 421.7 ms, System: 17.5 ms]
  Range (min … max):   406.7 ms … 475.7 ms    30 runs
$ hyperfine -r 30 -w 3 '/usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000'
Benchmark 1: /usr/bin/g++ -c new-delete.cpp -std=c++23 -fconstexpr-ops-limit=1000000000
  Time (mean ± σ):     116.8 ms ±  23.8 ms    [User: 101.0 ms, System: 15.7 ms]
  Range (min … max):    91.4 ms … 152.4 ms    30 runs

In this case, clang with the current interpreter is much slower than both GCC and the bytecode interpreter. Note the logarithmic Y axis in Figure 5.

Figure 5: The runtimes of clang with the current interpreter, clang with the bytecode interpreter, and GCC. Both clang with the bytecode interpreter and GCC are vastly faster than clang with the current interpreter.

GCC is almost four times as fast as the bytecode interpreter. Either it removes the heap allocation entirely, or its new and delete expressions are just implemented very efficiently. I have not profiled clang with the current interpreter thoroughly, but I can imagine that zero-initializing the APValue (the data structure clang uses to represent compile-time evaluated values) for the array alone takes that long.

Further work

There are still a few features that need to be implemented, and definitely a few underlying problems to solve. I’ve added a clang:bytecode label on GitHub to track several of them in the upstream LLVM repository.

I'll continue working on this project, and I’ve also asked for help on the LLVM Discourse. Other people have also implemented a lot of the recent vector elementwise builtins. So if you’re interested, you can always try to compile your favorite C or C++ project with -fexperimental-new-constant-interpreter and report all problems (or successes!) you find. Any implementation help is also highly appreciated, of course. 

Part of me hopes that this is the last blog post I write about this topic, but see you next year!
