C++: Zero-cost static initialization

Original link: https://cofault.com/zero-cost-static.html

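## Zero-cost static variables in C++: a deep dive

This article explores how to optimise the initialisation of static variables in C++, with the goal of matching the performance of file-scope statics. Standard C++ static initialisation is convenient, but it incurs overhead from guard variables and synchronisation (such as `__cxa_guard_acquire()`) that ensure thread-safe, one-time initialisation. The author proposes a technique built on a little-known feature of UNIX linkers: the `__start_SECNAME` and `__stop_SECNAME` symbols created for output sections. By defining placeholder objects in a dedicated section (for example `STATIC_Bar`) and then initialising the instances in place during global static initialisation, the run-time guard check can be eliminated. The first attempt runs into section attribute conflicts when used in inline functions. The solution is to use the embedded assembler (`__asm__`) with the `.pushsection` directive to control section attributes and symbol naming precisely, with `__COUNTER__` ensuring uniqueness. The generated assembly then accesses the static variables directly, with no initialisation overhead. Although complex, the approach shows a path to truly zero-cost statics in C++; it still needs further work to handle constructor arguments and complex type names.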

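A Hacker News discussion revolves around a recent article on "zero-cost static initialisation" in C++. The technique aims to optimise the initialisation of static variables, but commenters point out some potential drawbacks. While it is effective on the "hot path" (frequently executed code), it is vulnerable to the static initialisation order fiasco (SIOF), a common problem with static initialisation. Performance varies by architecture: ARM pays for a memory barrier on the atomic load, while x86 does not. Users suggest alternatives such as `constinit`, `construct_at`, or disabling thread-safe statics (`-fno-threadsafe-statics`). One key point is that once initialisation has completed, the access overhead is small, involving only a jump instruction rather than repeated lock calls. Others advocate explicit lazy initialisation with a sentinel value for potential gains, especially when the compiler cannot optimise the code layout effectively.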

Original text

Zero-cost statics in C++

"Diligence overcomes everything!"

K. Prutkov, Thoughts and Aphorisms, I, 84

In C and C++ a static variable can be defined in a function scope:

int foo() {
        static int counter = 1;
        printf("foo() has been called %i times.\n", counter++);
        ...
}

Technically, this defines counter as an object of static storage duration that is allocated not within the function activation frame (which is typically on the stack, but can be on the heap for a coroutine), but as a global object. This is often used to shift computational cost out of the hot path, by precomputing some state and storing it in a static object.

When exactly is a static object initialised?

For C this question is vacuous, because the initialiser must be a compile-time constant, so the actual value of the static object is embedded in the compiled binary and is always valid.

C++ has a bizarrely complicated taxonomy of initialisations. There is static initialisation, which roughly corresponds to C initialisation, subdivided into constant-initialisation and zero-initialisation. Then there is dynamic initialisation, further divided into unordered, partially-ordered and ordered categories. None of these, however, captures our case: for block-local variables, the Standard has a special sub-section in "Declaration statement" [stmt.dcl.4]:

Dynamic initialization of a block-scope variable with static storage duration or thread storage duration is performed the first time control passes through its declaration; such a variable is considered initialized upon the completion of its initialization. If the initialization exits by throwing an exception, the initialization is not complete, so it will be tried again the next time control enters the declaration. If control enters the declaration concurrently while the variable is being initialized, the concurrent execution shall wait for completion of the initialization. If control re-enters the declaration recursively while the variable is being initialized, the behavior is undefined.
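The quoted rules can be observed with a small program. The sketch below is only an illustration: Flaky, attempts, flaky_value() and try_flaky() are hypothetical names, not part of the article's code. The constructor throws on its first run, so the initialisation is retried on the next call and then never again.

```cpp
#include <stdexcept>

static int attempts = 0; /* How many times the constructor has started. */

struct Flaky {
        Flaky() : var(1) {
                if (++attempts == 1)
                        throw std::runtime_error("first attempt fails");
        }
        int var;
};

int flaky_value() {
        static Flaky f; /* Initialised when control first passes through here. */
        return f.var + 1;
}

/* Returns true if the call completed, false if the initialisation threw. */
bool try_flaky() {
        try {
                flaky_value();
                return true;
        } catch (const std::runtime_error &) {
                return false;
        }
}
```

Before the first call attempts is 0. The first call to try_flaky() fails, the second one constructs f, and all later calls leave attempts at 2.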

For example in

struct Bar {
        Bar() : var(1) {}
        int var;
};

int foo(int x) {
        static Bar b{};
        return b.var + 1;
}

the constructor for b should be called exactly once when foo() is called the first time. This initialisation semantics is very close (sans the exceptions part) to pthread_once(). It is clear that the compiler must add some sort of an internal flag to check whether the initialisation has already been performed and some synchronisation object to serialise concurrent calls to foo() [godbolt]:

foo(int):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16
        mov     DWORD PTR [rbp-4], edi
        movzx   eax, BYTE PTR guard variable for foo(int)::b[rip]
        test    al, al
        sete    al
        test    al, al
        je      .L3
        mov     edi, OFFSET FLAT:guard variable for foo(int)::b
        call    __cxa_guard_acquire
        test    eax, eax
        setne   al
        test    al, al
        je      .L3
        mov     edi, OFFSET FLAT:foo(int)::b
        call    Bar::Bar() [complete object constructor]
        mov     edi, OFFSET FLAT:guard variable for foo(int)::b
        call    __cxa_guard_release
.L3:
        mov     eax, DWORD PTR foo(int)::b[rip]
        add     eax, 1
        leave
        ret

This corresponds roughly to the following code:

int foo(int x) {
        static Bar b{};
        static std::atomic<int> __b_guard = 0;
        if (__cxa_guard_acquire(&__b_guard) != 0) {
                new (&b) Bar{}; /* Construct b in-place. */
                __cxa_guard_release(&__b_guard);
        }
        return b.var + 1;
}

Here __b_guard (guard variable for foo(int)::b in assembly) is the flag variable added by the compiler. __cxa_guard_acquire() is a surprisingly complex function, which includes its own synchronisation mechanism implemented directly on top of the raw Linux futex syscall.

Even after the static variable has been initialised, accessing it is not free: every call still loads and tests the guard variable before touching b (as the listing above shows, the call to __cxa_guard_acquire() is skipped only once the guard is set), and that load needs acquire semantics, the equivalent of atomic_load_explicit(&__b_guard, memory_order::acquire). On ARM, such an atomic load incurs a memory barrier, a fairly expensive operation.
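To make the per-call cost concrete, here is a hand-written approximation of a guarded block-scope static. It is a sketch under stated assumptions, not the actual runtime implementation: it uses std::mutex where the runtime uses its own lock, and guarded_bar(), storage, instance, init_lock and bar_constructions are hypothetical names.

```cpp
#include <atomic>
#include <mutex>
#include <new>

struct Bar {
        Bar() : var(1) {}
        int var;
};

namespace {
alignas(Bar) unsigned char storage[sizeof(Bar)]; /* Raw, zero-filled memory. */
std::atomic<Bar *> instance{nullptr};            /* Plays the role of the guard. */
std::mutex init_lock;                            /* Serialises first-time callers. */
}

int bar_constructions = 0; /* Only for observing the behaviour. */

Bar &guarded_bar() {
        /* This acquire load is paid on every call, even after initialisation. */
        Bar *p = instance.load(std::memory_order_acquire);
        if (p == nullptr) {
                std::lock_guard<std::mutex> hold(init_lock);
                p = instance.load(std::memory_order_relaxed);
                if (p == nullptr) {
                        p = new (storage) Bar; /* Construct in-place. */
                        ++bar_constructions;
                        instance.store(p, std::memory_order_release);
                }
        }
        return *p;
}
```

The slow path runs once; the acquire load and the branch on the fast path are what the rest of the article sets out to remove.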

Can this additional cost be reduced? Yes, in fact it can be completely eliminated, making block-level static variables exactly as efficient as file-level ones. For this we need a certain old, but little-known feature of UNIX linkers. From the GNU binutils documentation (beware that in old versions the terminating symbol is mistakenly referred to as __end_SECNAME):

If an output section’s name is the same as the input section’s name and is representable as a C identifier, then the linker will automatically PROVIDE two symbols: __start_SECNAME and __stop_SECNAME, where SECNAME is the name of the section. These indicate the start address and end address of the output section respectively. Note: most section names are not representable as C identifiers because they contain a ‘.’ character.

(The Solaris linker calls them "Encapsulation Symbols".)
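A minimal sketch of this linker feature, assuming an ELF target with GCC or clang and a GNU-compatible linker. The section name DEMO_INTS and all identifiers below are hypothetical and chosen only for illustration.

```cpp
/* Place a variable in DEMO_INTS; "used" keeps the compiler from dropping it. */
#define IN_DEMO_SECTION __attribute__((section("DEMO_INTS"), used))

static int first  IN_DEMO_SECTION = 10;
static int second IN_DEMO_SECTION = 20;
static int third  IN_DEMO_SECTION = 12;

/* Provided by the linker because DEMO_INTS is a valid C identifier. */
extern "C" int __start_DEMO_INTS[];
extern "C" int __stop_DEMO_INTS[];

/* Walk the section as if it were an array of int. */
static int sum_range(const int *start, const int *stop) {
        int sum = 0;
        for (const int *p = start; p < stop; ++p)
                sum += *p;
        return sum;
}

int sum_demo_section() {
        return sum_range(__start_DEMO_INTS, __stop_DEMO_INTS);
}
```

Nothing in the source lists the three variables together; the linker gathers them into one contiguous range, and sum_demo_section() adds up whatever ended up there.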

The idea is the following: instead of defining a block-level static instance of Bar, define a trivially-initialisable object of a size sufficient to hold an instance of Bar in a dedicated section STATIC_Bar, via (more or less portable) __attribute__((section)). Only such place-holder objects and nothing else are placed in this section. Then, during global static initialisation, scan the resulting array of place-holder objects from __start_STATIC_Bar to __stop_STATIC_Bar and initialise Bar instances in-place. Assuming that functions where static Bars are defined are not themselves called during global static initialisation, this would initialise everything correctly: by the time foo() is called, its b has already been initialised.

Something like this:

#include <stdio.h>
#include <new> /* For placement new. */

#define FAST_STATIC(T)                                                                    \
*({                                                                                       \
        struct placeholder {                                                              \
            alignas(T) char buf[sizeof(T)];                                               \
        };                                                                                \
        static constinit placeholder ph __attribute__((section ("STATIC_" #T))) {{}};     \
        reinterpret_cast<T *>(ph.buf);                                                    \
})

template <typename T> static int section_init(T *start, T *stop)
{
        for (T *s = start; s < stop; ++s)
            new (s) T; /* Construct in-place. */
        return 0;
}

#define FAST_STATIC_INIT(T)                                     \
extern "C" T __start_STATIC_ ## T;                              \
extern "C" T __stop_STATIC_ ## T;                               \
static int _init_ ## T = section_init<T>(&__start_STATIC_ ## T, \
                                         &__stop_STATIC_ ## T);

struct Bar {
        Bar() : var(1) {}
        int var;
};

int foo(int x) {
        Bar &b0 = FAST_STATIC(Bar);
        Bar &b1 = FAST_STATIC(Bar);
        return b0.var + b1.var + 1;
}

FAST_STATIC_INIT(Bar);

int main(int argc, char **argv) {
        return printf("%i\n", foo(argc)); /* Prints "3". */
}

Check the resulting assembly [godbolt]:

foo(int)::ph:
        .zero   4
foo(int)::ph:
        .zero   4
foo(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        mov     QWORD PTR [rbp-8], OFFSET FLAT:foo(int)::ph
        mov     QWORD PTR [rbp-16], OFFSET FLAT:foo(int)::ph
        mov     rax, QWORD PTR [rbp-8]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rbp-16]
        mov     eax, DWORD PTR [rax]
        add     eax, edx
        add     eax, 1
        pop     rbp
        ret

Voilà! The calls to __cxa_guard_acquire() are gone, yet b0 and b1 are initialised before foo() is called, just as we want. But not so fast, it's C++.

Let's add another static Bar instance, this time in an inline function:

int inline baz(int x) {
        Bar &b = FAST_STATIC(Bar);
        return b.var * x;
}

GCC reports [godbolt]:

<source>:9:38: error: 'ph' causes a section type conflict with 'ph' in section 'STATIC_Bar'

(clang works fine [godbolt], by the way.)

The problem is that, in addition to a name, sections output by the compiler also have attributes. The compiler selects the attributes based on the properties of the scope where the symbol (to which __attribute__((section)) is applied) is defined. Inline functions force a different attribute selection (as do template members), and the linker ends up with multiple sections with the same name, but conflicting attributes. See stackoverflow for details.

As it is, FAST_STATIC() is usable, but section attribute conflicts put awkward restrictions on its applicability. Is this the best we can do? For some time I thought that it was, but then I realised that there is another way to specify the section in which the variable is located: the .pushsection directive of the embedded assembler (do not be afraid, we will use only the portable part).

If you do something like

__asm__(".pushsection STATIC_Bar,\"aw\",@progbits\n" \
        ".quad " symbol "\n"                         \
        ".popsection\n")

then the address of the symbol is placed in the STATIC_Bar section with the specified attributes.
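The directive can be tried in isolation. The sketch below assumes a 64-bit ELF target (so that .quad holds one pointer) with the GNU assembler syntax; the section name DEMO_PTRS, the globals demo_a and demo_b, and the added .balign are illustrative choices, not taken from the article.

```cpp
/* Two ordinary globals with unmangled names, so the assembler can see them. */
extern "C" {
int demo_a = 40;
int demo_b = 2;
}

/* Record the address of each global in the DEMO_PTRS section. */
__asm__(".pushsection DEMO_PTRS,\"aw\",@progbits\n"
        ".balign 8\n"
        ".quad demo_a\n"
        ".quad demo_b\n"
        ".popsection\n");

/* Section bounds supplied by the linker. */
extern "C" int *__start_DEMO_PTRS[];
extern "C" int *__stop_DEMO_PTRS[];

/* Follow every recorded pointer and add up the values it points to. */
int sum_recorded() {
        int sum = 0;
        for (int **p = __start_DEMO_PTRS; p < __stop_DEMO_PTRS; ++p)
                sum += **p;
        return sum;
}
```

Because the section attributes are spelled out in the directive, they no longer depend on the scope in which the variable is defined.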

All we need is something like

#define FAST_STATIC(T)                                          \
*({                                                             \
        struct placeholder {                                    \
            alignas(T) char buf[sizeof(T)];                     \
        };                                                      \
        static constinit placeholder ph {{}};                   \
        __asm__(".pushsection STATIC_" #T ",\"aw\",@progbits\n" \
                ".quad ph\n"                                    \
                ".popsection\n");                               \
        reinterpret_cast<T *>(ph.buf);                          \
})

and we are good (section_init() needs to be fixed a bit, because STATIC_Bar now contains pointers, not instances). But not so fast, it's C++. This does not even compile [godbolt]:

ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x0): undefined reference to `ph'
ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x8): undefined reference to `ph'
ld: /tmp/ccZRzXXj.o:(STATIC_Bar+0x10): undefined reference to `ph'
collect2: error: ld returned 1 exit status
Execution build compiler returned: 1

When you define static constinit placeholder ph, the actual name the compiler uses for the symbol is not ph; it is the mangled version of something like foo(int)::ph that we saw in the assembly listing above. There is no ph for .quad ph to resolve to.

OK. Are we stuck now? In fact not. You can instruct the compiler to use a particular symbol name, instead of the mangled one. With

        int foo asm ("bar") = 2;

the compiler will use "bar" as the symbol name for foo (both gcc and clang support this).

Of course if we just do

        static constinit placeholder ph asm("ph") {{}};

we fall into the opposite trap of having multiple definitions for "ph". We need to define unique names for our symbols, but there is a more or less standard trick for this, based on the __COUNTER__ macro. We also need a couple of, again standard, macros for concatenation and stringification. The final version looks like this:

#define CAT0(a, b) a ## b
#define CAT(a, b) CAT0(a, b)

#define STR0(x) # x
#define STR(x) STR0(x)

#define FAST_STATIC_DO(T, id)                                   \
*({                                                             \
        struct placeholder {                                    \
            alignas(T) char buf[sizeof(T)];                     \
        };                                                      \
        static constinit placeholder id asm(STR(id)) {{}};      \
        __asm__(".pushsection STATIC_" #T ",\"aw\",@progbits\n" \
                ".quad " STR(id) "\n"                           \
                ".popsection\n");                               \
        reinterpret_cast<T *>(id.buf);                          \
})

#define FAST_STATIC(T) FAST_STATIC_DO(T, CAT(ph_, __COUNTER__))

template <typename T> static int section_init(T **start, T **stop)
{
        for (T **s = start; s < stop; ++s)
                new (*s) T; /* Construct in-place. */
        return 0;
}

#define FAST_STATIC_INIT(T)                                      \
extern "C" T *__start_STATIC_ ## T;                              \
extern "C" T *__stop_STATIC_ ## T;                               \
static int _init_ ## T = section_init<T>(&__start_STATIC_ ## T,  \
                                         &__stop_STATIC_ ## T);
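The two-level CAT/STR macros matter because # and ## act on their arguments as written, before macro expansion; the extra level forces __COUNTER__ to be expanded first. A small sketch of the difference, where DEMO_ID and the variable names are hypothetical and __COUNTER__ is assumed to be available as a compiler extension (it is in GCC and clang):

```cpp
#include <string>

#define CAT0(a, b) a ## b
#define CAT(a, b) CAT0(a, b)

#define STR0(x) # x
#define STR(x) STR0(x)

#define DEMO_ID 7 /* Hypothetical macro, only for this illustration. */

/* STR0 stringifies the argument as written; STR expands it first. */
const std::string unexpanded = STR0(DEMO_ID);        /* "DEMO_ID" */
const std::string expanded = STR(DEMO_ID);           /* "7" */
const std::string pasted = STR(CAT(ph_, DEMO_ID));   /* "ph_7" */

/* Each use of __COUNTER__ expands to the next integer. */
const int first_id = __COUNTER__;
const int second_id = __COUNTER__;

/* These two definitions compile only because the generated names differ. */
int CAT(ph_, __COUNTER__) = 1;
int CAT(ph_, __COUNTER__) = 2;
```

With CAT0 used directly, both definitions would be named ph___COUNTER__ and the second one would be a redefinition error.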

The resulting assembly for foo() and the inline function (named foo_inline() in this listing) [godbolt] accesses statics directly:

foo(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        mov     QWORD PTR [rbp-8], OFFSET FLAT:ph_0
        mov     QWORD PTR [rbp-16], OFFSET FLAT:ph_1
        mov     rax, QWORD PTR [rbp-8]
        mov     edx, DWORD PTR [rax]
        mov     rax, QWORD PTR [rbp-16]
        mov     eax, DWORD PTR [rax]
        add     eax, edx
        add     eax, 1
        pop     rbp
        ret
foo_inline(int):
        push    rbp
        mov     rbp, rsp
        mov     DWORD PTR [rbp-20], edi
        mov     QWORD PTR [rbp-8], OFFSET FLAT:ph_2
        mov     rax, QWORD PTR [rbp-8]
        mov     eax, DWORD PTR [rax]
        imul    eax, DWORD PTR [rbp-20]
        pop     rbp
        ret

Finally we won!

"It happens that diligence overcomes even reason"

K. Prutkov, Thoughts and Aphorisms, II, 27

P.S. The actual implementation requires more bells and whistles. Parameters need to be passed to constructors; they can be stored within the placeholder. Type names are not necessarily valid identifiers (think A::B::foo<T>), so the section name needs to be a separate parameter, etc., but the basic idea should be clear.
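One way the "parameters stored within the placeholder" idea might look is sketched below. This is an assumption about the shape of such an extension, not the author's implementation: Widget, WidgetSlot, slots and widget() are hypothetical, and a plain array stands in for the dedicated section so that the example stays portable.

```cpp
#include <cstddef>
#include <new> /* For placement new. */

struct Widget {
        explicit Widget(int scale) : scale(scale) {}
        int scale;
};

/* Hypothetical placeholder: raw storage plus the saved constructor argument. */
struct WidgetSlot {
        alignas(Widget) unsigned char buf[sizeof(Widget)];
        int ctor_arg;
};

/* Stand-in for the section contents; a real version would walk the range
   from __start_SECNAME to __stop_SECNAME instead. */
static WidgetSlot slots[] = {{{}, 2}, {{}, 5}};

static int init_slots() {
        for (WidgetSlot &s : slots)
                new (s.buf) Widget(s.ctor_arg); /* Construct in-place. */
        return 0;
}

/* Runs during static initialisation, before main(). */
static int slots_ready = init_slots();

Widget &widget(std::size_t i) {
        return *reinterpret_cast<Widget *>(slots[i].buf);
}
```

The argument lives next to the storage, so the initialisation loop needs no knowledge of individual call sites.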

P.P.S. I have a similar story about optimising access to thread-local variables, involving C++20 constinit and __attribute__((tls_model("initial-exec"))).
