I've been building some software recently whose performance is very sensitive to the capabilities of
the CPU on which it's running. A portable version of the code does not perform all that well, but we
cannot guarantee the presence of the optional Instruction Set Architecture (ISA) extensions we could use to
speed it up. What to do? That's what we'll be looking at today, mostly for the wildly popular
x86-64 family of processors (but the general techniques apply anywhere).
Compilers are very good at optimising for a particular target CPU microarchitecture, for instance when you
use -march=native (or e.g. -march=znver3). They know, amongst other things, the ISA capabilities
of these CPUs, and they will quietly take advantage of them at the cost of portability.
So the first way to speed up C software is to build for a more recent architecture where the compiler has the tools to speed the code up for you. This won't work for every problem or scenario, but if it's an option for you, it's very easy.
This works surprisingly well on x86-64 because it's now a very mature architecture. But this also means that there's a wide span of capabilities between the original x86-64 CPUs and the CPUs you can buy nowadays. To help make things a bit more digestible, AMD, Intel, Red Hat and SUSE devised microarchitecture levels, with each later level including all the features of the previous ones:
| Level | Contains e.g. | Intel | AMD |
|---|---|---|---|
| x86-64-v1 | (base) | All 64-bit | All 64-bit |
| x86-64-v2 | POPCNT, SSE4.2 | 2008 (Nehalem/Westmere) | 2011 (Bulldozer) |
| x86-64-v3 | AVX2, BMI2 | 2013 (Haswell/Broadwell) | 2015 (Excavator) |
| x86-64-v4 | AVX-512[1] | 2017 (Skylake) | 2022 (Zen 4) |
[1] AVX-512 is not actually one feature, but v4 includes the most useful parts of it.
There are some gotchas I won't dwell on: not all kit released after these dates actually implements these capabilities well. In particular there have been:
- Slow implementations of some instructions (e.g. PEXT/PDEP from BMI2 on AMD before Zen 3)
- Aggressive feature-based market segmentation by Intel:
  - Consumer AVX-512 kit more or less doesn't exist.
  - Lower-cost chips with fewer capabilities.
However, in general, microarchitecture levels give you a good set of baseline capabilities for optimisation. Two ways to use them:
- Build for the lowest common denominator in a fleet (which is probably v3 or v4 by now)
- Build a version for newer processors and a version for older processors.
Obviously the second is less than ideal if you don't control all the hardware you can run
on. Fortunately there's an answer for that (for popular compilers): indirect functions
(IFUNCs). An IFUNC essentially has the dynamic linker run a resolver function at load time which returns the
real function to use according to the hardware available. And the best bit is that, for the general case,
the compiler can even do all the work for you:
```c
[[gnu::target_clones("avx2,default")]] // gcc/glibc and clang
void * my_func(void *data) { ... }
```

Note that the square brackets here are c23 syntax for attributes. The equivalent compiler-specific
version is `__attribute__((target_clones("avx2,default")))`. It's the little things that make c23
great!
This will create two versions of my_func, one with avx2 and one with the default flags. It will
also generate a resolver function in the background for the dynamic linker to run. Calls to the
function will thus be bound to the best version at program startup!
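To make that concrete, here's a minimal compilable sketch (the function is a made-up example; I've used the `__attribute__` spelling so it builds under pre-c23 language modes, and gcc with glibc is assumed, since target_clones leans on IFUNC support):

```c
#include <stddef.h>

// gcc emits an avx2 clone and a default clone of this function, plus an
// IFUNC resolver that picks between them at load time
__attribute__((target_clones("avx2,default")))
long sum(const long *xs, size_t n) {
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += xs[i]; // the avx2 clone is free to vectorise this loop
    return total;
}
```

Callers just call `sum` as normal; the clone selection is invisible to them.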
If you're lucky, this did the trick. If you're slightly less lucky, you may be able to trigger autovectorisation with some small modifications (such as alignment annotations). Sadly that process is finicky and unreliable, and there isn't really space for it in this post.
Sometimes you need to write multiple versions of an algorithm by hand to get the best performance: either you can't get autovectorisation to work (if SIMD is what you're after), or you need to work with some specific intrinsics (as I do for this project).
To take advantage of intrinsics directly, we must provide two versions of an algorithm: a portable version and a version that uses the intrinsics. Here's how we might optimise for AVX2 statically:
```c
#ifdef __AVX2__ // defined by the compiler when AVX2 is supported on the target
#include <immintrin.h> // header with the avx2 intrinsics
void * my_func(void *data) { ... }
#else
void * my_func(void *data) { ... }
#endif
```

With this sort of technique, we can once again support building for targets with specific capabilities, except now with direct access to the intrinsics that can make things faster.
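As a sketch of what that can look like with real intrinsics (the function is a made-up example: the avx2 path sums eight 32-bit lanes per step, the portable path is plain C):

```c
#include <stddef.h>

#ifdef __AVX2__ // the compiler was told the target has AVX2 (e.g. -mavx2, -march=x86-64-v3)
#include <immintrin.h>

int sum_i32(const int *xs, size_t n) {
    __m256i acc = _mm256_setzero_si256();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) // eight 32-bit lanes per iteration
        acc = _mm256_add_epi32(acc, _mm256_loadu_si256((const __m256i *)(xs + i)));
    int lanes[8];
    _mm256_storeu_si256((__m256i *)lanes, acc);
    int total = lanes[0] + lanes[1] + lanes[2] + lanes[3]
              + lanes[4] + lanes[5] + lanes[6] + lanes[7];
    for (; i < n; i++) // leftover tail elements
        total += xs[i];
    return total;
}
#else
int sum_i32(const int *xs, size_t n) {
    int total = 0;
    for (size_t i = 0; i < n; i++)
        total += xs[i];
    return total;
}
#endif
```

Either way the caller sees the same `sum_i32`; only the build flags decide which body they get.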
But we're still building for a specific target and we'd like not to do that. Unfortunately there isn't a portable way to do this, but there are compiler-specific hacks. Here's how we do it for gcc and clang for avx2:
```c
// ask the compiler to enable avx2
#pragma GCC push_options
#pragma GCC target ("avx2")
#pragma clang attribute push \
    (__attribute__((target("avx2"))), apply_to = function)

// now include the header with avx2 enabled
#include <immintrin.h>

// now undo, to stop our portable code requiring avx2
#pragma GCC pop_options
#pragma clang attribute pop

[[gnu::target("avx2")]] // this function must be compiled for avx2
void * my_func_avx2(void *data) { ... }

void * my_func_portable(void *data) { ... }
```

Now we need a way to dispatch between them. As we are limiting ourselves to gcc and clang on x86-64, we can use the compiler-provided runtime platform detection to switch implementations:
```c
void * my_func(void *data) {
    // note: test for "avx2", not "avx" -- my_func_avx2 would crash on an avx-only CPU
    return __builtin_cpu_supports("avx2") ? my_func_avx2(data) : my_func_portable(data);
}
```

We could use IFUNCs instead, although we have to write our own resolver this time:
```c
static void * (*resolve_my_func(void)) (void *) {
    __builtin_cpu_init(); // ifunc resolvers run before this is automatically triggered
    return __builtin_cpu_supports("avx2") ? my_func_avx2 : my_func_portable;
}
void * my_func(void *data) __attribute__ ((ifunc ("resolve_my_func")));
```

Okay, it's a bit gnarly because of the atrocious function pointer syntax in c, but it works. At program startup, this will pick the best version!
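If the syntax offends, naming the function type with a typedef tidies the resolver up considerably. The same resolver again as a self-contained sketch, with stand-in bodies so it compiles on its own:

```c
// stand-ins so the sketch is self-contained; imagine the real versions here
static void *my_func_avx2(void *data)     { return data; }
static void *my_func_portable(void *data) { return data; }

// naming the function type makes the resolver's signature readable
typedef void *my_func_t(void *);

static my_func_t *resolve_my_func(void) {
    __builtin_cpu_init(); // harmless to call even outside an ifunc resolver
    return __builtin_cpu_supports("avx2") ? my_func_avx2 : my_func_portable;
}
```

The ifunc attribute line attaches to this resolver exactly as before.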
At this point, since we're writing our own resolver function, we can provide any logic we like over as many different versions as we like. This makes it possible to handle more complex scenarios, such as working around AMD's BMI2 implementation being slow before Zen 3, or Intel's AVX-512 implementations before Ice Lake aggressively downclocking the CPU when you use ZMM registers. Or probably the scenario you find yourself in, if your luck is anything like mine.
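For example, a resolver that refuses the BMI2 fast path on AMD parts with microcoded PEXT/PDEP might look like this (the function names and bodies are made up; the model names accepted by __builtin_cpu_is are gcc/clang-specific):

```c
// made-up names: a PEXT/PDEP (BMI2) version and a portable fallback,
// with placeholder bodies so the sketch compiles on its own
static unsigned long pack_bits_bmi2(unsigned long v)     { return v; }
static unsigned long pack_bits_portable(unsigned long v) { return v; }

// only take the BMI2 path on CPUs known to implement PEXT/PDEP quickly:
// they are microcoded (very slow) on AMD before Zen 3
static unsigned long (*resolve_pack_bits(void))(unsigned long) {
    __builtin_cpu_init();
    int fast_bmi2 = __builtin_cpu_supports("bmi2")
                 && !__builtin_cpu_is("bdver4")   // Excavator
                 && !__builtin_cpu_is("znver1")   // Zen 1
                 && !__builtin_cpu_is("znver2");  // Zen 2
    return fast_bmi2 ? pack_bits_bmi2 : pack_bits_portable;
}
```

The same shape works for the AVX-512 downclocking case: gate the ZMM-heavy version on the specific models where it's a win.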
musl libc does not (yet) support IFUNCs. It's not a simple feature.
I haven't said a word about windows support. I don't have a windows machine to test on and, in any case, the project I'm doing this for is written in C23, while MSVC, the compiler of choice for windows (outside of WSL), supports most of C11 at best. You'd be forgiven for thinking microslop don't actually want people to port C software to windows!