Counting Words at SIMD Speed

Original link: https://healeycodes.com/counting-words-at-simd-speed

This article traces the step-by-step optimization of a word counting program, from a simple Python implementation to a highly optimized, multi-threaded C version using SIMD instructions. The task is to count the words in a large text file (65 million words, 1 GiB benchmark). The initial Python attempt was slow (89.6 seconds); regular expressions improved it (13.7 seconds); porting to C brought a large speedup (1.205 seconds). Further optimization used ARM NEON SIMD instructions (249 milliseconds), which process multiple data elements at once. Adding threads brought the time down to 181 milliseconds, though the gains were limited by memory bandwidth. The key optimizations were minimizing Python overhead, using tight C loops, and exploiting SIMD for data-level parallelism. SIMD dramatically reduces processing time by performing the same operation on many data elements at once. The final multi-threaded version is roughly 494× faster than the original Python code, processing data at 5.52 GiB/s. The author notes that further improvements are likely bounded by memory bandwidth. All code and benchmarks are available at [healeycodes/counting-words-at-simd-speed](https://github.com/healeycodes/counting-words-at-simd-speed).

## Counting Words at SIMD Speed - Hacker News Summary

A recent Hacker News discussion focused on optimizing word counting speed, prompted by the SIMD-based article published by [healeycodes](https://healeycodes.com/). The core challenge is identifying whitespace characters efficiently. Several optimization techniques were proposed. Notably, users highlighted **PyPy** as a surprisingly effective accelerator for Python scripts, achieving up to a 10× improvement on the benchmark case with no code changes, though PyPy has limited compatibility with C-based packages. Further performance gains were discussed via **SIMD instructions** (such as AVX512BW and NEON) and efficient byte-set searching. Approaches such as **page-aligned reads, Grand Central Dispatch, and `mmap`** were also mentioned, with links to code examples. A historical reference to Michael Abrash's 1991 word-counting performance contest was shared, offering valuable perspective. The conversation also covered alternative approaches, such as using **bitmasks and shuffle instructions** for constant-time lookups (see the sketch below), and extending the definition of whitespace beyond the standard space character. Overall, the discussion showcases a range of techniques for squeezing performance out of word counting algorithms.
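As a rough illustration of the nibble-table ("shuffle") lookup alluded to in the discussion, here is a minimal NEON sketch of my own — the table layout and bit assignments are assumptions, not code from the thread. It classifies all six whitespace characters with two `vqtbl1q_u8` table lookups instead of six compares, and checks itself against the scalar definition:

```c
#include <arm_neon.h>
#include <stdint.h>
#include <stdio.h>

// Hypothetical nibble-table whitespace classifier (not from the thread).
// A byte b is whitespace iff lo_table[b & 0x0F] & hi_table[b >> 4] != 0.
// Bit 0 marks the 0x09..0x0D group ('\t','\n','\v','\f','\r');
// bit 1 marks 0x20 (' ').
static uint8x16_t ws_mask_tbl(uint8x16_t bytes)
{
    static const uint8_t lo[16] = {
        0x02, 0, 0, 0, 0, 0, 0, 0,       // low nibbles 0x0..0x7
        0, 0x01, 0x01, 0x01, 0x01, 0x01, // low nibbles 0x8..0xD
        0, 0                             // low nibbles 0xE, 0xF
    };
    static const uint8_t hi[16] = {
        0x01, 0, 0x02, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
    };
    uint8x16_t lo_m = vqtbl1q_u8(vld1q_u8(lo), vandq_u8(bytes, vdupq_n_u8(0x0F)));
    uint8x16_t hi_m = vqtbl1q_u8(vld1q_u8(hi), vshrq_n_u8(bytes, 4));
    // A non-zero AND means whitespace; turn it into an all-ones lane mask.
    return vmvnq_u8(vceqq_u8(vandq_u8(lo_m, hi_m), vdupq_n_u8(0)));
}

int main(void)
{
    // Verify against the scalar definition for all 256 byte values.
    for (int v = 0; v < 256; v += 16) {
        uint8_t in[16], out[16];
        for (int j = 0; j < 16; ++j) in[j] = (uint8_t)(v + j);
        vst1q_u8(out, ws_mask_tbl(vld1q_u8(in)));
        for (int j = 0; j < 16; ++j) {
            int b = v + j;
            int expect = (b == ' ' || b == '\n' || b == '\r' ||
                          b == '\t' || b == '\v' || b == '\f') ? 0xFF : 0x00;
            if (out[j] != expect) {
                printf("mismatch at 0x%02X\n", b);
                return 1;
            }
        }
    }
    printf("nibble-table classifier agrees on all 256 bytes\n");
    return 0;
}
```

The idea is that a byte belongs to the set exactly when the masks selected by its low and high nibbles share a bit, so the membership test costs a constant two shuffles regardless of how many characters are in the set.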

Original text

I've written some progressively faster word counting programs. First, we'll start with Python, and then we'll drop down to C, and finally, we'll use single instruction, multiple data (SIMD) programming to go as fast as possible.

The task is to count the words in an ASCII text file. For example, Hello there! contains 2 words, and my 1 GiB benchmark text file contains 65 million words.

At a high level: read bytes, scan while tracking minimal state, write the count to stdout.

These are the results from my Apple M1 Pro that I'll dig into:

  • Python (byte loop): 89.6 s
  • Python + re: 13.7 s
  • C (scalar loop): 1.205 s
  • C + ARM NEON SIMD: 249 ms
  • C + ARM NEON SIMD + threads: 181 ms

You can also jump straight to the source files.

First try (89.6 seconds)

Here's a reasonable first attempt, but you might spot some obvious performance-related deficiencies.

We read each byte and check if it's part of a set of whitespace characters, while tracking the word count, and whether there was previous whitespace.

```python
# 0_mvp.py
import sys

ws = set(b" \n\r\t\v\f")

prev_ws = True
words = 0

with open(sys.argv[1], "rb") as f:
    for byte_value in f.read():
        cur_ws = byte_value in ws
        if not cur_ws and prev_ws:
            words += 1
        prev_ws = cur_ws

print(words)
```

This program is horrendously slow. It takes 89.6 seconds on my Apple M1 Pro. Python code runs for every byte, incurring interpreter dispatch and object checks again and again.

Using CPython efficiently (13.7 seconds)

There's a big improvement we can make before having to leave Python behind. We can make the program faster by making sure all the work happens in C, in tight loops, with no per-byte Python overhead.

CPython's re module is a thin Python wrapper around a C extension named _sre, "Secret Labs' Regular Expression Engine." Patterns are parsed in Python into a compact bytecode, then executed by the C engine. So a call like re.finditer(pattern, data) spends nearly all of its time inside C, scanning contiguous memory with pointer arithmetic and table lookups.

```python
# 1_c_regex.py
import re
import sys

with open(sys.argv[1], "rb") as f:
    pattern = rb"[^ \n\r\t\v\f]+"
    data = f.read()
    words = sum(1 for _ in re.finditer(pattern, data))

print(words)
```

This version is ~6× faster than the initial Python version.

First try in C (1205 milliseconds)

I think the above Python version is very close to the limit that we can get with straightforward Python (e.g. no NumPy, no threads).

By porting our first Python attempt to C, we're rewarded with a ~74× speedup.

```c
// 2_mvp.c
size_t nread = fread(data, 1, file_size, fp);
fclose(fp);

// ..

size_t words = 0;
bool prev_ws = true;
for (size_t i = 0; i < nread; ++i)
{
    unsigned char c = data[i];
    bool cur_ws = is_ws(c);
    if (!cur_ws && prev_ws)
    {
        ++words;
    }
    prev_ws = cur_ws;
}

free(data);
printf("%zu\n", words);
```

Why is it so much quicker? Before, re.finditer(...) was creating a Python Match object for every word (millions of heap allocations + GC pressure).

The regex engine was also doing extra work when it searched, matched, backtracked, and performed bookkeeping. Even though that's in C, it's still building Python objects for the iterator.

In comparison, this version's C loop is a single pass over bytes with two booleans (prev_ws, cur_ws) and a simple branch. Compilers turn this into very tight code, i.e., no per-word allocations, and no callbacks into the interpreter.

Adding SIMD (249 milliseconds)

Single instruction, multiple data (SIMD) is a form of data-level parallelism where one instruction operates on multiple data elements in parallel. For example, we can create a mask of all the space characters in a 16-byte chunk of our input, with one instruction, like this:

```c
// bytes: [ H,  e,  l,  l,  o, SP, SP, SP,  t,  h,  e,  r,  e,  !, SP, SP]  (SP = space)
// ws1:   [00, 00, 00, 00, 00, FF, FF, FF, 00, 00, 00, 00, 00, 00, FF, FF]
uint8x16_t ws1 = vceqq_u8(bytes, vdupq_n_u8(' '));
```

But our word counting program doesn't require us to count the space characters. We need to count non-whitespace bytes that immediately follow a whitespace byte (any of six different whitespace characters).

Looping over each 16-byte chunk from our buffer, we need to:

  • Load the bytes into a register
  • Create a whitespace mask
  • Create a previous-byte whitespace mask
  • Create a word-start mask (combine the whitespace and previous-byte whitespace mask)
  • Sum the word-start mask to get the word count for this chunk
  • Carry forward the whitespace state

Here's the core section of this C version with NEON SIMD instructions, showing examples of what the masks look like.

```c
// 3_simd.c
size_t words = 0;

// (0) Previous-byte whitespace (defaults to whitespace for start-of-file)
uint8x16_t prev_ws_vec = vdupq_n_u8(prev_ws ? 0xFF : 0x00);

size_t nvec = nread & ~(size_t)15;
for (; i < nvec; i += 16)
{
    // (1) Load 16 bytes from the buffer
    // [ H,  e,  l,  l,  o, SP, SP, SP,  t,  h,  e,  r,  e,  !, SP, SP]
    uint8x16_t bytes = vld1q_u8(buffer + i);

    // (2) Per-lane whitespace masks (0xFF for ws, 0x00 for non-ws)
    uint8x16_t ws1 = vceqq_u8(bytes, vdupq_n_u8(' '));
    uint8x16_t ws2 = vceqq_u8(bytes, vdupq_n_u8('\n'));
    uint8x16_t ws3 = vceqq_u8(bytes, vdupq_n_u8('\r'));
    uint8x16_t ws4 = vceqq_u8(bytes, vdupq_n_u8('\t'));
    uint8x16_t ws5 = vceqq_u8(bytes, vdupq_n_u8('\v'));
    uint8x16_t ws6 = vceqq_u8(bytes, vdupq_n_u8('\f'));

    // (3) Combine all masks into a single mask
    // [00, 00, 00, 00, 00, FF, FF, FF, 00, 00, 00, 00, 00, 00, FF, FF]
    uint8x16_t ws = vorrq_u8(ws1, ws2);
    ws = vorrq_u8(ws, ws3);
    ws = vorrq_u8(ws, ws4);
    ws = vorrq_u8(ws, ws5);
    ws = vorrq_u8(ws, ws6);

    // (4) Previous-byte whitespace for each lane
    // [FF, 00, 00, 00, 00, 00, FF, FF, FF, 00, 00, 00, 00, 00, 00, FF]
    uint8x16_t prev_ws_shifted = vextq_u8(prev_ws_vec, ws, 15);

    // (5) Word-start mask: non-ws AND prev-ws (two word starts, 'H' and 't')
    // [FF, 00, 00, 00, 00, 00, 00, 00, FF, 00, 00, 00, 00, 00, 00, 00]
    uint8x16_t non_ws = vmvnq_u8(ws);
    uint8x16_t start_mask = vandq_u8(non_ws, prev_ws_shifted);

    // (6) Convert lanes to 1 by shifting right by 7 bits
    // [ 1,  0,  0,  0,  0,  0,  0,  0,  1,  0,  0,  0,  0,  0,  0,  0]
    uint8x16_t ones = vshrq_n_u8(start_mask, 7); // 2 ones

    // (7) Sum all lanes
    words += (size_t)vaddvq_u8(ones); // 2

    // (8) Carry state for the next iteration
    prev_ws_vec = ws;
}

// Remainder is handled with a scalar loop
```

This SIMD version is 5× faster because it does less work per byte of input.

Even though there are many more lines of code, it compiles down to efficient operations that run on 16-byte chunks. The compiler (clang -O3) adds further optimizations. For example, there are six vceqq_u8 calls in this version, for each of the whitespace characters: b" \n\r\t\v\f".

This doesn't result in six vector compare-equal (cmeq) instructions because the compiler is able to work out that, while we need exact-match checks for ' ' and '\n', there are more efficient steps for the following groups:

  • '\t' (0x09) and '\r' (0x0D)
    • Perform a bitwise AND with 0xFB and check if the result equals 0x09.
      Both 0x0D & 0xFB = 0x09 and 0x09 & 0xFB = 0x09.
  • '\v' (0x0B) and '\f' (0x0C)
    • Add 0xF5 (unsigned 8-bit add with wraparound) and check if the result is less than 2.
      Adding 0xF5 gives either 0x00 or 0x01, both of which are < 2.

While it looks like more instructions are touching the data, the key thing is that we're replacing an expensive sequence (another equality compare + mask merge + constant load) with a very cheap transform plus a single comparison.
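To convince yourself that the two transforms really are equivalent to the exact compares, here's a small scalar sketch (my own check, not from the post) that tests both tricks against all 256 byte values:

```c
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    for (int v = 0; v < 256; ++v) {
        uint8_t b = (uint8_t)v;

        // Trick 1: '\t' (0x09) or '\r' (0x0D) <=> (b & 0xFB) == 0x09.
        // 0x09 and 0x0D differ only in bit 2, which 0xFB clears.
        int exact1 = (b == 0x09 || b == 0x0D);
        int trick1 = ((b & 0xFB) == 0x09);

        // Trick 2: '\v' (0x0B) or '\f' (0x0C) <=> (uint8_t)(b + 0xF5) < 2.
        // 0x0B + 0xF5 wraps to 0x00 and 0x0C + 0xF5 wraps to 0x01.
        int exact2 = (b == 0x0B || b == 0x0C);
        int trick2 = ((uint8_t)(b + 0xF5) < 2);

        if (exact1 != trick1 || exact2 != trick2) {
            printf("mismatch at 0x%02X\n", v);
            return 1;
        }
    }
    printf("both transforms match the exact compares for all 256 bytes\n");
    return 0;
}
```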

Finding the rest of the compiler's optimizations is left as an exercise to the reader: https://godbolt.org/z/81r9Ts8Wb.

Using the rest of the CPU (181 milliseconds)

When we applied SIMD instructions above, we reached near-optimal single-core performance. However, my laptop and the device you’re reading this on have multiple cores — which my programs haven't been using.

Counting the number of words in a text file is a classic example of something that can be parallelized across threads. The file can be chunked up, and then each chunk is handed to a thread to count the words. When each thread finishes, their counts are summed.

There is one wrinkle: each thread needs to look at the byte just before the start of its chunk to know the initial whitespace state.

For example, with an input like [a, SP, M, i, c, r, o, s, e, c, o, n, d, SP, SP, SP] (SP is a space):

Thread one gets [a, SP, M, i, c, r, o, s] and counts a and M as word starts (2).

Thread two gets [e, c, o, n, d, SP, SP, SP] but, seeing that the byte before its chunk is not whitespace, knows that the e is not a word start (0).

The word count is the sum of these totals: 2 + 0 = 2.

The previous version's "count words" function is unchanged, but we call it per chunk.

```c
// 4_threads.c

// Count words in a contiguous memory range using NEON, given the starting prev-ws state
static size_t count_words_neon(
    const unsigned char *data,
    size_t len,
    bool prev_ws
);

// Args for each thread
typedef struct ThreadArgs
{
    const unsigned char *base;
    size_t length;
    bool start_prev_ws; // Derived from previous byte or SOF
    size_t result;
} ThreadArgs;

// Worker which handles an equal chunk of the file
static void *worker(void *p)
{
    ThreadArgs *args = (ThreadArgs *)p;
    args->result = count_words_neon(args->base, args->length, args->start_prev_ws);
    return NULL;
}

int main(int argc, char **argv)
{
    // ..
    size_t launched = 0;
    for (size_t t = 0; t < max_threads; ++t)
    {
        size_t start = t * chunk;
        if (start >= filesize)
            break;
        size_t end = start + chunk;
        if (end > filesize)
            end = filesize;
        size_t len = end - start;

        // For each chunk, we need to seed with the previous-byte whitespace.
        // Start-of-file is "as if" preceded by whitespace
        bool start_prev_ws = true;
        if (start > 0)
        {
            start_prev_ws = is_ws_scalar(data[start - 1]);
        }

        args[launched].base = data + start;
        args[launched].length = len;
        args[launched].start_prev_ws = start_prev_ws;
        args[launched].result = 0;

        if (pthread_create(&threads[launched], NULL, worker, &args[launched]) != 0)
        {
            perror("pthread_create");
            free(data);
            return 1;
        }
        ++launched;
    }

    size_t total_words = 0;
    for (size_t i = 0; i < launched; ++i)
    {
        (void)pthread_join(threads[i], NULL);
        total_words += args[i].result;
    }

    // ..
    printf("%zu\n", total_words);
    return 0;
}
```

My expectation was that this version would be N times faster than the previous one with N threads, since each thread's work is entirely independent and each can saturate a CPU core. But it is only ~1.38 times faster.

My understanding is that my NEON loop is already so fast that it's likely limited by memory bandwidth, not CPU speed. So once one or two threads are streaming bytes at near-max RAM speed, the extra threads just fight over the same bandwidth, giving little benefit.
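One way to sanity-check this explanation is to time a loop that only streams the buffer and does essentially no compute; if it runs at roughly the same GiB/s as the word counter, the counter is memory-bound. Here's a minimal single-threaded sketch of such a probe — the buffer size and the XOR-reduce are arbitrary choices of mine, not part of the original benchmark:

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

// Stream a buffer once, doing almost no compute, to estimate the
// single-thread read bandwidth ceiling. The XOR-reduce keeps the
// compiler from optimizing the reads away.
int main(void)
{
    size_t n = (size_t)1 << 30; // 1 GiB
    uint64_t *buf = malloc(n);
    if (!buf) return 1;
    for (size_t i = 0; i < n / sizeof(uint64_t); ++i)
        buf[i] = i; // touch every page so the timing isn't measuring faults

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t acc = 0;
    for (size_t i = 0; i < n / sizeof(uint64_t); ++i)
        acc ^= buf[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("acc=%llu, %.2f GiB/s\n",
           (unsigned long long)acc, (n / (1024.0 * 1024.0 * 1024.0)) / secs);
    free(buf);
    return 0;
}
```

Repeating the measurement with increasing thread counts would show where the shared-bandwidth ceiling sits.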

The final results:

```
./bench.sh

Benchmark 1: python3 0_mvp.py bench.txt
  Time (mean ± σ):     89.556 s ±  7.214 s    [User: 87.205 s, System: 0.484 s]
  Range (min … max):   82.393 s … 96.820 s    3 runs

Benchmark 2: python3 1_c_regex.py bench.txt
  Time (mean ± σ):     13.739 s ±  0.136 s    [User: 13.433 s, System: 0.158 s]
  Range (min … max):   13.659 s … 13.896 s    3 runs

Benchmark 3: ./bin/2_mvp bench.txt
  Time (mean ± σ):      1.205 s ±  0.008 s    [User: 1.015 s, System: 0.115 s]
  Range (min … max):    1.198 s …  1.214 s    3 runs

Benchmark 4: ./bin/3_simd bench.txt
  Time (mean ± σ):     249.2 ms ±   6.3 ms    [User: 79.5 ms, System: 100.5 ms]
  Range (min … max):   243.8 ms … 262.6 ms    11 runs

Benchmark 5: ./bin/4_threads bench.txt
  Time (mean ± σ):     181.1 ms ±   4.1 ms    [User: 96.5 ms, System: 93.5 ms]
  Range (min … max):   177.4 ms … 193.7 ms    16 runs

Summary
  ./bin/4_threads bench.txt ran
      1.38 ±  0.05 times faster than ./bin/3_simd bench.txt
      6.65 ±  0.16 times faster than ./bin/2_mvp bench.txt
     75.84 ±  1.86 times faster than python3 1_c_regex.py bench.txt
    494.38 ± 41.35 times faster than python3 0_mvp.py bench.txt
```

With the SIMD + threads version processing at ~5.52 GiB/s (1 GiB in 181.1 ms).

If the SIMD stuff went over your head, "Why do we even need SIMD instructions?" is a great introduction by Daniel Lemire, who's written a lot of interesting stuff about SIMD optimizations, e.g. "Scan HTML faster with SIMD instructions: Chrome edition".

Let me know if I've missed any obvious optimizations.

All the source files, and the benchmark script, can be found at healeycodes/counting-words-at-simd-speed.
