Show HN: Forkrun – NUMA-aware shell parallelizer (50×–400× faster than parallel)

Original link: https://github.com/jkool702/forkrun

## forkrun: A High-Performance Shell Parallelizer

jkool702 has released **forkrun**, a drop-in replacement for GNU Parallel and `xargs -P` that dramatically accelerates shell-based data preparation, achieving **50×–400×** speedups on modern CPUs, especially on NUMA architectures. Benchmarks on an i9-7940x show forkrun dispatching **200,000+ batches/sec** at **95–99% CPU utilization** across all cores, versus roughly 500 batches/sec and ~6% utilization for GNU Parallel.

The key to forkrun's performance is its "born-local" design, which minimizes cross-socket memory traffic through NUMA awareness. Its pipeline has four stages (ingest, index, claim, and reclaim), all optimized for physical locality using techniques such as `splice()`, SIMD scanning, and lock-free ring buffers.

Installation is simple: download and source a single bash script with an embedded self-extracting C extension (no external dependencies), then use it exactly like GNU Parallel by replacing `parallel` with `frun`. Adaptive tuning optimizes batch sizes automatically with no user configuration. forkrun requires Bash 4.0+ and Linux kernel 3.17+; ongoing development prioritizes failure isolation and cluster integration. More information, benchmarks, and source code are on [GitHub](https://github.com/jkool702/forkrun).

License: MIT

forkrun is a self-tuning, drop-in replacement for GNU Parallel and xargs -P that accelerates shell-based data preparation by 50×–400× on modern CPUs and scales linearly on NUMA architectures.

forkrun achieves:

  • 200,000+ batch dispatches/sec (vs ~500 for GNU Parallel)
  • ~95–99% CPU utilization across all cores (vs ~6% for GNU Parallel)
  • Near-zero cross-socket memory traffic (NUMA-aware “born-local” design)

forkrun is built for high-frequency, low-latency workloads on deep NUMA hardware — a regime where existing tools leave most cores idle due to IPC overhead and cross-socket data migration.


🚀 Quick Start (Installation & Usage)

forkrun is distributed as a single bash file with an embedded, self-extracting compiled C extension. There are no external dependencies (no Perl, no Python).

Download and source it directly:

source <(curl -sL https://raw.githubusercontent.com/jkool702/forkrun/main/frun.bash)

(Note: Sourcing the script sets up the required C loadable builtins in your shell environment).

Once sourced, frun acts as a drop-in parallelizer:

frun my_bash_func < inputs.txt             # parallelize custom bash functions natively!
cat file_list | frun -k sed 's/old/new/'   # pipe-based input, ordered output
frun -k -s sort < records.tsv              # stdin-passthrough, ordered output
frun -s -I 'gzip -c >{ID}.gz' < raw_logs   # stdin-passthrough, unique output names

Verifiable Builds: The embedded C-extension is compiled and injected transparently via GitHub Actions. You can trace the git blame of the Base64 blob directly to the public CI workflow run that compiled forkrun_ring.c, guaranteeing the binary contains no hidden malicious code.


⚡ Benchmarks (14-core/28-thread i9-7940x, 100M lines)

| Workload | forkrun | GNU Parallel | Speedup | Notes |
|---|---|---|---|---|
| Default (array + fully-quoted args, no-op) | 24 M lines/s | 58 k lines/s | ~415× | forkrun default mode |
| Ordered output (`-k`, no-op) | 24.5 M lines/s | 57 k lines/s | ~430× | ordering is free in forkrun |
| `echo` (line args) | 22.6 M lines/s | ~55 k lines/s | ~410× | typical shell command |
| `printf '%s\n'` (I/O heavy) | 12.8 M lines/s | ~58 k lines/s | ~220× | formatting + output |
| `-s` stdin passthrough (no-op) | 893 M lines/s | 6.05 M lines/s (`--pipe`) | ~148× | streaming / splice |
| `-b 524288` byte batches (no-op) | 1.54 B lines/s | 6.02 M lines/s (`--pipe`) | ~256× | kernel-limited |

Average CPU utilization across ~400 benchmarks

  • forkrun: 95% (27.1 / 28 cores) — No centralized dispatcher; all 27.1 cores do actual work.
  • GNU Parallel: 6% (2.68 / 28 cores) — 1 full core used strictly for dispatching work; 1.68 cores doing actual work.

🧠 How It Works: The Physics of forkrun

Traditional tools like GNU Parallel use heavy regex parsing and IPC dispatch loops that bottleneck multi-socket servers. forkrun operates completely differently. The pipeline has four stages, each designed to preserve physical locality:

  1. Ingest (Born-Local NUMA): Data is splice()'d from stdin into a shared memfd. This is PFS-friendly (avoids Lustre/NFS metadata storms). On multi-socket systems, set_mempolicy(MPOL_BIND) places each chunk's pages on a target NUMA node before any worker touches them. This placement is driven by real-time backpressure from the per-node indexers, making NUMA distribution completely self-load-balancing.
  2. Index: Per-node indexers (pinned to their socket) find record boundaries using AVX2/NEON SIMD scanning at memory bandwidth. They dynamically batch based on runtime conditions, then publish offset markers into a per-node lock-free ring buffer.
  3. Claim (Contention-Free): Workers claim batches via a single atomic_fetch_add — no CAS retry loops, no locks, no contention. Overshoots are handled by depositing remainders into an escrow pipe for idle workers to steal.
  4. Reclaim: A background fallow thread punches holes behind completed work via fallocate(PUNCH_HOLE), bounding memory usage without breaking the offset coordinate system.
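The born-local ingest path (stage 1) can be sketched in a few lines of C. This is an illustrative reduction, not forkrun's actual `forkrun_ring.c`: it splices a chunk from the input pipe into a memfd with no userspace copy, then makes a best-effort `mbind()` call (the raw-syscall cousin of `set_mempolicy(MPOL_BIND)`) to place the chunk's pages on a target NUMA node before any worker maps them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_BIND
#define MPOL_BIND 2   /* from <numaif.h>; defined here to avoid the libnuma dep */
#endif

/* Splice a chunk from the input pipe into a shared memfd (pages move
 * in-kernel, no userspace copy), then best-effort bind those pages to
 * the target NUMA node before any worker touches them.
 * Returns the memfd, or -1 on error. */
static int ingest_chunk(int pipe_rd, size_t len, int target_node) {
    int memfd = syscall(SYS_memfd_create, "chunk", 0);
    if (memfd < 0 || ftruncate(memfd, (off_t)len) < 0) return -1;

    off_t off = 0;
    while ((size_t)off < len) {
        ssize_t n = splice(pipe_rd, NULL, memfd, &off, len - (size_t)off, 0);
        if (n <= 0) return -1;
    }

    /* "Born local": bind the chunk's pages to target_node. mbind() has
     * no glibc wrapper; failure on a non-NUMA kernel is harmless. */
    void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, memfd, 0);
    if (p != MAP_FAILED) {
        unsigned long nodemask = 1UL << target_node;
        syscall(SYS_mbind, p, len, MPOL_BIND, &nodemask,
                sizeof(nodemask) * 8 + 1, 0);
        munmap(p, len);
    }
    return memfd;
}
```

The real pipeline chooses `target_node` from indexer backpressure; here it is just a parameter.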
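Stage 3's contention-free claim is the easiest piece to show in isolation. A minimal sketch (names invented for illustration): a single `atomic_fetch_add` hands out batch indices, and an index at or past the published watermark signals overshoot, which the real implementation deposits into the escrow pipe for idle workers to steal.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical ring state: batch indices [0, published) are runnable. */
static atomic_ulong next_batch;

/* Claim one batch with a single atomic fetch-add: no CAS retry loop and
 * no lock, so concurrent claimers never contend beyond one cache-line
 * bump. Returns the claimed index, or -1 on overshoot (the remainder
 * would go to the escrow pipe in forkrun proper). */
static long claim_batch(unsigned long published) {
    unsigned long i = atomic_fetch_add_explicit(&next_batch, 1,
                                                memory_order_relaxed);
    return i < published ? (long)i : -1;
}
```

With three published batches, successive calls return 0, 1, 2, then -1; every index is handed out exactly once regardless of how many workers race.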

Adaptive tuning is fully automatic. A PID-based controller discovers the optimal batch size in O(log L) steps and continuously adjusts based on input rate, consumption rate, and worker starvation — with no user -n or -j configuration required.
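The tuning loop can be approximated with a toy controller. The sketch below is a hypothetical bisection stand-in for forkrun's actual PID controller, and its thresholds and signals are invented; it only shows where the O(log L) discovery phase comes from: each adjustment halves the search interval [lo, hi].

```c
#include <assert.h>

/* Hypothetical state for a self-tuning batch-size controller. */
typedef struct {
    long batch;    /* current batch size, in records            */
    long lo, hi;   /* bisection bounds: the O(log L) discovery  */
} tuner_t;

/* One control step. High worker starvation: batches are too large
 * (too few in flight), so shrink. A hot dispatcher: batches are too
 * small (too many dispatches), so grow. Each change bisects [lo, hi]. */
static long tune_step(tuner_t *t, double starvation, double dispatch_load) {
    if (starvation > 0.05)        t->hi = t->batch;
    else if (dispatch_load > 0.5) t->lo = t->batch;
    else return t->batch;  /* steady state: leave the batch size alone */
    t->batch = t->lo + (t->hi - t->lo) / 2;
    return t->batch;
}
```

A real PID loop would also weigh input rate and consumption rate continuously rather than reacting to thresholds, but the convergence behavior is the same in spirit.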


🛠 Requirements & Dependencies

forkrun is designed to run anywhere with zero friction:

  • Required: Bash ≥ 4.0 (Bash 5.1+ highly recommended for array performance), Linux Kernel ≥ 3.17 (for memfd).

🏛️ Legacy Version (v2)

With the release of v3.0.0, forkrun has transitioned to a high-performance C-ring architecture (frun.bash). The older pure-Bash, coproc-based v2 (forkrun.bash) is still available in the legacy/ directory and remains a fully functional, high-performance bash stream parallelizer, though v3 is recommended for all modern workloads. forkrun v1 is not recommended for use.


🧭 Limitations & Roadmap

forkrun currently guarantees correctness under the assumption that at least one worker per NUMA node remains alive until its assigned work completes — a safe assumption for local shell operations on healthy compute nodes.

Priorities for the development roadmap include:

  • Failure isolation and per-batch retries to handle transient worker crashes.
  • Resume-after-interruption state saving to gracefully handle preempted cluster/Slurm jobs.
  • Deeper integration with facility workload managers.

(If forkrun is saving your institution compute-hours, please consider sponsoring its development to accelerate these features!)
