Show HN: K(r)ep - A high-performance string search utility

Original link: https://github.com/davidesantangelo/krep

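Krep is a high-performance string search utility designed to process large files and directories quickly and efficiently. It is not intended to replace feature-rich tools such as grep; instead, it focuses on delivering the fastest possible search for common use cases behind a clean interface. Krep uses several search algorithms, including Boyer-Moore-Horspool, KMP, and Aho-Corasick, together with SIMD acceleration (SSE4.2, AVX2, NEON). It relies on memory-mapped I/O, multithreading, and optimized data structures to maximize throughput. Recursive directory search skips binary files and irrelevant directories. In the published benchmark, Krep is much faster than grep and slightly faster than ripgrep. Krep is licensed under the BSD-2 license.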

Hacker News submission: Show HN: K(r)ep - A high-performance string search utility (github.com/davidesantangelo), posted by daviducolo; 13 points 34 minutes after posting.


Original text

krep is an optimized string search utility designed for maximum throughput and efficiency when processing large files and directories. It is built with performance in mind, offering multiple search algorithms and SIMD acceleration when available.

Note:
Krep is not intended to be a full replacement or direct competitor to feature-rich tools like grep or ripgrep. Instead, it aims to be a minimal, efficient, and pragmatic tool focused on speed and simplicity.

Krep provides the essential features needed for fast searching, without the extensive options and complexity of more comprehensive search utilities. Its design philosophy is to deliver the fastest possible search for the most common use cases, with a clean and minimal interface.

The Story Behind the Name

The name "krep" has an interesting origin. It is inspired by the Icelandic word "kreppan," which means "to grasp quickly" or "to catch firmly." I came across this word while researching efficient techniques for pattern recognition.

Just as skilled fishers identify patterns in the water to locate fish quickly, I designed "krep" to find patterns in text with maximum efficiency. The name is also short and easy to remember—perfect for a command-line utility that users might type hundreds of times per day.

Key Features

Multiple search algorithms: Boyer-Moore-Horspool, KMP, Aho-Corasick for optimal performance across different pattern types
SIMD acceleration: Uses SSE4.2, AVX2, or NEON instructions when available for blazing-fast searches
Memory-mapped I/O: Maximizes throughput when processing large files
Multi-threaded search: Automatically parallelizes searches across available CPU cores
Regex support: POSIX Extended Regular Expression searching
Multiple pattern search: Efficiently search for multiple patterns simultaneously
Recursive directory search: Skip binary files and common non-code directories
Colored output: Highlights matches for better readability
Specialized algorithms: Optimized handling for single-character and short patterns
Match limiting: Stop searching a file after a specific number of matching lines are found
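
To illustrate what "multiple pattern search" with Aho-Corasick means in practice, here is a minimal Python sketch of the textbook algorithm. It is not krep's C implementation; the function names `build_automaton` and `search_all` are hypothetical, and it works on `str` rather than raw bytes for readability.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto = [{}]   # goto[node][char] -> child node
    fail = [0]    # fail[node] -> node for the longest proper suffix in the trie
    out = [[]]    # out[node] -> patterns that end at this node
    for pattern in patterns:
        node = 0
        for ch in pattern:
            child = goto[node].get(ch)
            if child is None:
                child = len(goto)
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = child
            node = child
        out[node].append(pattern)

    # Breadth-first pass: depth-1 nodes keep fail = 0, deeper nodes inherit
    # the failure link (and the outputs) of their longest matching suffix.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search_all(text, patterns):
    """Return (start_offset, pattern) for every match, in one pass over text."""
    goto, fail, out = build_automaton(patterns)
    node = 0
    hits = []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pattern in out[node]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits
```

The point of the automaton is that the text is scanned once regardless of how many patterns are supplied, which is presumably why it is used for the `-e` and `-f` multi-pattern modes.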

Installation

# Clone the repository
git clone https://github.com/davidesantangelo/krep.git
cd krep

# Build and install
make
sudo make install

# Uninstall
sudo make uninstall

The binary will be installed to /usr/local/bin/krep by default.

Prerequisites

GCC or compatible C compiler
POSIX-compliant system (Linux, macOS, BSD)
pthread support

Override default optimization settings in the Makefile:

# Disable architecture-specific optimizations
make ENABLE_ARCH_DETECTION=0

Usage

krep [OPTIONS] PATTERN [FILE | DIRECTORY]
krep [OPTIONS] -e PATTERN [FILE | DIRECTORY]
krep [OPTIONS] -f FILE [FILE | DIRECTORY]
krep [OPTIONS] -s PATTERN STRING_TO_SEARCH
krep [OPTIONS] PATTERN < FILE
cat FILE | krep [OPTIONS] PATTERN

Search for a fixed string in a file:

krep -F "value: 100%" config.ini

Search recursively:

krep -r "function" ./project

Whole word search (matches only complete words):

krep -w 'cat' samples/text.en
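
As a rough illustration of what a whole-word match involves, the following Python sketch accepts a hit only when the bytes on both sides are not word characters. It assumes the grep convention that word characters are ASCII letters, digits, and underscore; krep's exact rule is not stated here, and the helper names are hypothetical.

```python
def is_word_byte(b):
    """True for ASCII letters, digits and underscore (assumed word class)."""
    return (b == 0x5F or 0x30 <= b <= 0x39
            or 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A)


def whole_word_matches(text, word):
    """Return offsets where `word` occurs in `text` as a complete word."""
    hits = []
    if not word:
        return hits
    pos = text.find(word)
    while pos != -1:
        end = pos + len(word)
        # A boundary is the start/end of the text or a non-word byte.
        left_ok = pos == 0 or not is_word_byte(text[pos - 1])
        right_ok = end == len(text) or not is_word_byte(text[end])
        if left_ok and right_ok:
            hits.append(pos)
        pos = text.find(word, pos + 1)
    return hits
```

Under this assumption, searching for `cat` matches "cat" and "cat's" but not "concat" or "category".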

Use with piped input:

cat FILE | krep [OPTIONS] PATTERN

Command Line Options

-i, --ignore-case Case-insensitive search
-c, --count Count matching lines only
-o, --only-matching Print only the matched parts of lines
-e PATTERN, --pattern=PATTERN Specify pattern(s). Can be used multiple times.
-f FILE, --file=FILE Read patterns from FILE, one per line.
-m NUM, --max-count=NUM Stop searching each file after finding NUM matching lines.
-E, --extended-regexp Use POSIX Extended Regular Expressions
-F, --fixed-strings Interpret pattern as fixed string(s) (default unless -E is used)
-r, --recursive Recursively search directories
-t NUM, --threads=NUM Use NUM threads for file search (default: auto)
-s STRING, --string=STRING Search in the provided STRING instead of file(s)
-w, --word-regexp Match only whole words
--color[=WHEN] Control color output ('always', 'never', 'auto')
--no-simd Explicitly disable SIMD acceleration
-v, --version Show version information
-h, --help Show help message

Comparing performance on the same text file with identical search pattern:

Tool	Time (seconds)	CPU Usage
krep	0.106	328%
grep	4.400	99%
ripgrep	0.115	97%

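In this test, krep is approximately 41.5x faster than grep and about 8% faster than ripgrep in wall-clock time (0.106 s vs 0.115 s), while using several cores (328% CPU vs 97% for ripgrep). Benchmarks were performed on a Mac Mini M4 with 24 GB RAM.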
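
The ratios can be reproduced from the table with a few lines of Python. The CPU-time figure is only a rough estimate, computed here as wall time multiplied by CPU usage; that formula is an assumption of this sketch, not something the benchmark reports.

```python
# (wall-clock seconds, CPU usage in percent), copied from the table above
timings = {
    "krep": (0.106, 328),
    "grep": (4.400, 99),
    "ripgrep": (0.115, 97),
}


def speedup(baseline, tool="krep"):
    """How many times faster `tool` is than `baseline` in wall-clock time."""
    return timings[baseline][0] / timings[tool][0]


def cpu_seconds(tool):
    """Approximate total CPU time: wall time scaled by CPU usage."""
    wall, percent = timings[tool]
    return wall * percent / 100


print(f"krep vs grep:    {speedup('grep'):.1f}x")
print(f"krep vs ripgrep: {speedup('ripgrep'):.2f}x")
for tool in timings:
    print(f"{tool:8s} ~{cpu_seconds(tool):.3f} CPU-seconds")
```

Read this way, the wall-clock advantage over ripgrep comes with higher total CPU consumption, since krep spreads the work over multiple cores in this run.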

The benchmarks above were conducted using the subtitles2016-sample.en.gz dataset, which can be obtained with:

curl -LO 'https://burntsushi.net/stuff/subtitles2016-sample.en.gz'

Krep achieves its high performance through several key techniques:

1. Smart Algorithm Selection

Krep automatically selects the optimal search algorithm based on the pattern and available hardware:

Boyer-Moore-Horspool for most literal string searches
Knuth-Morris-Pratt (KMP) for very short patterns and repetitive patterns
memchr optimization for single-character patterns
SIMD Acceleration (SSE4.2, AVX2, or NEON) for compatible hardware
Regex Engine for regular expression patterns
Aho-Corasick for efficient multiple pattern matching
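
For readers unfamiliar with Boyer-Moore-Horspool, the literal-search algorithm named first in the list, here is a minimal Python sketch of the standard technique. It is illustrative only: krep itself is written in C, and the function name `horspool_search` is hypothetical.

```python
def horspool_search(text, pattern):
    """Return the start offsets of every occurrence of pattern in text (bytes)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []

    # Bad-character table: how far the window may slide, keyed by the text
    # byte aligned with the last pattern position. Bytes that do not occur
    # in pattern[:-1] allow a full shift of m.
    shift = [m] * 256
    for i in range(m - 1):
        shift[pattern[i]] = m - 1 - i

    matches = []
    pos = 0
    while pos <= n - m:
        if text[pos:pos + m] == pattern:
            matches.append(pos)
        # Slide by the table entry for the last byte of the current window.
        pos += shift[text[pos + m - 1]]
    return matches
```

Because the shift can be as large as the pattern length, longer patterns let the search skip most of the input, which is why this family of algorithms suits general literal searches.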

2. Multi-threading Architecture

Krep utilizes parallel processing to dramatically speed up searches:

Automatically detects available CPU cores
Divides large files into chunks for parallel processing
Implements thread pooling for maximum efficiency
Optimized thread count selection based on file size
Careful boundary handling to ensure no matches are missed
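
The chunking and boundary handling described above can be sketched as follows. This is one common approach, assumed here for illustration rather than taken from krep's source: each worker owns the matches that start in its chunk, and reads `len(pattern) - 1` extra bytes past the chunk end so matches straddling a boundary are seen exactly once. The function names are hypothetical, and Python threads are used only to show the partitioning, not to demonstrate a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def count_in_chunk(data, pattern, start, end):
    """Count matches whose first byte lies in [start, end)."""
    # Extend the window so a match that starts before `end` but finishes
    # after it is still visible; matches starting at or after `end` do not
    # fit in the window and are left to the next chunk.
    window_end = min(len(data), end + len(pattern) - 1)
    count = 0
    pos = data.find(pattern, start, window_end)
    while pos != -1:
        count += 1
        pos = data.find(pattern, pos + 1, window_end)
    return count


def parallel_count(data, pattern, workers=4):
    """Split data into roughly equal chunks and sum the per-chunk counts."""
    if not data or not pattern:
        return 0
    chunk = max(1, -(-len(data) // workers))  # ceiling division
    ranges = [(s, min(s + chunk, len(data)))
              for s in range(0, len(data), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda r: count_in_chunk(data, pattern, *r), ranges)
        return sum(counts)
```

Each worker returns its own count and the totals are summed at the end, so no shared counter needs locking during the search.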

3. Memory-Mapped I/O

Instead of traditional read operations, Krep:

Memory maps files for direct access by the CPU
Significantly reduces I/O overhead
Enables CPU cache optimization
Progressive prefetching for larger files
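
As a small illustration of searching a memory-mapped file, the sketch below uses Python's standard `mmap` module. It shows the general idea only (the operating system pages the file in on demand instead of the program copying it through read buffers); the function name is hypothetical and prefetching is not shown.

```python
import mmap
import os


def count_matches_mmap(path, pattern):
    """Count occurrences of `pattern` (bytes) in the file at `path`."""
    # An empty file cannot be mapped, so handle it before calling mmap.
    if not pattern or os.path.getsize(path) == 0:
        return 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ makes it read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            count = 0
            pos = mapped.find(pattern)
            while pos != -1:
                count += 1
                pos = mapped.find(pattern, pos + 1)
            return count
```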

4. Optimized Data Structures

Zero-copy architecture where possible
Efficient match position tracking
Lock-free aggregation of results

5. Skipping Non-Relevant Content

When using recursive search (-r), Krep automatically:

Skips common binary file types
Ignores version control directories (.git, .svn)
Bypasses dependency directories (node_modules, venv)
Detects binary content to avoid searching non-text files
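
The filtering steps above can be sketched in Python as shown below. The directory names come from the list above; the extension set, the 8 KiB probe size, and the NUL-byte heuristic for binary content are assumptions made for this example, since the rules krep actually applies are not spelled out here.

```python
import os

SKIP_DIRS = {".git", ".svn", "node_modules", "venv"}
# Hypothetical sample of "common binary file types".
BINARY_EXTENSIONS = {".png", ".jpg", ".zip", ".exe", ".o", ".so"}


def looks_binary(path, probe_size=8192):
    """Heuristic: treat a file as binary if its first bytes contain NUL."""
    with open(path, "rb") as f:
        return b"\0" in f.read(probe_size)


def iter_text_files(root):
    """Yield paths under `root` that a recursive search would scan."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
        for name in filenames:
            if os.path.splitext(name)[1].lower() in BINARY_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            try:
                if looks_binary(path):
                    continue
            except OSError:
                continue  # unreadable file: skip it
            yield path
```

Pruning whole directories before descending is the cheaper of the two checks, since it avoids opening any file inside them.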

Contributions are welcome! Please feel free to submit a Pull Request.

This project is licensed under the BSD 2-Clause License - see the LICENSE file for details.