Show HN: TurboQuant for vector search – 2-4 bit compression

Original link: https://github.com/RyanCodrai/py-turboquant

## TurboQuant: Fast, Data-Oblivious Vector Search

TurboQuant is an unofficial Rust implementation (with Python bindings via PyO3) of approximate nearest-neighbor search for high-dimensional vectors. It achieves substantial compression, down to 2-4 bits per coordinate, with minimal distortion, and **requires no training data**. This contrasts with methods such as FAISS product quantization, which need a training phase.

Benchmarks show TurboQuant matching FAISS on speed, and sometimes beating it, especially on ARM (Apple Silicon). At 4-bit quantization, TurboQuant consistently achieves higher recall than FAISS. Compression ratios are substantial (8-16x versus FP32).

TurboQuant works by normalizing vectors, applying a random rotation to make the coordinate distribution predictable, and then applying Lloyd-Max quantization. A key advantage is that new vectors can be added online without rebuilding the index. The speed comes from optimized SIMD kernels (NEON/AVX2) and efficient scoring techniques. The project consists of a core Rust crate (`turbovec`) and a Python wrapper (`turbovec-python`).

## TurboQuant Vector Search: Summary

A new implementation of Google's TurboQuant paper was shared on Hacker News, offering significant compression with minimal quality loss. The Python implementation, by justsomeguy1996, achieves up to 16x compression with 2-bit quantization while maintaining high recall (at d=1536: 0.862 recall@1 at 2-bit, 0.967 at 4-bit).

Importantly, TurboQuant is "data-oblivious": the compression codebook is derived mathematically, so no training or index rebuild is needed when new vectors are added.

Another user, pidtom, highlighted their related work, TurboQuant+ for llama.cpp, which extends these compression techniques to local LLMs, achieving a 28-42% reduction in model size with similar benefits. Both projects are open source and available on GitHub.

Original article

Rust implementation of TurboQuant for vector search, with Python bindings via PyO3.

Compresses high-dimensional vectors to 2-4 bits per coordinate with near-optimal distortion. Data-oblivious (no training), zero indexing time.

Unofficial implementation of TurboQuant (Google Research, ICLR 2026).

```python
from turbovec import TurboQuantIndex

index = TurboQuantIndex(dim=1536, bit_width=4)
index.add(vectors)
index.add(more_vectors)  # vectors can be added online, no rebuild needed

scores, indices = index.search(query, k=10)

index.write("my_index.tq")
loaded = TurboQuantIndex.load("my_index.tq")
```
```rust
use turbovec::TurboQuantIndex;

let mut index = TurboQuantIndex::new(1536, 4);
index.add(&vectors);
let results = index.search(&queries, 10);
index.write("index.tv").unwrap();
let loaded = TurboQuantIndex::load("index.tv").unwrap();
```

TurboQuant vs FAISS IndexPQFastScan on OpenAI DBpedia d=1536 (100K vectors, 1K queries, k=64). FAISS PQ configurations sized to match TurboQuant compression ratios.

TurboQuant requires zero training. FAISS PQ needs a training step (4-10 seconds). TurboQuant index build is 3-4x faster.

ARM (Apple Silicon M3 Max)

| Config | TQ speed | FAISS speed | Ratio | TQ recall@1 | FAISS recall@1 |
|---|---|---|---|---|---|
| 2-bit MT | 0.125ms/q | 0.128ms/q | 0.97x | 0.870 | 0.882 |
| 2-bit ST | 1.272ms/q | 1.247ms/q | 1.02x | 0.870 | 0.882 |
| 4-bit MT | 0.232ms/q | 0.246ms/q | 0.94x | 0.955 | 0.930 |
| 4-bit ST | 2.474ms/q | 2.485ms/q | 1.00x | 0.955 | 0.930 |

On ARM, TurboQuant matches or beats FAISS on speed while requiring no training step. At 4-bit, TurboQuant recall is higher than FAISS (0.955 vs 0.930).

x86 (Intel Sapphire Rapids, 4 vCPUs)

| Config | TQ speed | FAISS speed | Ratio | TQ recall@1 | FAISS recall@1 |
|---|---|---|---|---|---|
| 2-bit MT | 0.733ms/q | 0.590ms/q | 1.24x | 0.870 | 0.882 |
| 2-bit ST | 1.443ms/q | 1.208ms/q | 1.19x | 0.870 | 0.882 |
| 4-bit MT | 1.391ms/q | 1.181ms/q | 1.18x | 0.955 | 0.930 |
| 4-bit ST | 2.998ms/q | 2.477ms/q | 1.21x | 0.955 | 0.930 |

On x86, TurboQuant is within 18-24% of FAISS on speed. At 4-bit, TurboQuant recall is higher than FAISS (0.955 vs 0.930). The speed gap comes primarily from TurboQuant's rotation step (~5% of total time) and from differences in AVX2 code generation relative to FAISS's template-instantiated C++ kernels.

| Bit width | Index size (100K x 1536) | Compression vs FP32 |
|---|---|---|
| 2-bit | 37.0 MB | 15.8x |
| 4-bit | 73.6 MB | 8.0x |
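The index sizes above can be reproduced with simple arithmetic, assuming the index stores the bit-packed codes plus one FP32 norm per vector (the norm storage is described in the how-it-works section; the exact on-disk layout is an assumption here):

```python
# Sketch: reproduce the index-size table under the assumption that the
# index holds packed codes plus one FP32 norm per vector.
n, d = 100_000, 1536
fp32_bytes = n * d * 4  # uncompressed FP32 baseline

for bits in (2, 4):
    code_bytes = n * d * bits // 8  # bit-packed coordinates
    norm_bytes = n * 4              # one stored FP32 norm per vector
    total = code_bytes + norm_bytes
    print(f"{bits}-bit: {total / 2**20:.1f} MiB, {fp32_bytes / total:.1f}x")
# -> 2-bit: 37.0 MiB, 15.8x
# -> 4-bit: 73.6 MiB, 8.0x
```

The computed figures match the table, which suggests the reported sizes are in MiB.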

Reproducing Section 4.4 of the paper. recall@k = the probability that the true nearest neighbor appears in the top-k results. Benchmarked on Apple M3 Max.
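As a sketch of how that metric is computed (the function name and array shapes here are illustrative, not from the library):

```python
import numpy as np

def recall_at_k(true_nn, approx_topk):
    """Fraction of queries whose exact nearest neighbor appears among the
    top-k approximate results. true_nn: (n_queries,) exact NN ids;
    approx_topk: (n_queries, k) approximate result ids."""
    return float(np.mean([t in set(row) for t, row in zip(true_nn, approx_topk)]))

# Toy example: 3 queries, k=2; two of the three exact neighbors recovered.
true_nn = np.array([5, 7, 9])
approx = np.array([[5, 1], [2, 3], [9, 4]])
print(recall_at_k(true_nn, approx))  # 2/3
```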

GloVe d=200 (100K database vectors, 10K queries)

Recall:

| k | 2-bit | 4-bit |
|---|---|---|
| 1 | 0.505 | 0.812 |
| 2 | 0.658 | 0.932 |
| 4 | 0.789 | 0.986 |
| 8 | 0.879 | 0.998 |
| 16 | 0.943 | 1.000 |
| 32 | 0.976 | 1.000 |
| 64 | 0.992 | 1.000 |

Search latency:

| Mode | 2-bit | 4-bit |
|---|---|---|
| MT | 0.029ms/q | 0.040ms/q |
| ST | 0.275ms/q | 0.429ms/q |

Compression:

| Bit width | Index size | Compression vs FP32 |
|---|---|---|
| 2-bit | 5.1 MB | 14.8x |
| 4-bit | 9.9 MB | 7.7x |

OpenAI DBpedia d=1536 (100K database vectors, 1K queries)

Recall:

| k | 2-bit | 4-bit |
|---|---|---|
| 1 | 0.870 | 0.955 |
| 2 | 0.961 | 0.996 |
| 4 | 0.998 | 1.000 |
| 8 | 1.000 | 1.000 |
| 16 | 1.000 | 1.000 |
| 32 | 1.000 | 1.000 |
| 64 | 1.000 | 1.000 |

Search latency:

| Mode | 2-bit | 4-bit |
|---|---|---|
| MT | 0.138ms/q | 0.256ms/q |
| ST | 1.448ms/q | 2.784ms/q |

Compression:

| Bit width | Index size | Compression vs FP32 |
|---|---|---|
| 2-bit | 37.0 MB | 15.8x |
| 4-bit | 73.6 MB | 8.0x |

OpenAI DBpedia d=3072 (100K database vectors, 1K queries)

Recall:

| k | 2-bit | 4-bit |
|---|---|---|
| 1 | 0.912 | 0.967 |
| 2 | 0.986 | 0.997 |
| 4 | 1.000 | 1.000 |
| 8 | 1.000 | 1.000 |
| 16 | 1.000 | 1.000 |
| 32 | 1.000 | 1.000 |
| 64 | 1.000 | 1.000 |

Search latency:

| Mode | 2-bit | 4-bit |
|---|---|---|
| MT | 0.395ms/q | 0.604ms/q |
| ST | 3.240ms/q | 5.711ms/q |

Compression:

| Bit width | Index size | Compression vs FP32 |
|---|---|---|
| 2-bit | 73.6 MB | 15.9x |
| 4-bit | 146.9 MB | 8.0x |

Each vector is a direction on a high-dimensional hypersphere. TurboQuant compresses these directions using a simple insight: after applying a random rotation, every coordinate follows a known distribution -- regardless of the input data.

1. Normalize. Strip the length (norm) from each vector and store it as a single float. Now every vector is a unit direction on the hypersphere.

2. Random rotation. Multiply all vectors by the same random orthogonal matrix. After rotation, each coordinate independently follows a Beta distribution that converges to Gaussian N(0, 1/d) in high dimensions. This holds for any input data -- the rotation makes the coordinate distribution predictable.

3. Lloyd-Max scalar quantization. Since the distribution is known, we can precompute the optimal way to bucket each coordinate. For 2-bit, that's 4 buckets; for 4-bit, 16 buckets. The Lloyd-Max algorithm finds bucket boundaries and centroids that minimize mean squared error. These are computed once from the math, not from the data.

4. Bit-pack. Each coordinate is now a small integer (0-3 for 2-bit, 0-15 for 4-bit). Pack these tightly into bytes. A 1536-dim vector goes from 6,144 bytes (FP32) to 384 bytes (2-bit). That's 16x compression.
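The four steps above can be sketched in NumPy. This is a minimal illustration of the scheme, not the library's kernels; the Lloyd-Max codebook is fitted here by running Lloyd's iteration on samples of the known N(0, 1/d) distribution, so it depends only on d and the bit width, never on the data:

```python
import numpy as np

rng = np.random.default_rng(0)
d, bits = 64, 2
levels = 1 << bits  # 4 buckets for 2-bit

# 1. Normalize: store the norm, keep the unit direction.
v = rng.standard_normal(d)
norm = np.linalg.norm(v)
unit = v / norm

# 2. Random rotation: a random orthogonal matrix (QR of a Gaussian matrix).
q, _ = np.linalg.qr(rng.standard_normal((d, d)))
rotated = q @ unit  # each coordinate now ~ N(0, 1/d), whatever the input

# 3. Lloyd-Max: fit bucket centroids to the KNOWN distribution, once,
#    via Lloyd's iteration on Gaussian samples -- never on the data.
samples = rng.normal(0.0, 1.0 / np.sqrt(d), 200_000)
centroids = np.quantile(samples, (np.arange(levels) + 0.5) / levels)
for _ in range(50):
    bounds = (centroids[:-1] + centroids[1:]) / 2  # decision boundaries
    buckets = np.searchsorted(bounds, samples)
    centroids = np.array([samples[buckets == i].mean() for i in range(levels)])

# 4. Quantize and bit-pack: 2-bit codes, four per byte.
codes = np.searchsorted((centroids[:-1] + centroids[1:]) / 2, rotated).astype(np.uint8)
packed = (codes[0::4] << 6) | (codes[1::4] << 4) | (codes[2::4] << 2) | codes[3::4]
print(packed.nbytes, "bytes vs", d * 4, "for FP32")  # 16 vs 256: 16x
```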

Search. Instead of decompressing every database vector, we rotate the query once into the same domain and score directly against the codebook values. The scoring kernel uses SIMD intrinsics (NEON on ARM, AVX2 on x86) with nibble-split lookup tables for maximum throughput.
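The scoring idea, stripped of SIMD, looks like this (an illustrative sketch with made-up variable names, not the library's kernel): rotate the query once, then a per-coordinate table of query-times-centroid products replaces decompression entirely.

```python
import numpy as np

# Asymmetric-distance scoring sketch. `rotated_query` is the rotated query,
# `centroids` the Lloyd-Max codebook, `codes` one stored vector's codes
# (all names here are for illustration only).
rng = np.random.default_rng(1)
d, levels = 8, 4
centroids = np.array([-1.5, -0.5, 0.5, 1.5])  # toy 2-bit codebook
codes = rng.integers(0, levels, d)            # one stored vector's codes
rotated_query = rng.standard_normal(d)

lut = np.outer(rotated_query, centroids)      # shape (d, levels)
score = lut[np.arange(d), codes].sum()        # ~ <query, vector>, no decode

# Same result as explicitly decompressing the vector first:
assert np.isclose(score, rotated_query @ centroids[codes])
```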

The paper proves this achieves distortion within a factor of 2.7x of the information-theoretic lower bound (Shannon's distortion-rate limit). You cannot do much better for a given number of bits.

Online by design. Because the codebook and rotation are derived from math (not from the data), new vectors can be added at any time without rebuilding the index. Traditional methods like Product Quantization require expensive offline codebook training that must be re-run when data changes.

The project is a Cargo workspace with two crates:

  • turbovec -- pure Rust crate with zero Python dependency. Contains all SIMD kernels, encoding, and search logic.
  • turbovec-python -- thin PyO3 wrapper exposing TurboQuantIndex to Python.

Kernel and rotation details:

  • NEON kernel (ARM): Sequential code layout. vqtbl1q_u8 shuffle-based LUT scoring with vaddw_u8 widening accumulation. Per-block heap with QBS=4 query batching.
  • AVX2 kernel (x86): FAISS-style perm0-interleaved code layout to work around AVX2's cross-lane shuffle constraint. vpshufb with uint16 reinterpret trick and combine2x2 flush. Multi-query NQ=4 scoring with fused heap and SIMD early-rejection.
  • Rotation: BLAS-accelerated matrix multiplication via ndarray. Uses Accelerate (macOS) or OpenBLAS (Linux).
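The nibble-split trick behind the shuffle-based LUT kernels can be shown in scalar form (a toy sketch; the real kernels perform 16 such lookups per vqtbl1q_u8 / vpshufb instruction):

```python
import numpy as np

# Two 4-bit codes are packed per byte. Splitting each byte into its low
# and high nibble lets one 16-entry table score both codes per lookup.
lut = np.arange(16, dtype=np.uint16)             # toy 16-entry score table
packed = np.array([0x21, 0x43], dtype=np.uint8)  # codes 1,2 and 3,4 packed
lo = packed & 0x0F                               # low-nibble codes:  [1, 3]
hi = packed >> 4                                 # high-nibble codes: [2, 4]
score = int(lut[lo].sum() + lut[hi].sum())
print(score)  # 1 + 3 + 2 + 4 = 10
```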
```shell
pip install maturin
cd turbovec-python
RUSTFLAGS="-C target-cpu=native" maturin build --release
pip install target/wheels/*.whl
```

Download datasets:

```shell
python3 benchmarks/benchmark.py download glove openai-1536 openai-3072
```

Run benchmarks:

```shell
python3 benchmarks/benchmark.py glove openai-1536 openai-3072
```