深入 FAISS：十亿级相似度搜索

深入 FAISS：十亿级相似度搜索
Inside FAISS: Billion-Scale Similarity Search

原始链接: https://fremaconsulting.ch/blog/faiss

为了压缩高维向量（例如 128 维的 SIFT 描述符），乘积量化（Product Quantization, PQ）提供了一种强大的替代方案，避免了存储未压缩数据时可能耗费的数百 GB 内存。类似于 GIF 的调色板——将 24 位像素映射到 256 色索引，PQ 通过使用来自“码本（codebook）”的代表性质心索引来替换向量。虽然使用单一的扁平码本来实现高压缩率（例如 64 位编码）需要天文数字般的质心数量，这会超出存储限制和训练数据的可用性，但 PQ 通过一种分解技巧解决了这个问题。通过将高维向量拆分为 $m$ 个较小的子向量，并为每个子向量分配一个小型的专属码本，PQ 在仅占用极小、可缓存内存的情况下，实现了巨大的有效搜索空间（例如 $256^8$ 种组合）。这使得压缩比达到 64 倍，将包含十亿级 SIFT 描述符的海量数据集压缩至可管理的范围内，同时还能保持有效的距离估算能力。

```Hacker News 最新 | 往期 | 评论 | 提问 | 展示 | 招聘 | 提交登录深入了解 FAISS：十亿级相似度搜索 (fremaconsulting.ch) 12 分，tohms 发布于 1 小时前 | 隐藏 | 往期 | 收藏 | 讨论我是作者。我写这篇文章是为了作为 2017 年 FAISS 论文 (https://arxiv.org/abs/1702.08734) 的视觉辅助材料，重点关注那些我觉得仅通过文字难以理解的部分。本文涵盖了 FAISS 功能的一个子集，以原论文作为权威参考。NSG、FastScan 和 IMI 未包含在内，它们将会有各自独立的文章。我特别希望能收到关于以下方面的反馈： - IVFPQ / IVFADC 的解释，特别是关于 LUT 重用的论点 - GPU 部分是否充分体现了实际的复杂度乐于回答相关问题。帮助指南 | 常见问题 | 列表 | API | 安全 | 法律 | 申请 YC | 联系搜索：```

IVF makes search fast by skipping most of the database, but leaves every vector uncompressed. One billion SIFT descriptors still cost 512 GB of RAM. Product Quantization (PQ), introduced by Jégou, Douze and Schmid (2011), is the compression trick FAISS builds on to shrink each vector to 8 bytes while keeping distance estimates meaningful.

Same centroids, a different job

§4 used centroids to partition the search space. PQ uses them to compresseach vector. Same K-Means math, new purpose: the centroid's index becomes the vector. If your codebook has centroids, any vector collapses to one integer in

A pixel-sized analogy

A 24-bit RGB pixel can be any of 16.7 million colors. A GIF gives up some of that range: it picks a 256-color palette and replaces every pixel with an 8-bit index into it. Storage drops 3×, the picture still looks like the picture. That palette is a codebook; the index is a code.

A 128-D SIFT descriptor is just a very long pixel. We want the same trick: find a palette of representative vectors, replace every descriptor with its nearest palette index.

How many bits does an index cost?

To label centroids uniquely you need enough bits to distinguish all of them. Each bit doubles the number of labels, so the code length is $bits. Concretely:$

$centroids \to$
$centroids \to$

More centroids means the quantizer discriminates finer detail, and the code grows accordingly. The codebook designer is playing a bit-budget game.

The paper's aggressive target

Jégou et al. want each 128-D SIFT descriptor to become a 64-bit code, which is 0.5 bit per dimension, a 64× compression ratio against the 512-byte original. Half a bit per dimension is ambitious: it means the quantizer has to encode all the variation along each axis with a single binary-ish choice.

Hitting 64 bits with one flat codebook means picking

Flat codebook feasibility

Pick a vector dimension and a bit budget per dimension to see the centroid count and storage footprint a single flat quantizer would need.

Vector dimension (D)

Bits per dimension

Total bits per code64bits

Centroids needed (k = 2^64)1.84 × 10^19centroids

Codebook storage9.44ZB

Impossible

Bigger than all cloud storage on Earth. Lloyd's algorithm would need roughly 5.53 × 10^20 training samples.

Why 2⁶⁴ is not a codebook

Three walls hit simultaneously when gets that large:

Storage.Each centroid is 128 floats × 4 bytes = 512 bytes. Times $centroids gives about 9.4 zettabytes, more than all cloud storage on Earth in 2025. You cannot persist the codebook, let alone page through it at query time.$
Training data.Lloyd's algorithm needs at least tens of samples per centroid to converge. That is
Query cost.To encode one descriptor you compare it to every centroid. 18 quintillion distance computations per vector is not a latency you can ship.

The flat codebook dies on all three fronts. The paper's move is to keep the effective vocabulary the same size while shrinking the stored codebook by many orders of magnitude.

The factoring trick

Back to the GIF analogy. Instead of finding one 256-color palette that works for the whole image, cut the image into 8 tiles and give each tile its own 256-color palette. Each tile is now one byte; the image is 8 bytes; every tile had access to a full 256-color palette tuned to its own pixels. PQ does exactly this for vectors.

Formally: split into sub-vectors

With $and$

Each encoded vector fits in

The walkthrough below builds the full index on the paper's reference SIFT setting (