Extending That XOR Trick to Billions of Rows

Original link: https://nochlin.com/blog/extending-that-xor-trick

## Invertible Bloom Filters: Extending the XOR Trick

The classic XOR trick is effective for finding missing numbers in a list, but it struggles with large datasets or large numbers of missing values. Invertible Bloom Filters (IBFs) extend the idea to scenarios such as identifying thousands of missing IDs in a billion-row table, with space complexity that depends only on the size of the difference between the sets, a significant improvement. IBFs build on Bloom filters, adding exact retrieval and the ability to list differences. The core idea uses partitioning and accumulators (XOR aggregates of IDs, hashes, and counts) to identify the symmetric difference (the elements present in only one of the two sets). Naive partitioning can be unreliable, so IBFs use a more sophisticated hashing scheme and a "peeling" algorithm to recover the difference iteratively and accurately. An IBF consists of cells holding these accumulators and operates through encoding, subtraction (to find the difference), and decoding. IBFs are a powerful solution to the "set reconciliation" problem, comparing sets efficiently without transferring their full contents, and offer a more space-efficient alternative to traditional approaches. A basic Python implementation is available for exploring this fascinating data structure.

## Extending the XOR Trick to Billions of Rows: Summary

A recent article explores extending the clever "XOR trick" to finding differences in large datasets. The original trick uses XOR to efficiently identify missing numbers in a sequence. The extension aims to scale the method to billions of rows, but it introduces probabilistic elements. While the original XOR trick is deterministic, extending it requires probabilistic algorithms such as Bloom filters or Invertible Bloom Filters (IBFs). These methods are not guaranteed to succeed, but failures are detectable, allowing a retry with adjusted parameters. Alternatives such as Minisketch offer guaranteed recovery for smaller differences with optimal space efficiency, but with quadratic decoding complexity. Hybrid approaches that combine IBFs and Minisketch are also possible. The discussion highlights the trade-offs between space efficiency, decoding speed, and success guarantees. The core idea involves partitioning the data and iteratively "peeling" off differences, but preserving both an absolute guarantee *and* linear-time performance remains a challenge. The simplicity of the original XOR trick is lost in these extensions, but its potential for large-scale difference detection remains compelling.

## Original Article

Can we extend the XOR trick for finding one or two missing numbers in a list to finding thousands of missing IDs in a billion-row table?

Yes, we can! This is possible using a data structure called an Invertible Bloom Filter (IBF) that compares two sets with space complexity based only on the size of the difference. Using a generalization of the XOR trick [1], all the values that are identical cancel out, so the size of this data structure depends only on the size of the difference.

Most explanations of Invertible Bloom Filters start with standard Bloom filters, which support two operations: insert and maybeContains. Next, they extend to counting Bloom filters, which enables a delete operation. Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation. In this article, I will take a different approach and build up to an IBF from the XOR trick.

IBFs have remained relatively obscure in the software development community while the XOR trick is a well-known technique thanks to leetcode. My goal with this article is to connect IBFs to the XOR trick so that more developers understand this fascinating and powerful data structure.

Finding 3 Missing Numbers

Let's start with a concrete example:

A = [1,2,3,4,5,6,7,8,9,10]
B = [1,2,4,5,7,8,10]

Set B is missing 3 numbers: {3, 6, 9}.

The classic XOR trick works for finding 1 or 2 missing numbers, but what about 3 or more? If we don't know how many numbers are missing, we can use a count field to detect when the usual XOR trick will fail:

XOR(A) = 1 ^ 2 ^ ... ^ 10 = 11
COUNT(A) = 10
COUNT(B) = 7

Since B has 7 elements and A has 10, we know 3 numbers are missing—too many for the basic XOR trick.
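A minimal Python sketch of this check (the variable names are mine, purely for illustration):

from functools import reduce

A = list(range(1, 11))        # [1, 2, ..., 10]
B = [1, 2, 4, 5, 7, 8, 10]

xor_a = reduce(lambda acc, x: acc ^ x, A, 0)   # 11
xor_b = reduce(lambda acc, x: acc ^ x, B, 0)
missing = len(A) - len(B)                      # COUNT(A) - COUNT(B) = 3

if missing == 1:
    print("missing number:", xor_a ^ xor_b)    # the basic XOR trick applies
else:
    print(missing, "numbers are missing; the basic XOR trick is not enough")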

To work around this limitation, we can partition our sets using a hash function. Let's try with 3 partitions:


H(1) = 2, H(2) = 1, H(3) = 1, H(4) = 0, H(5) = 0
H(6) = 0, H(7) = 2, H(8) = 1, H(9) = 2, H(10) = 1


A_0 = [4,5,6]     B_0 = [4,5]
A_1 = [2,3,8,10]  B_1 = [2,8,10]
A_2 = [1,7,9]     B_2 = [1,7]

Now each partition has at most 1 missing element, so we can apply the XOR trick to each:

XOR(B_0) ^ XOR(A_0) = (4 ^ 5) ^ (4 ^ 5 ^ 6) = 6
XOR(B_1) ^ XOR(A_1) = (2 ^ 8 ^ 10) ^ (2 ^ 3 ^ 8 ^ 10) = 3
XOR(B_2) ^ XOR(A_2) = (1 ^ 7) ^ (1 ^ 7 ^ 9) = 9

Success! We recovered all 3 missing elements: {3, 6, 9}.
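Here is a small Python sketch of this naive partitioned approach, reusing the hash values from above (the function and variable names are mine, not the article's):

def recover_with_partitions(A, B, num_partitions, H):
    # One XOR accumulator and one count accumulator per partition.
    xor_diff = [0] * num_partitions
    count_diff = [0] * num_partitions
    for x in A:
        xor_diff[H[x]] ^= x
        count_diff[H[x]] += 1
    for x in B:
        xor_diff[H[x]] ^= x
        count_diff[H[x]] -= 1
    # A partition is recoverable only if it holds exactly one missing element.
    return [xor_diff[i] for i in range(num_partitions) if count_diff[i] == 1]

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 2, 4, 5, 7, 8, 10]
H = {1: 2, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 2, 8: 1, 9: 2, 10: 1}
print(recover_with_partitions(A, B, 3, H))   # [6, 3, 9]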

Unfortunately, this approach alone doesn't work in practice. Here, we got lucky with our hash function. Usually, elements will not distribute evenly across partitions, and some partitions might still have multiple differences.

Detecting when the XOR trick fails

To fix the problems with naive partitioning, we will need to use a more sophisticated approach.

Let's generalize beyond "missing numbers" to the concept of symmetric difference. The symmetric difference of two sets A and B is the set of elements that are in either set but not in both: (A \ B) ∪ (B \ A), often written as A Δ B.

Figure 1: Venn diagram of the symmetric difference. A Δ B includes elements that are in either set A or set B, but not in both sets.

Consider this example:

A = [1,2,3,4,6]
B = [1,2,4,5]

The symmetric difference is {3, 6, 5}—elements 3 and 6 are only in A, while element 5 is only in B.

The naive approach of checking |COUNT(A) - COUNT(B)| = 1 fails here because the count difference is 1, but the symmetric difference has 3 elements. Applying the basic XOR trick gives a meaningless result.

To detect this case, we can use an additional accumulator with a hash function:

COUNT(A) = 5, XOR(A) = 2
COUNT(B) = 4, XOR(B) = 2
XOR_HASH(A) = H(1) ^ H(2) ^ H(3) ^ H(4) ^ H(6)
XOR_HASH(B) = H(1) ^ H(2) ^ H(4) ^ H(5)

XOR(B) ^ XOR(A) = 0
XOR_HASH(B) ^ XOR_HASH(A) = H(3) ^ H(5) ^ H(6) 
H(3) ^ H(5) ^ H(6) ≠ H(0)

While we can't yet recover the full symmetric difference, we can detect when the XOR result is unreliable: if exactly one element differed, the XOR_HASH difference would equal the hash of the XOR difference, but here H(3) ^ H(5) ^ H(6) ≠ H(0) with high probability.
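In code, the consistency check looks roughly like this (a sketch with illustrative names; any reasonably uniform hash function works for H):

import hashlib

def H(x):
    # Illustrative hash; any well-mixed hash of the element works.
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

def xor_all(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc

A = [1, 2, 3, 4, 6]
B = [1, 2, 4, 5]

id_diff = xor_all(A) ^ xor_all(B)                   # 0 here, which is misleading
hash_diff = xor_all(map(H, A)) ^ xor_all(map(H, B))
count_diff = len(A) - len(B)                        # 1 here, also misleading

# Trust the single-difference result only if all accumulators agree.
if count_diff in (1, -1) and hash_diff == H(id_diff):
    print("single differing element:", id_diff)
else:
    print("accumulators disagree: more than one element differs")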

Invertible Bloom Filters

The Core Idea

The original XOR trick relies on an accumulator that stores the XOR aggregate of a list. Building on this, we've introduced:

  1. Partitioning to divide sets into smaller parts
  2. Additional accumulators that store the element count and the XOR aggregate of element hashes

To fully generalize this into a robust data structure, we need:

  1. A partitioning scheme that creates recoverable partitions with high probability
  2. An iterative process that uses recovered values to unlock additional partitions

This is exactly what Invertible Bloom Filters [2] provide. IBFs use a Bloom filter-style hashing scheme to assign elements to multiple partitions, then employ a graph algorithm called "peeling" to iteratively recover the symmetric difference with very high probability [3].

Structure

An IBF consists of an array of cells, where each cell contains three accumulators:

  • idSum: XOR aggregate of element values
  • hashSum: XOR aggregate of element hashes
  • count: number of elements in the cell

Figure 2: Figure from [2] showing the process of encoding elements into IBF cells.
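To make this concrete, here is a simplified Python sketch of the cells and the encode step. It is my own illustration with made-up names (Cell, SketchIBF), not the implementation linked below; a real IBF would use carefully chosen, independent hash functions.

import hashlib

def H(x, salt=""):
    # Illustrative hash; any well-mixed hash of the element works.
    data = f"{salt}:{x}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class Cell:
    def __init__(self):
        self.id_sum = 0     # XOR aggregate of element values
        self.hash_sum = 0   # XOR aggregate of element hashes
        self.count = 0      # number of elements mapped to this cell

class SketchIBF:
    def __init__(self, m, k):
        self.m, self.k = m, k                    # m cells, k index hashes
        self.cells = [Cell() for _ in range(m)]

    def _indices(self, x):
        # Bloom-filter-style hashing: each element is assigned to k cells.
        # A set is used so an element never hits the same cell twice,
        # which would make it XOR-cancel against itself.
        return {H(x, salt=str(i)) % self.m for i in range(self.k)}

    def insert(self, x):
        for i in self._indices(x):
            c = self.cells[i]
            c.id_sum ^= x
            c.hash_sum ^= H(x)
            c.count += 1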

Operations

For computing symmetric differences, IBFs support three key operations:

  1. Encode: Build an IBF from a set of values
  2. Subtract: Subtract one IBF from another—identical values cancel out, leaving only the symmetric difference
  3. Decode: Recover stored values by finding "pure" cells (count = ±1 and H(idSum) = hashSum) and iteratively "peeling" them

The IBF has one parameter d—the expected size of the symmetric difference. With proper sizing (typically m > 1.22d cells), IBFs recover the full symmetric difference with very high probability.

Figures: the Encode, Subtract, and Decode algorithms for each IBF operation, from [2].
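Continuing the sketch above, the subtract and decode operations might look roughly like this, written as additional methods of the SketchIBF class (illustrative only; decode mutates the filter in place, and a production decoder needs more care, see the linked implementation below):

    def subtract(self, other):
        # Cell-by-cell subtraction: elements present in both filters cancel,
        # leaving only the symmetric difference encoded in the cells.
        result = SketchIBF(self.m, self.k)
        for r, a, b in zip(result.cells, self.cells, other.cells):
            r.id_sum = a.id_sum ^ b.id_sum
            r.hash_sum = a.hash_sum ^ b.hash_sum
            r.count = a.count - b.count
        return result

    def decode(self):
        # A cell is "pure" when it holds exactly one element of the difference:
        # count is +1 or -1 and the hash accumulator matches the id accumulator.
        only_in_a, only_in_b = set(), set()
        progress = True
        while progress:
            progress = False
            for c in self.cells:
                if c.count in (1, -1) and c.hash_sum == H(c.id_sum):
                    x, sign = c.id_sum, c.count
                    (only_in_a if sign == 1 else only_in_b).add(x)
                    # "Peel" x out of every cell it was encoded into;
                    # this may turn other cells pure on the next pass.
                    for i in self._indices(x):
                        cell = self.cells[i]
                        cell.id_sum ^= x
                        cell.hash_sum ^= H(x)
                        cell.count -= sign
                    progress = True
        # Decoding succeeded only if every cell has been emptied.
        success = all(c.count == 0 and c.id_sum == 0 for c in self.cells)
        return success, only_in_a, only_in_b

Encoding two sets into filters of the same size with the same hash functions, subtracting, and decoding then mirrors the usage example shown below.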

Example Implementation

I have not found a solid, maintained library of IBFs in any language (if you know of any please let me know!), so I created a basic implementation in Python if you would like to play with it: https://gist.github.com/hundredwatt/a1e69ff300de941041d824e49249d3d7

Example usage:


# Two sets that differ in two elements: 2 is only in a, 3 is only in b
a = [1, 2, 4, 5, 6, 7, 9, 10]
b = [1, 3, 4, 5, 6, 7, 9, 10]
set_a = set(a)
set_b = set(b)
ibf_size = 100
k = 4

# Encode each set into its own IBF
ibf_a = IBF(size=ibf_size, k=k)
for item in a:
    ibf_a.insert(item)

ibf_b = IBF(size=ibf_size, k=k)
for item in b:
    ibf_b.insert(item)

# Subtract: identical elements cancel, leaving only the symmetric difference
diff_ibf = ibf_a.subtract_ibf(ibf_b)

# Decode: recover the elements on each side of the difference
success, decoded_added, decoded_removed = diff_ibf.decode()

assert(success)
expected_added = set_a - set_b
expected_removed = set_b - set_a

print("--- Verification ---")
print(f"Expected added (in A, not B):   {len(expected_added)} items")
print(f"Decoded added:                  {len(decoded_added)} items")
assert(expected_added == decoded_added)

print(f"Expected removed (in B, not A): {len(expected_removed)} items")
print(f"Decoded removed:                {len(decoded_removed)} items")
assert(expected_removed == decoded_removed)

About the "Set Reconciliation" Problem

The general problem of efficiently comparing two sets without transferring their entire contents is called "set reconciliation." Early research used polynomial approaches [4], while modern solutions typically employ either Invertible Bloom Filters or error-correction codes [5].

Conclusion

IBFs are a powerful tool for comparing sets with space complexity based only on the size of the difference. I hope this article has helped you understand how IBFs work and how they can be used to solve real-world problems.

If you want to learn more about IBFs, I recommend reading the references below. The "Further Reading" section contains some additional, advanced papers that I found interesting.

If you have any questions or feedback, please feel free to reach out to me, contact info is in the footer.

References

[1] Florian Hartmann, "That XOR Trick" (2020), https://florian.github.io/xor-trick/

[2] David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese, "What's the Difference? Efficient Set Reconciliation without Prior Context," SIGCOMM (2011), https://ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf

[3] Michael Molloy, "Cores in Random Hypergraphs and Boolean Formulas," Random Structures & Algorithms (2003), https://utoronto.scholaris.ca/server/api/core/bitstreams/371257de-0a44-4702-afeb-542ae9a06986/content

[4] Yaron Minsky and Ari Trachtenberg, "Set Reconciliation with Nearly Optimal Communication Complexity" (2000), https://ecommons.cornell.edu/server/api/core/bitstreams/c3fff828-cfb8-416a-a28b-8afa59dd2d73/content

[5] Pengyu Gong, Lei Yang, and Liwei Wang, "Space- and Computationally-Efficient Set Reconciliation via Parity Bitmap Sketch," VLDB (2021), https://vldb.org/pvldb/vol14/p458-gong.pdf

Further Reading

This paper extends the original IBF paper to support additional encoded fields (not just a single idSum) and a more robust decoding algorithm that I found to be necessary for production use.

This paper shows how to extend IBF-based reconciliation to efficiently handle three or more parties. The implementation is neat: it encodes the IBFs over finite fields whose dimension equals the number of parties.

This paper reduces the size of the IBF by eliminating the hashSum field. Instead, it uses a hash function that calculates the cell index as a checksum and has some additional checks in the decoding process to deal with potential "anomalies" that show up in this approach.

Standard IBFs require knowing the expected size of the symmetric difference beforehand to size the filter correctly. This recent work introduces "rateless" IBFs that adapt their size dynamically, making them more practical when the difference size is unknown or highly variable.
