Extending That XOR Trick to Billions of Rows

Original link: https://nochlin.com/blog/extending-that-xor-trick

## Invertible Bloom Filters: Extending the XOR Trick

The classic XOR trick is effective for finding missing numbers in a list, but it struggles with large datasets or large numbers of missing values. Invertible Bloom Filters (IBFs) extend the idea to scenarios such as identifying thousands of missing IDs in a billion-row table, with space complexity that depends only on the size of the difference between the sets, a significant improvement. IBFs build on Bloom filters, adding exact retrieval and the ability to list differences. The core idea uses partitioning and accumulators (XOR aggregates of IDs, hashes, and counts) to identify the symmetric difference (the elements present in only one of the two sets). Naive partitioning can be unreliable, so IBFs use a more sophisticated hashing scheme and a "peeling" algorithm to recover the difference iteratively and accurately. An IBF consists of cells holding these accumulators and operates through encoding, subtraction (to find the difference), and decoding. IBFs are a powerful solution to the "set reconciliation" problem, comparing sets efficiently without transferring their full contents, and offer a more space-efficient alternative to traditional approaches. A basic Python implementation is available for exploring this fascinating data structure.

## Extending the XOR Trick to Billions of Rows: Summary

A recent article explores extending the clever "XOR trick" to finding differences in large datasets. The original trick uses XOR to efficiently identify missing numbers in a sequence. The extension aims to scale the method to billions of rows, but it introduces probabilistic elements. While the original XOR trick is deterministic, extending it requires probabilistic algorithms such as Bloom filters or Invertible Bloom Filters (IBFs). These methods are not guaranteed to succeed, but failures are detectable, allowing a retry with adjusted parameters. Alternatives such as Minisketch offer guaranteed recovery for smaller differences with optimal space efficiency, but with quadratic decoding complexity. Hybrid approaches that combine IBFs and Minisketch are also possible. The discussion highlights the trade-offs between space efficiency, decoding speed, and success guarantees. The core idea involves partitioning the data and iteratively "peeling" off differences, but preserving both an absolute guarantee *and* linear-time performance remains a challenge. The simplicity of the original XOR trick is lost in these extensions, but its potential for large-scale difference detection remains compelling.

## Original Article

Can we extend the XOR trick for finding one or two missing numbers in a list to finding thousands of missing IDs in a billion-row table?

Yes, we can! This is possible using a data structure called an Invertible Bloom Filter (IBF) that compares two sets with space complexity based only on the size of the difference. Using a generalization of the XOR trick [1], all the values that are identical cancel out, so the size of this data structure depends only on the size of the difference.

Most explanations of Invertible Bloom Filters start with standard Bloom filters, which support two operations: insert and maybeContains. Next, they extend to counting Bloom filters, which enables a delete operation. Finally, they introduce Invertible Bloom Filters, which add an exact get operation and a probabilistic listing operation. In this article, I will take a different approach and build up to an IBF from the XOR trick.

IBFs have remained relatively obscure in the software development community while the XOR trick is a well-known technique thanks to leetcode. My goal with this article is to connect IBFs to the XOR trick so that more developers understand this fascinating and powerful data structure.

Finding 3 Missing Numbers

Let's start with a concrete example:

A = [1,2,3,4,5,6,7,8,9,10]
B = [1,2,4,5,7,8,10]

Set B is missing 3 numbers: {3, 6, 9}.

The classic XOR trick works for finding 1 or 2 missing numbers, but what about 3 or more? If we don't know how many numbers are missing, we can use a count field to detect when the usual XOR trick will fail:

XOR(A) = 1 ^ 2 ^ ... ^ 10 = 11
COUNT(A) = 10
COUNT(B) = 7

Since B has 7 elements and A has 10, we know 3 numbers are missing—too many for the basic XOR trick.
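A minimal Python sketch of this check (the variable names are mine, purely for illustration):

from functools import reduce

A = list(range(1, 11))        # [1, 2, ..., 10]
B = [1, 2, 4, 5, 7, 8, 10]

xor_a = reduce(lambda acc, x: acc ^ x, A, 0)   # 11
xor_b = reduce(lambda acc, x: acc ^ x, B, 0)
missing = len(A) - len(B)                      # COUNT(A) - COUNT(B) = 3

if missing == 1:
    print("missing number:", xor_a ^ xor_b)    # the basic XOR trick applies
else:
    print(missing, "numbers are missing; the basic XOR trick is not enough")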

To work around this limitation, we can partition our sets using a hash function. Let's try with 3 partitions:


H(1) = 2, H(2) = 1, H(3) = 1, H(4) = 0, H(5) = 0
H(6) = 0, H(7) = 2, H(8) = 1, H(9) = 2, H(10) = 1


A_0 = [4,5,6]     B_0 = [4,5]
A_1 = [2,3,8,10]  B_1 = [2,8,10]
A_2 = [1,7,9]     B_2 = [1,7]

Now each partition has at most 1 missing element, so we can apply the XOR trick to each:

XOR(B_0) ^ XOR(A_0) = (4 ^ 5) ^ (4 ^ 5 ^ 6) = 6
XOR(B_1) ^ XOR(A_1) = (2 ^ 8 ^ 10) ^ (2 ^ 3 ^ 8 ^ 10) = 3
XOR(B_2) ^ XOR(A_2) = (1 ^ 7) ^ (1 ^ 7 ^ 9) = 9

Success! We recovered all 3 missing elements: {3, 6, 9}.
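Here is a small Python sketch of this naive partitioned approach, reusing the hash values from above (the function and variable names are mine, not the article's):

def recover_with_partitions(A, B, num_partitions, H):
    # One XOR accumulator and one count accumulator per partition.
    xor_diff = [0] * num_partitions
    count_diff = [0] * num_partitions
    for x in A:
        xor_diff[H[x]] ^= x
        count_diff[H[x]] += 1
    for x in B:
        xor_diff[H[x]] ^= x
        count_diff[H[x]] -= 1
    # A partition is recoverable only if it holds exactly one missing element.
    return [xor_diff[i] for i in range(num_partitions) if count_diff[i] == 1]

A = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
B = [1, 2, 4, 5, 7, 8, 10]
H = {1: 2, 2: 1, 3: 1, 4: 0, 5: 0, 6: 0, 7: 2, 8: 1, 9: 2, 10: 1}
print(recover_with_partitions(A, B, 3, H))   # [6, 3, 9]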

Unfortunately, this approach alone doesn't work in practice. Here, we got lucky with our hash function. Usually, elements will not distribute evenly across partitions, and some partitions might still have multiple differences.

Detecting when the XOR trick fails

To fix the problems with naive partitioning, we will need to use a more sophisticated approach.

Let's generalize beyond "missing numbers" to the concept of symmetric difference. The symmetric difference of two sets A and B is the set of elements that are in either set but not in both: (A \ B) ∪ (B \ A), often written as A Δ B.

Figure 1: Venn diagram of the symmetric difference. A Δ B includes elements that are in either set A or set B, but not in both sets.

Consider this example:

A = [1,2,3,4,6]
B = [1,2,4,5]

The symmetric difference is {3, 6, 5}—elements 3 and 6 are only in A, while element 5 is only in B.

The naive approach of checking |COUNT(A) - COUNT(B)| = 1 fails here because the count difference is 1, but the symmetric difference has 3 elements. Applying the basic XOR trick gives a meaningless result.

To detect this case, we can use an additional accumulator with a hash function:

COUNT(A) = 5, XOR(A) = 2
COUNT(B) = 4, XOR(B) = 2
XOR_HASH(A) = H(1) ^ H(2) ^ H(3) ^ H(4) ^ H(6)
XOR_HASH(B) = H(1) ^ H(2) ^ H(4) ^ H(5)

XOR(B) ^ XOR(A) = 0
XOR_HASH(B) ^ XOR_HASH(A) = H(3) ^ H(5) ^ H(6) 
H(3) ^ H(5) ^ H(6) ≠ H(0)

While we can't yet recover the full symmetric difference, we can detect when the XOR result is unreliable: if exactly one element differed, the XOR_HASH difference would equal the hash of the XOR difference, but here H(3) ^ H(5) ^ H(6) ≠ H(0) with high probability.
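In code, the consistency check looks roughly like this (a sketch with illustrative names; any reasonably uniform hash function works for H):

import hashlib

def H(x):
    # Illustrative hash; any well-mixed hash of the element works.
    return int.from_bytes(hashlib.sha256(str(x).encode()).digest()[:8], "big")

def xor_all(values):
    acc = 0
    for v in values:
        acc ^= v
    return acc

A = [1, 2, 3, 4, 6]
B = [1, 2, 4, 5]

id_diff = xor_all(A) ^ xor_all(B)                   # 0 here, which is misleading
hash_diff = xor_all(map(H, A)) ^ xor_all(map(H, B))
count_diff = len(A) - len(B)                        # 1 here, also misleading

# Trust the single-difference result only if all accumulators agree.
if count_diff in (1, -1) and hash_diff == H(id_diff):
    print("single differing element:", id_diff)
else:
    print("accumulators disagree: more than one element differs")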

Invertible Bloom Filters

The Core Idea

The original XOR trick relies on an accumulator that stores the XOR aggregate of a list. Building on this, we've introduced:

  1. Partitioning to divide sets into smaller parts
  2. Additional accumulators that store the element count and the XOR aggregate of element hashes

To fully generalize this into a robust data structure, we need:

  1. A partitioning scheme that creates recoverable partitions with high probability
  2. An iterative process that uses recovered values to unlock additional partitions

This is exactly what Invertible Bloom Filters [2] provide. IBFs use a Bloom filter-style hashing scheme to assign elements to multiple partitions, then employ a graph algorithm called "peeling" to iteratively recover the symmetric difference with very high probability [3].

Structure

An IBF consists of an array of cells, where each cell contains three accumulators:

  • idSum: XOR aggregate of element values
  • hashSum: XOR aggregate of element hashes
  • count: number of elements in the cell

Figure 2: Figure from [2] showing the process of encoding elements into IBF cells.
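To make this concrete, here is a simplified Python sketch of the cells and the encode step. It is my own illustration with made-up names (Cell, SketchIBF), not the implementation linked below; a real IBF would use carefully chosen, independent hash functions.

import hashlib

def H(x, salt=""):
    # Illustrative hash; any well-mixed hash of the element works.
    data = f"{salt}:{x}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

class Cell:
    def __init__(self):
        self.id_sum = 0     # XOR aggregate of element values
        self.hash_sum = 0   # XOR aggregate of element hashes
        self.count = 0      # number of elements mapped to this cell

class SketchIBF:
    def __init__(self, m, k):
        self.m, self.k = m, k                    # m cells, k index hashes
        self.cells = [Cell() for _ in range(m)]

    def _indices(self, x):
        # Bloom-filter-style hashing: each element is assigned to k cells.
        # A set is used so an element never hits the same cell twice,
        # which would make it XOR-cancel against itself.
        return {H(x, salt=str(i)) % self.m for i in range(self.k)}

    def insert(self, x):
        for i in self._indices(x):
            c = self.cells[i]
            c.id_sum ^= x
            c.hash_sum ^= H(x)
            c.count += 1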

Operations

For computing symmetric differences, IBFs support three key operations:

  1. Encode: Build an IBF from a set of values
  2. Subtract: Subtract one IBF from another—identical values cancel out, leaving only the symmetric difference
  3. Decode: Recover stored values by finding "pure" cells (count = ±1 and H(idSum) = hashSum) and iteratively "peeling" them

The IBF has one parameter d—the expected size of the symmetric difference. With proper sizing (typically m > 1.22d cells), IBFs recover the full symmetric difference with very high probability.

Figures: the Encode, Subtract, and Decode algorithms for each IBF operation, from [2].
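Continuing the sketch above, the subtract and decode operations might look roughly like this, written as additional methods of the SketchIBF class (illustrative only; decode mutates the filter in place, and a production decoder needs more care, see the linked implementation below):

    def subtract(self, other):
        # Cell-by-cell subtraction: elements present in both filters cancel,
        # leaving only the symmetric difference encoded in the cells.
        result = SketchIBF(self.m, self.k)
        for r, a, b in zip(result.cells, self.cells, other.cells):
            r.id_sum = a.id_sum ^ b.id_sum
            r.hash_sum = a.hash_sum ^ b.hash_sum
            r.count = a.count - b.count
        return result

    def decode(self):
        # A cell is "pure" when it holds exactly one element of the difference:
        # count is +1 or -1 and the hash accumulator matches the id accumulator.
        only_in_a, only_in_b = set(), set()
        progress = True
        while progress:
            progress = False
            for c in self.cells:
                if c.count in (1, -1) and c.hash_sum == H(c.id_sum):
                    x, sign = c.id_sum, c.count
                    (only_in_a if sign == 1 else only_in_b).add(x)
                    # "Peel" x out of every cell it was encoded into;
                    # this may turn other cells pure on the next pass.
                    for i in self._indices(x):
                        cell = self.cells[i]
                        cell.id_sum ^= x
                        cell.hash_sum ^= H(x)
                        cell.count -= sign
                    progress = True
        # Decoding succeeded only if every cell has been emptied.
        success = all(c.count == 0 and c.id_sum == 0 for c in self.cells)
        return success, only_in_a, only_in_b

Encoding two sets into filters of the same size with the same hash functions, subtracting, and decoding then mirrors the usage example shown below.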

Example Implementation

I have not found a solid, maintained library of IBFs in any language (if you know of any please let me know!), so I created a basic implementation in Python if you would like to play with it: https://gist.github.com/hundredwatt/a1e69ff300de941041d824e49249d3d7

Example usage:


# Two sets that differ in two elements: 2 is only in a, 3 is only in b
a = [1, 2, 4, 5, 6, 7, 9, 10]
b = [1, 3, 4, 5, 6, 7, 9, 10]
set_a = set(a)
set_b = set(b)
ibf_size = 100
k = 4

# Encode each set into its own IBF
ibf_a = IBF(size=ibf_size, k=k)
for item in a:
    ibf_a.insert(item)

ibf_b = IBF(size=ibf_size, k=k)
for item in b:
    ibf_b.insert(item)

# Subtract: identical elements cancel, leaving only the symmetric difference
diff_ibf = ibf_a.subtract_ibf(ibf_b)

# Decode: recover the elements on each side of the difference
success, decoded_added, decoded_removed = diff_ibf.decode()

assert(success)
expected_added = set_a - set_b
expected_removed = set_b - set_a

print("--- Verification ---")
print(f"Expected added (in A, not B):   {len(expected_added)} items")
print(f"Decoded added:                  {len(decoded_added)} items")
assert(expected_added == decoded_added)

print(f"Expected removed (in B, not A): {len(expected_removed)} items")
print(f"Decoded removed:                {len(decoded_removed)} items")
assert(expected_removed == decoded_removed)

About the "Set Reconciliation" Problem

The general problem of efficiently comparing two sets without transferring their entire contents is called "set reconciliation." Early research used polynomial approaches [4], while modern solutions typically employ either Invertible Bloom Filters or error-correction codes [5].

Conclusion

IBFs are a powerful tool for comparing sets with space complexity based only on the size of the difference. I hope this article has helped you understand how IBFs work and how they can be used to solve real-world problems.

If you want to learn more about IBFs, I recommend reading the references below. The "Further Reading" section contains some additional, advanced papers that I found interesting.

If you have any questions or feedback, please feel free to reach out to me, contact info is in the footer.

References

[1] Florian Hartmann, "That XOR Trick" (2020), https://florian.github.io/xor-trick/

[2] David Eppstein, Michael T. Goodrich, Frank Uyeda, and George Varghese, "What's the Difference? Efficient Set Reconciliation without Prior Context," SIGCOMM (2011), https://ics.uci.edu/~eppstein/pubs/EppGooUye-SIGCOMM-11.pdf

[3] Michael Molloy, "Cores in Random Hypergraphs and Boolean Formulas," Random Structures & Algorithms (2003), https://utoronto.scholaris.ca/server/api/core/bitstreams/371257de-0a44-4702-afeb-542ae9a06986/content

[4] Yaron Minsky and Ari Trachtenberg, "Set Reconciliation with Nearly Optimal Communication Complexity" (2000), https://ecommons.cornell.edu/server/api/core/bitstreams/c3fff828-cfb8-416a-a28b-8afa59dd2d73/content

[5] Pengyu Gong, Lei Yang, and Liwei Wang, "Space- and Computationally-Efficient Set Reconciliation via Parity Bitmap Sketch," VLDB (2021), https://vldb.org/pvldb/vol14/p458-gong.pdf

Further Reading

This paper extends the original IBF paper to support additional encoded fields (not just a single idSum) and a more robust decoding algorithm that I found to be necessary for production use.

This paper shows how to extend IBF-based reconciliation to efficiently handle three or more parties. The implementation is neat: it encodes the IBFs over finite fields whose dimension equals the number of parties.

This paper reduces the size of the IBF by eliminating the hashSum field. Instead, it uses a hash function that calculates the cell index as a checksum and has some additional checks in the decoding process to deal with potential "anomalies" that show up in this approach.

Standard IBFs require knowing the expected size of the symmetric difference beforehand to size the filter correctly. This recent work introduces "rateless" IBFs that adapt their size dynamically, making them more practical when the difference size is unknown or highly variable.
