Adaptive Hashing

Original link: https://quotenil.com/adaptive-hashing.html

Adaptive hashing improves hash table performance by dynamically adjusting the hash function to the stored data. Traditional hash table theory focuses on the worst case with a randomly chosen hash function, but practical performance is heavily influenced by constant factors. Adaptive hashing aims to optimize these factors by learning the distribution of the keys and switching online to a better-suited hash function. SBCL's hash tables were modified to incorporate adaptive hashing. For EQ hash tables (which compare objects by memory address), the system starts with linear search, then upgrades to a shift-based hash optimized for sequentially allocated objects. If that fails, it falls back to SBCL's previous EQ hash, and ultimately to a general-purpose hash function such as Murmur. For EQUAL hash tables (which compare contents), the algorithm initially hashes only part of a string or list and increases the amount hashed if collisions become too frequent. This approach balances performance and robustness, providing faster common-case performance and improved resilience to skewed data.

The Hacker News discussion centers on how adaptive hashing optimizes hash table performance. Commenter tptacek noted that the term "adaptive hashing" was previously used for algorithms like bcrypt, which deliberately slow hashing down for security, prompting some amusement that algorithms with opposite goals share the same name. The conversation then turned to Quicksort as another, lesser-known example of an adaptive algorithm. One user praised the adaptive hashing implementation for letting them remove a performance hack from their own code. A key point raised was that hashing only the first two and last two characters of a string is not ideal; users suggested adjusting the number of hashed characters dynamically based on the collision rate, to improve hash quality as the table grows.

Original article

Tags: tech, lisp
Date: 2025-05-02

At the 2024 ELS, I gave a talk on adaptive hashing, which focusses on making general purpose hash tables faster and more robust at the same time.

Theory vs Practice

Hash table theory mostly concerns itself with the asymptotic worst-case cost with a hash function chosen randomly from a family of hash functions. Although these results are very relevant in practice,

  • those pesky constant factors, that the big-O cost ignores, do matter, and

  • we don't pick hash functions randomly but fix the hash function for the lifetime of the hash table.

There are Perfect Hashing algorithms that choose an optimal hash function for a given set of keys. The drawback is that they either require the set of keys to be fixed in advance or are too slow to be used in general purpose hash tables.

Still, the idea that we can do better by adapting the hash function to the actual keys is key. Can we do that online, that is, while the hash table is being used? Potential performance gains come from improving the constant factors mentioned above by

  • having fewer collisions, and

  • being more cache-friendly.

[Figure: cache-friendliness-regret]

[Figure: cache-friendliness-put]

The images above plot regret (the expected number of comparisons per lookup minus the minimum achievable) and the measured run time of PUT operations against the number of keys in the hash table, for a particular key distribution. Green is Murmur (a robust hash function); blue is SBCL's expedient EQ hash. The wiggling of the graphs is due to the hash table being resized as keys are added.

Note how SBCL's regret starts out much lower than Murmur's and eventually becomes much higher, yet in terms of run time its advantage, if anything, grows.

Implementation

The general idea is sound, but turning it into real performance gains is hard due to the cost of choosing a hash function and switching to it. First, we have to make some assumption about the distribution of keys. In fact, default hash functions in language runtimes often make such assumptions to make the common cases faster, usually at the cost of weakened worst-case guarantees.

The rest of this post is about how SBCL's built-in hash tables, which had been reasonably fast, were modified. The core switching mechanism looks at

  • the length of the collision chain on PUT operations,

  • the collision count on rehash (when the hash table is grown), and

  • the size of the hash table.
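These triggers might be sketched as follows. This is a Python illustration with hypothetical names and thresholds, not SBCL's actual code; the point is only that the PUT-time check is cheap and the more thorough check rides along with a rehash that has to happen anyway.

```python
# Hypothetical thresholds for illustration; SBCL's real values differ.
MAX_CHAIN_LENGTH = 10       # longest tolerated collision chain on PUT
MAX_COLLISION_RATIO = 0.5   # tolerated collisions/keys ratio on rehash

def should_switch_on_put(chain_length: int, table_size: int) -> bool:
    """Called on PUT: a long collision chain in a large enough table
    suggests the current hash function fits the keys poorly."""
    return table_size > 32 and chain_length > MAX_CHAIN_LENGTH

def should_switch_on_rehash(n_collisions: int, n_keys: int) -> bool:
    """Called when the table is grown and rehashed anyway, so checking
    (and switching hash functions) here is nearly free."""
    return n_keys > 0 and n_collisions / n_keys > MAX_COLLISION_RATIO
```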

Adapting EQ hash tables

  1. Initialize to a constant hash function. This is a fancy way of saying that we do a linear search in a vector internally. This is an EQ hash table, so key comparison is a single assembly instruction.

  2. When the hash table is grown to more than 32 keys and it must be rehashed anyway, we switch to a hash function that does a single right shift, with the number of bits to shift determined from the longest common run of low bits in the keys.

  3. If there are still too many collisions, we switch to the previous default SBCL EQ hash function, which has been tuned for a long time.

  4. If there are still too many collisions, we switch to Murmur, a general purpose hash. We could also go all the way to cryptographic hashes.

In step 2, the hash function with the single shift fits the memory allocator's behaviour nicely: it is a perfect hash for keys forming arithmetic sequences, which is often approximately true for objects of the same type allocated in a loop.
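A small sketch of this idea, with hypothetical helper names (SBCL's implementation differs): for 16-byte objects allocated back to back, addresses form an arithmetic sequence whose low bits are constant, and shifting those bits away spreads the keys perfectly over the buckets.

```python
def common_low_bit_run(keys: list[int]) -> int:
    """Number of trailing bits shared by all keys: OR together each
    key XORed with the first; the lowest set bit of the result marks
    the first position where any two keys differ."""
    acc = 0
    for k in keys:
        acc |= k ^ keys[0]
    if acc == 0:
        return 0
    return (acc & -acc).bit_length() - 1

def shift_hash(key: int, shift: int, mask: int) -> int:
    return (key >> shift) & mask

# Addresses of 16-byte objects allocated consecutively:
addrs = [0x1000 + 16 * i for i in range(8)]
shift = common_low_bit_run(addrs)           # 4: the low 4 bits are constant
buckets = {shift_hash(a, shift, 7) for a in addrs}
# All 8 keys land in distinct buckets: a perfect hash for this sequence.
```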

[Figure: adaptive-hash-eq-put]

In this figure, the red line is the adaptive hash.

Adapting EQUAL hash tables

For composite keys, running the hash function is the main cost. Adaptive hashing does the following.

  • For string keys, hash only the first and last 2 characters.

  • For list keys, only hash the first 4 elements.

  • If too many collisions, double the limit.
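A toy version of this scheme for string keys might look like the following. The names and the mixing function are hypothetical; SBCL's actual EQUAL hash differs. The point is the graceful degradation: keys that defeat a small limit are separated once the limit has been doubled enough times.

```python
def string_hash(s: str, limit: int = 2) -> int:
    """Hash only the first and last `limit` characters of `s`."""
    h = len(s)  # the length disambiguates strings with equal ends
    for ch in s[:limit]:          # first `limit` characters
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    for ch in s[-limit:]:         # last `limit` characters
        h = (h * 31 + ord(ch)) & 0xFFFFFFFF
    return h

# "user-1-record" and "user-2-record" have equal length and share their
# first two and last two characters, so they collide at limit=2 (and at
# limit=4); doubling the limit twice (2 -> 4 -> 8) separates them.
assert string_hash("user-1-record", 2) == string_hash("user-2-record", 2)
assert string_hash("user-1-record", 8) != string_hash("user-2-record", 8)
```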

[Figure: adaptive-hash-string-put]


So, SBCL hash tables have been adaptive for almost a year now, gaining some speed in common cases, and robustness in others.

The full paper is here.
