The Privacy Theater of Hashed PII

原始链接: https://matthodges.com/posts/2025-10-19-privacy-theater-pii-phone-numbers/

Many marketing and ad-tech companies mistakenly treat hashing personally identifiable information (PII), such as email addresses or phone numbers, as a privacy protection measure. This is "privacy theater," because hashing is only secure when the input is truly random and sufficiently long, while PII is predictable and drawn from a limited space. Although hashing PII is intended to enable private set intersection (matching data without disclosing it), it is easily broken. A fairly ordinary laptop can build a "rainbow table," a precomputed lookup table, in a matter of hours, reversing the hashes of common PII formats such as phone numbers. Tools like DuckDB make this straightforward, quickly identifying the original data behind a hash. Salting (adding random data before hashing) does not solve the problem either: a shared salt lets the lookup table be rebuilt, while per-record salts make matching impossible without an expensive, leak-prone brute-force effort. Research has already demonstrated the feasibility of hashing and indexing billions of phone numbers. The core problem is not the hash *algorithm* itself, but its application to low-entropy data. The practice therefore offers a false sense of security and does not genuinely protect user privacy.

## Hashed PII: A Privacy Farce - Summary

A Hacker News discussion centers on the flawed practice of companies hashing personally identifiable information (PII) as a supposed privacy measure. The core argument is that hashing alone provides no real privacy, because modern consumer hardware (such as an RTX 4090) can crack the hashes quickly, and techniques like password monitoring already rely on stronger methods such as fully homomorphic encryption (FHE). Many commenters note that the practice is not driven by genuine privacy concerns but by data buyers wanting to avoid paying for data they already own, and data sellers wanting to avoid disclosing data up front. Several examples of incompetence are cited, such as an employee survey falsely advertised as anonymous but linked to email hashes, suggesting a lack of understanding rather than malicious intent, though skepticism remains. The discussion also points to a broader problem in marketing and ad tech: a superficial reading of GDPR leads to "compliance" through cosmetic measures like hashing while companies still attempt to track users via various identifiers. While salted hashing improves security, it is of little help for a bounded input space such as phone numbers, and it also prevents the data sharing needed to identify common customers. Ultimately, the consensus is that real privacy requires explicit consent and a fundamental shift away from pervasive data matching.

Original Article

About once a year, I’m reminded of the fact that a lot of marketing SaaS and ad tech dresses up cryptographic hashes as a sort of privacy theater. This shows up frequently in product features for suppression lists with the general idea of uploading hashed values of email addresses or phone numbers to enable matching while preserving privacy. The problem is, hashing PII does not protect privacy.

The long and short of it is: hashing is only effective if the input data is unbounded. It’s why long and unpredictable passwords are necessary, even with a robust hashing function. PII is neither long nor unpredictable.

  • You can download every baby name going back to 1880 from the Social Security Administration.
  • Email addresses follow the format [email protected].
  • Social Security numbers are 9 digits, so there are at most 1 billion.
  • North American phone numbers are 10 digits, so there are at most 10 billion.
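The email case is the easiest to see: given public name lists and a handful of common formats and domains, the candidate space is tiny. A toy sketch of the idea, with invented names, domain, and target address:

```python
import hashlib

# Hypothetical leaked suppression-list entry: the MD5 of an invented address.
target = hashlib.md5(b"jane.doe@example.com").hexdigest()

first_names = ["john", "jane", "alex"]   # in practice: SSA baby-name data
last_names = ["smith", "doe", "garcia"]  # in practice: census surname lists

# Enumerate firstname.lastname@domain and compare hashes.
recovered = None
for first in first_names:
    for last in last_names:
        guess = f"{first}.{last}@example.com".encode()
        if hashlib.md5(guess).hexdigest() == target:
            recovered = guess.decode()

print("recovered:", recovered)  # recovered: jane.doe@example.com
```

Real attacks just scale the same loop across more names, separators, and domains.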

Despite this, marketing tools still shuffle around PII hashes of this data. For example, here’s BambooHR:

In order to better identify any shared customers we may have, we have decided to compare our customer lists encoded as MD5 Hashes. By encoding our respective customer lists in MD5 Hashes, we will be able to compare customer lists without disclosing any customer info (including customer name).

And platforms like UnsubCentral:

Manually entering in phone numbers to a suppression tool is a waste of time and resources. Our tool can take plain text and compare it against MD5 or SHA hashed lists of phone numbers – simply throw in the data, and it will do the hard work for you.

Everyone is trying to do private set intersection, but doing this with hash-passing is trivially broken on modern consumer hardware. And you don’t even need special password cracking software to break it.

On a laptop, we can build a rainbow table of Parquet files for every North American phone number. We can abuse DuckDB as a hashing mill and generate every MD5 in the 5XX area code block.
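The author's DuckDB query did not survive extraction here, but the idea can be sketched in plain Python with hashlib (DuckDB's `md5()` function does the same work far faster, in parallel, and can write the results straight to Parquet):

```python
import hashlib

def md5_block(area_code: str, start: int, count: int):
    """Yield (phone_number, md5_hex) pairs for a slice of one area code."""
    for i in range(start, start + count):
        number = f"{area_code}{i:07d}"  # e.g. 5550000000 .. 5559999999
        yield number, hashlib.md5(number.encode()).hexdigest()

# A tiny slice of the 555 block; the full 5XX sweep is the same loop
# over every 5XX area code and all 10 million line numbers in each.
table = {h: n for n, h in md5_block("555", 0, 1000)}

# Reversing a hash is now a dictionary lookup.
probe = hashlib.md5(b"5550000042").hexdigest()
print(table[probe])  # 5550000042
```

Once the table exists, every "protected" hash in a suppression list resolves to its plaintext in constant time.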

MD5 is cryptographically broken, but this approach is hash-agnostic. The problem is the application on low-entropy input data, not the specific hashing algorithm.

File scanning here is just a proof-of-concept. For true query speed you can trade file portability for a database index. Once your Parquet files are built, you can load them into DuckDB as a persistent table and create an index on the hash column. Or, Postgres with a B-tree index is also a great fit.
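As a stand-in for the DuckDB or Postgres setup the author describes, the same indexed-lookup pattern can be sketched with Python's built-in sqlite3 (the table and column names are invented for illustration):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainbow (hash TEXT, phone TEXT)")

# Insert a small sample; a real table would hold billions of rows.
rows = []
for i in range(1000):
    phone = f"555{i:07d}"
    rows.append((hashlib.md5(phone.encode()).hexdigest(), phone))
conn.executemany("INSERT INTO rainbow VALUES (?, ?)", rows)

# The index turns each reverse lookup into a B-tree probe instead of a scan.
conn.execute("CREATE INDEX idx_hash ON rainbow (hash)")

probe = hashlib.md5(b"5550000042").hexdigest()
(found,) = conn.execute(
    "SELECT phone FROM rainbow WHERE hash = ?", (probe,)
).fetchone()
print(found)  # 5550000042
```

The design trade-off is exactly the one the paragraph names: the Parquet files are portable, while the indexed database answers each lookup without rescanning the files.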

On my 2020 M1 MacBook Air, computing 1 billion 5XX phone number hashes completes in about 40 minutes. There are about 6.3 billion valid North American phone numbers, so building a hash lookup table for all of them would take my little laptop just over 4 hours.

Salting does not solve this problem. With per-record random salts, identical identifiers hash differently across parties, so no deterministic equality join is possible. Publishing those salts still doesn't enable a direct join: the only viable path is recover-then-join, which costs \(\sim D \times (N + M)\), where \(D\) is the size of the identifier domain and \(N\) and \(M\) are the two parties' list sizes. That's operationally impractical for a bulk-matching service, but practically feasible for an adversary performing offline, per-record brute force over time. It also reveals non-matches, because recovery exposes plaintexts beyond the overlap. If parties instead share a global salt to keep hashes comparable, the domain is trivially enumerable and a lookup table can be rebuilt.
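A toy illustration of the global-salt failure mode, with an invented salt and phone numbers: anyone who holds the shared salt (which both parties must, for the hashes to be comparable) rebuilds the lookup table with the salt baked in.

```python
import hashlib

SHARED_SALT = b"partner-program-2025"  # hypothetical salt both parties share

def salted(phone: bytes) -> str:
    """MD5 over salt || phone, the scheme both parties would have to agree on."""
    return hashlib.md5(SHARED_SALT + phone).hexdigest()

# The "protected" value one party ships to the other.
shipped = salted(b"5550000042")

# The salt is just a constant prefix: enumerate the domain exactly as before.
table = {salted(f"555{i:07d}".encode()): f"555{i:07d}" for i in range(1000)}
print(table[shipped])  # 5550000042
```

The salt changes which table you build, not whether you can build it.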

None of this is breaking news. In 2021, researchers hashed 118 billion phone numbers:

The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers

The thing is, you really don’t need a high-performance cluster anymore.
