The Privacy Theater of Hashed PII

原始链接: https://matthodges.com/posts/2025-10-19-privacy-theater-pii-phone-numbers/

Many marketing and ad-tech companies mistakenly treat hashing personally identifiable information (PII), such as email addresses or phone numbers, as a privacy protection measure. This is "privacy theater," because hashing is only secure when the input is truly random and sufficiently long, while PII is predictable and drawn from a limited space. Although hashing PII is intended to enable private set intersection (matching data without disclosing it), it is easily broken. A fairly ordinary laptop can build a "rainbow table," a precomputed lookup table, in a matter of hours, reversing the hashes of common PII formats such as phone numbers. Tools like DuckDB make this straightforward, quickly identifying the original data behind a hash. Salting (adding random data before hashing) does not solve the problem either: a shared salt lets the lookup table be rebuilt, while per-record salts make matching impossible without an expensive, leak-prone brute-force effort. Research has already demonstrated the feasibility of hashing and indexing billions of phone numbers. The core problem is not the hash *algorithm* itself, but its application to low-entropy data. The practice therefore offers a false sense of security and does not genuinely protect user privacy.

## Hashed PII: A Privacy Farce - Summary

A Hacker News discussion centers on the flawed practice of companies hashing personally identifiable information (PII) as a supposed privacy measure. The core argument is that hashing alone provides no real privacy, because modern consumer hardware (such as an RTX 4090) can crack the hashes quickly, and techniques like password monitoring already rely on stronger methods such as fully homomorphic encryption (FHE). Many commenters note that the practice is not driven by genuine privacy concerns but by data buyers wanting to avoid paying for data they already own, and data sellers wanting to avoid disclosing data up front. Several examples of incompetence are cited, such as an employee survey falsely advertised as anonymous but linked to email hashes, suggesting a lack of understanding rather than malicious intent, though skepticism remains. The discussion also points to a broader problem in marketing and ad tech: a superficial reading of GDPR leads to "compliance" through cosmetic measures like hashing while companies still attempt to track users via various identifiers. While salted hashing improves security, it is of little help for a bounded input space such as phone numbers, and it also prevents the data sharing needed to identify common customers. Ultimately, the consensus is that real privacy requires explicit consent and a fundamental shift away from pervasive data matching.

Original Article

About once a year, I’m reminded of the fact that a lot of marketing SaaS and ad tech dresses up cryptographic hashes as a sort of privacy theater. This shows up frequently in product features for suppression lists with the general idea of uploading hashed values of email addresses or phone numbers to enable matching while preserving privacy. The problem is, hashing PII does not protect privacy.

The long and short of it is: hashing is only effective if the input data is unbounded. It’s why long and unpredictable passwords are necessary, even with a robust hashing function. PII is neither long nor unpredictable.

  • You can download every baby name going back to 1880 from the Social Security Administration.
  • Email addresses follow the format [email protected].
  • Social Security numbers are 9 digits, so there are at most 1 billion.
  • North American phone numbers are 10 digits, so there are at most 10 billion.
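The email case is the easiest to see: given public name lists and a handful of common formats and domains, the candidate space is tiny. A toy sketch of the idea, with invented names, domain, and target address:

```python
import hashlib

# Hypothetical leaked suppression-list entry: the MD5 of an invented address.
target = hashlib.md5(b"jane.doe@example.com").hexdigest()

first_names = ["john", "jane", "alex"]   # in practice: SSA baby-name data
last_names = ["smith", "doe", "garcia"]  # in practice: census surname lists

# Enumerate firstname.lastname@domain and compare hashes.
recovered = None
for first in first_names:
    for last in last_names:
        guess = f"{first}.{last}@example.com".encode()
        if hashlib.md5(guess).hexdigest() == target:
            recovered = guess.decode()

print("recovered:", recovered)  # recovered: jane.doe@example.com
```

Real attacks just scale the same loop across more names, separators, and domains.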

Despite this, marketing tools still shuffle around PII hashes of this data. For example, here’s BambooHR:

In order to better identify any shared customers we may have, we have decided to compare our customer lists encoded as MD5 Hashes. By encoding our respective customer lists in MD5 Hashes, we will be able to compare customer lists without disclosing any customer info (including customer name).

And platforms like UnsubCentral:

Manually entering in phone numbers to a suppression tool is a waste of time and resources. Our tool can take plain text and compare it against MD5 or SHA hashed lists of phone numbers – simply throw in the data, and it will do the hard work for you.

Everyone is trying to do private set intersection, but doing this with hash-passing is trivially broken on modern consumer hardware. And you don’t even need special password cracking software to break it.

On a laptop, we can build a rainbow table of Parquet files for every North American phone number. We can abuse DuckDB as a hashing mill and generate every MD5 in the 5XX area code block.
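The author's DuckDB query did not survive extraction here, but the idea can be sketched in plain Python with hashlib (DuckDB's `md5()` function does the same work far faster, in parallel, and can write the results straight to Parquet):

```python
import hashlib

def md5_block(area_code: str, start: int, count: int):
    """Yield (phone_number, md5_hex) pairs for a slice of one area code."""
    for i in range(start, start + count):
        number = f"{area_code}{i:07d}"  # e.g. 5550000000 .. 5559999999
        yield number, hashlib.md5(number.encode()).hexdigest()

# A tiny slice of the 555 block; the full 5XX sweep is the same loop
# over every 5XX area code and all 10 million line numbers in each.
table = {h: n for n, h in md5_block("555", 0, 1000)}

# Reversing a hash is now a dictionary lookup.
probe = hashlib.md5(b"5550000042").hexdigest()
print(table[probe])  # 5550000042
```

Once the table exists, every "protected" hash in a suppression list resolves to its plaintext in constant time.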

MD5 is cryptographically broken, but this approach is hash-agnostic. The problem is the application on low-entropy input data, not the specific hashing algorithm.

File scanning here is just a proof-of-concept. For true query speed you can trade file portability for a database index. Once your Parquet files are built, you can load them into DuckDB as a persistent table and create an index on the hash column. Or, Postgres with a B-tree index is also a great fit.
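As a stand-in for the DuckDB or Postgres setup the author describes, the same indexed-lookup pattern can be sketched with Python's built-in sqlite3 (the table and column names are invented for illustration):

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rainbow (hash TEXT, phone TEXT)")

# Insert a small sample; a real table would hold billions of rows.
rows = []
for i in range(1000):
    phone = f"555{i:07d}"
    rows.append((hashlib.md5(phone.encode()).hexdigest(), phone))
conn.executemany("INSERT INTO rainbow VALUES (?, ?)", rows)

# The index turns each reverse lookup into a B-tree probe instead of a scan.
conn.execute("CREATE INDEX idx_hash ON rainbow (hash)")

probe = hashlib.md5(b"5550000042").hexdigest()
(found,) = conn.execute(
    "SELECT phone FROM rainbow WHERE hash = ?", (probe,)
).fetchone()
print(found)  # 5550000042
```

The design trade-off is exactly the one the paragraph names: the Parquet files are portable, while the indexed database answers each lookup without rescanning the files.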

On my 2020 M1 MacBook Air, computing 1 billion 5XX phone number hashes completes in about 40 minutes. There are about 6.3 billion valid North American phone numbers, so building a hash lookup table for all of them would take my little laptop just over 4 hours.

Salting does not solve this problem. With per-record random salts, identical identifiers hash differently across parties, so no deterministic equality join is possible. Publishing those salts still doesn't enable a direct join: the only viable path is recover-then-join, which costs \(\sim D \times (N + M)\), where \(D\) is the size of the identifier domain and \(N\) and \(M\) are the two parties' list sizes. That's operationally impractical for a bulk-matching service, but practically feasible for an adversary performing offline, per-record brute force over time. It also reveals non-matches, because recovery exposes plaintexts beyond the overlap. If parties instead share a global salt to keep hashes comparable, the domain is trivially enumerable and a lookup table can be rebuilt.
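A toy illustration of the global-salt failure mode, with an invented salt and phone numbers: anyone who holds the shared salt (which both parties must, for the hashes to be comparable) rebuilds the lookup table with the salt baked in.

```python
import hashlib

SHARED_SALT = b"partner-program-2025"  # hypothetical salt both parties share

def salted(phone: bytes) -> str:
    """MD5 over salt || phone, the scheme both parties would have to agree on."""
    return hashlib.md5(SHARED_SALT + phone).hexdigest()

# The "protected" value one party ships to the other.
shipped = salted(b"5550000042")

# The salt is just a constant prefix: enumerate the domain exactly as before.
table = {salted(f"555{i:07d}".encode()): f"555{i:07d}" for i in range(1000)}
print(table[shipped])  # 5550000042
```

The salt changes which table you build, not whether you can build it.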

None of this is breaking news. In 2021, researchers hashed 118 billion phone numbers:

The limited amount of possible mobile phone numbers combined with the rapid increase in affordable storage capacity makes it feasible to create key-value databases of phone numbers indexed by their hashes and then to perform constant-time lookups for each given hash value. We demonstrate this by using a high-performance cluster to create an in-memory database of all 118 billion possible mobile phone numbers

The thing is, you really don’t need a high-performance cluster anymore.
