US bans differential privacy in Census data

Original link: https://desfontain.es/blog/banning-noise.html

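The U.S. Department of Commerce recently banned "noise infusion" (a key component of differential privacy) from the statistical products of the Census Bureau and the Bureau of Economic Analysis. The order requires these agencies to turn to "coarsening" (reducing data precision) and "suppression" (removing data) as their primary methods for protecting confidential information. Differential privacy is widely considered the gold standard for balancing data utility and privacy. It uses calibrated noise to prevent individual records from being reconstructed, whereas earlier methods such as data swapping are vulnerable to reconstruction and are no longer considered safe. By banning techniques that rely on randomness, the government is forcing agencies to give up the most effective privacy risk mitigation tool currently available. Critics argue that the order creates a disastrous trade-off: future statistical releases will either be seriously unsafe or functionally useless, especially for data about small population groups. Because the competing methods are blunter and less effective against modern reconstruction attacks, the ban may keep researchers from tracking demographic disparities. Whether the motive is a political agenda (such as gerrymandering) or an attempt to dodge the hard trade-off between privacy and utility, the directive significantly lowers the quality and safety of U.S. government data.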

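A recent Hacker News discussion explored the controversy over a possible ban on "differential privacy" (a technique for masking individual identities in large datasets) in U.S. Census data. The debate highlights a fundamental tension between transparency and data utility. Proponents of raw, unmasked data argue that data should be public by default, and that data too sensitive to release should not be collected in the first place. Others argue instead that publishing fine-grained census data is a security and ethical failure that would enable identification of individuals, manipulation of district boundaries, and disenfranchisement. A core point of agreement among many commenters is that the accuracy of the census depends on public trust. If citizens fear that sensitive information (such as income, disability status, or citizenship) will be leaked, they are likely to lie or refuse to participate, leaving census data unusable for federal funding allocation and congressional apportionment. Ultimately, the discussion suggests that balancing the government's need for accurate aggregate data against individual privacy remains an "impossible" problem, with no simple solution that satisfies both the political and the mathematical constraints.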

Original text

Last week, the United States Department of Commerce issued an order declaring that "noise infusion" will be banned from all statistical products published by the Census Bureau and the Bureau of Economic Analysis.

[Screenshot of the order mentioned in the article. It reads:]

a. The Department shall, as a primary objective, aim to fulfill its statistical obligations by providing the public with accurate and objective information.
b. The Department is firmly committed to striking a balance of accuracy, confidentiality, objectivity, and relevance for each statistical product that is consistent with its statistical obligations and the applicable legal requirements.
c. Any use of noise infusion is inconsistent with the Department’s policies.

02 The Census Bureau and the Bureau of Economic Analysis shall adhere to the following order of priority when considering and applying Disclosure Avoidance:
a. Coarsening shall be the preferred category of Disclosure Avoidance methods for all statistical products.
b. Suppression shall be permitted as a last resort, only to be used when coarsening is prohibited by law or would substantially defeat the accuracy or usability of a statistical product.
c. Noise infusion shall not be used for any statistical product.

What does it mean, and why should you care?

Statistical products are a bunch of numbers published from a secret dataset. Often, that dataset contains confidential information, and it is important that the numbers don't reveal that information. The U.S. Census is a well-known example: the statistics are made public, but the contents of each form filled by individual U.S. residents must stay secret.

Scientists have developed a number of techniques that can be used to publish useful statistics while protecting the privacy of the original data. This field is called disclosure avoidance in statistical communities. Here are a few of these techniques.

Suppression: removing data that doesn't pass certain thresholds (e.g. if a count of people is below 5, we don't publish it).
Coarsening (or generalization): making data attributes less precise (e.g. transform a county into its state, a date of birth into an age range, etc.).
Sampling: randomly removing some records from the dataset.
Swapping: taking attributes from different records and exchanging them randomly.
Contribution bounding: making sure that a single individual cannot contribute "too much" to a statistic by limiting their maximum impact.
Noise addition: adding a random number to statistics to hide their true value.
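
To make three of the techniques above concrete, here is a minimal Python sketch of suppression, coarsening, and noise addition applied to a toy table of counts. Everything in it is an assumption made for illustration: the records, the threshold of 5, the 20-year age bands, and the helper names (`count_by`, `suppress`, `coarsen_age`, `add_noise`) are hypothetical and do not come from any statistical agency's tooling.

```python
import random

# Toy records: (county, state, age). All values are made up for illustration.
records = [
    ("Adams", "WA", 34), ("Adams", "WA", 36), ("Adams", "WA", 71),
    ("Benton", "WA", 29), ("Benton", "WA", 33), ("Benton", "WA", 38),
    ("Benton", "WA", 41), ("Benton", "WA", 45), ("Benton", "WA", 52),
]

def count_by(records, key):
    """Count records per group, where `key` maps a record to its group."""
    counts = {}
    for r in records:
        group = key(r)
        counts[group] = counts.get(group, 0) + 1
    return counts

def suppress(counts, threshold=5):
    """Suppression: do not publish any count below the threshold."""
    return {k: v for k, v in counts.items() if v >= threshold}

def coarsen_age(age, width=20):
    """Coarsening: replace an exact age with a range such as '20-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"

def add_noise(counts, scale=1.0, rng=random):
    """Noise addition: add a random number to each count.
    The difference of two exponential draws is Laplace-distributed."""
    return {
        k: v + rng.expovariate(1 / scale) - rng.expovariate(1 / scale)
        for k, v in counts.items()
    }

by_county = count_by(records, key=lambda r: r[0])
print(suppress(by_county))                                 # {'Benton': 6}
print(count_by(records, key=lambda r: r[1]))               # county coarsened to state
print(count_by(records, key=lambda r: coarsen_age(r[2])))  # age coarsened to a range
print(add_noise(by_county, rng=random.Random(0)))          # noisy county counts
```

Note how suppression and coarsening both discard information outright (the small county disappears, or every county collapses into its state), while noise addition keeps every cell and makes each one slightly wrong.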

Some of these techniques, when combined, achieve a definition called differential privacy. This definition has a lot of nice fundamental properties and is widely considered the gold standard of privacy protection among scientists. To achieve it, scientists typically rely on a combination of contribution bounding and carefully-calibrated noise addition.
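
As a rough illustration of that combination, the sketch below implements the textbook Laplace mechanism for a sum: each person's value is clamped to a fixed range (contribution bounding), and Laplace noise scaled to that range divided by the privacy parameter epsilon is added (calibrated noise). This is a simplified teaching example, not the algorithm the Census Bureau used. The function names and the example values are hypothetical, it assumes each individual contributes exactly one value, and it ignores the floating-point subtleties that production libraries have to handle.

```python
import random

def clamp(x, lower, upper):
    """Contribution bounding: force a value into [lower, upper]."""
    return max(lower, min(upper, x))

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1 / scale) - rng.expovariate(1 / scale)

def dp_sum(values, lower, upper, epsilon, rng=None):
    """Noisy sum of `values`, assuming one value per individual.

    After clamping, adding or removing one person changes the sum by at
    most max(|lower|, |upper|), so Laplace noise with scale
    sensitivity / epsilon is calibrated to hide any one person's presence.
    """
    rng = rng or random.Random()
    bounded = [clamp(v, lower, upper) for v in values]
    sensitivity = max(abs(lower), abs(upper))
    return sum(bounded) + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical incomes in thousands; the outlier is clamped to 200 before summing.
incomes = [42, 58, 61, 75, 3000]
print(dp_sum(incomes, lower=0, upper=200, epsilon=1.0))
```

A smaller epsilon means more noise and stronger protection, and a larger one means the opposite. Having a single explicit knob for that trade-off is what the rest of this post refers to when it says differential privacy lets you quantify the choice.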

From 1990 to 2010, the U.S. Census Bureau primarily relied on swapping for the decennial census. Then, they realized that this technique was actually very unsafe, and that it was pretty easy to reconstruct individual records using the published statistics. This is bad, because the Bureau is required by federal law to keep these records confidential. So they tried a few alternative approaches, and decided to adopt differential privacy for the 2020 Census: this was the one that kept the statistics most useful, while preventing these attacks.

It bears repeating: differential privacy wasn't chosen because the math was nice and compelling. It was selected because among the different options that mitigated the attack, it was the one that preserved the most utility. Its exact privacy parameters were chosen not because they provided rock-solid provable guarantees, but because they squeezed the most usefulness out of the data while reaching an acceptable level of privacy protection.

Sadly, "preserved the most utility under newly-discovered privacy constraints" did not mean "preserved as much utility as the 2010 Census": the numbers got less accurate, and the inaccuracies got a lot more transparent, and therefore impossible to ignore. This made a number of people very angry.

Demographers and social scientists could no longer ignore that the data they were working with was noisy data. This required a major shift in how they conceptualized and worked with this data.
People who were using Census data to actually reconstruct records could no longer do so. Demographers admitted that this was common practice. It's also an open secret that this was done by political operatives as part of gerrymandering efforts.

Phew, that was a lot of context.

The administration has now decided that noise infusion is no longer an acceptable disclosure avoidance technique.

The order clearly targets differential privacy, but also seems to impact other techniques that involve randomness: the text explicitly mentions that coarsening should always be preferred, falling back to suppression as a "last resort". I have no idea why the order is so specific. Maybe they wanted to make sure the scientists working at the U.S. Census couldn't still use similar techniques without calling them differential privacy?

The order also carefully says it "shall not be interpreted to conflict with any constitutional, statutory, regulatory, or other legal provision". So the confidentiality obligations surrounding these statistical products still apply.

The consequences will be dire for utility or for privacy, and possibly both. It's hard to overstate this point: future statistical releases will either be useless compared to past ones, or they will be incredibly unsafe.

For starters, taking away useful tools from the disclosure avoidance toolbox will always lead to more painful privacy/utility trade-offs. The whole point of this research field is to better understand and quantify privacy risk, and develop better tools to mitigate this risk while preserving utility.

For statistical releases, differential privacy is simply the best tool we have right now. It provides a finer way of quantifying trade-offs, and allows us to get more utility out of the data than competing techniques at similar privacy levels. If you take it away, you're left with techniques that either have worse utility at similar levels of privacy, or worse privacy for the same utility.

But all competing techniques also rely on noise addition. The Cell Key method, used at other statistical agencies, adds noise to statistics. Swapping, used from 1990 to 2010 for the U.S. Census, also injects randomness into the process. Sampling is everywhere in statistical work. Hell, even imputation technically adds noise to the data!

By contrast, coarsening and suppression are very blunt instruments. They only work in situations where the statistics are already very coarse, and not too many of them are published. For complex data products with many statistics about small groups of people (like the U.S. Census), they either destroy all utility of the data (especially for minority populations), or are very vulnerable to privacy attacks.

It makes sense: privacy attacks on statistical releases are about solving a system of equations. That is a much easier task when you know for sure that the statistics are all perfectly accurate. Noise forces you to compute probabilities, quantify the uncertainty, carefully consider baselines, and so on. That's why randomness is such a useful tool for disclosure avoidance! Even without formal guarantees, it makes attacks a lot harder. Take it away and attacks become trivial.
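
Here is a toy illustration of the "system of equations" view, written as a brute-force search in Python. It assumes a hypothetical block of three people whose minimum age, median age, and sum of ages are published. Noise is modeled very crudely as a bounded uncertainty of ±2 on each statistic. That is a simplification: real noise distributions are typically unbounded, and a real attacker would reason with probabilities rather than hard bounds. The name `consistent_datasets`, the published values, and the age cap of 115 are all made up for this example.

```python
from itertools import combinations_with_replacement

MAX_AGE = 115  # assumed upper bound on age, for illustration only

def consistent_datasets(published, tolerance=0):
    """Return every sorted triple of ages whose statistics are each
    within `tolerance` of the corresponding published value."""
    matches = []
    # combinations_with_replacement yields non-decreasing tuples, so
    # ages[0] is the minimum and ages[1] is the median.
    for ages in combinations_with_replacement(range(MAX_AGE + 1), 3):
        stats = {"min": ages[0], "median": ages[1], "sum": sum(ages)}
        if all(abs(stats[k] - published[k]) <= tolerance for k in published):
            matches.append(ages)
    return matches

published = {"min": 24, "median": 31, "sum": 122}

# Exact statistics: the equations have a single solution.
print(consistent_datasets(published))                    # [(24, 31, 67)]

# Each statistic only known to within +/- 2: many datasets fit.
print(len(consistent_datasets(published, tolerance=2)))  # 125
```

With exact statistics the attacker recovers every age with certainty. With even a small amount of uncertainty, 125 different blocks are consistent with the release, and the attacker has to start weighing candidates against each other instead of reading off the answer.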

I mean, who knows.

Maybe the goal is to force the U.S. Census to publish statistics that actually enable re-identification, to help with future gerrymandering efforts? Or on the contrary, maybe the idea is to stop the publication of useful demographic data, to prevent researchers from showing unfair disparities among the population?

Hanlon's razor provides an alternative explanation. The fundamental privacy/utility trade-off inherent to statistical data releases is annoying. It would be a lot easier if publishing many statistics didn't automatically come with a high privacy risk. Differential privacy makes this trade-off explicit, and thus impossible to ignore. Maybe banning it is a way of pretending that the problem doesn't exist, in the hope that it will go away?

Thanks to Adam Sealfon, Aloni Cohen, Ben Jacobsen, and Gautam Kamath for helpful comments on earlier drafts of this post.
