That 16B password story (a.k.a. "data troll")

Original link: https://www.troyhunt.com/that-16-billion-password-story-aka-data-troll/

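Recent headlines claimed that a massive data breach had exposed 16 billion passwords, driving a surge of traffic to Have I Been Pwned (HIBP). Analysis shows the reality is far less sensational than the coverage. The data does not come from a single breach; it is a collection of "stealer logs", data harvested from individually compromised machines. The researcher Bob sent a subset of the data (775GB, 2.7 billion rows) to HIBP for analysis. The volume is large but heavily duplicated: the 2.7 billion rows yielded only 109 million unique email addresses, a 96% reduction from the raw row count. HIBP loaded the data under the name "Data Troll", which added 4.4 million *new* addresses; the other 96% were already known. Likewise, of the 231 million unique passwords found, 96% were already in HIBP's Pwned Passwords database. The inflated headlines stem from media sensationalism, which portrayed repackaged and often old data as a new, massive breach. Data exposures should be taken seriously, but this incident posed no new risk and did not deserve the attention it received.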

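The discussion on Hacker News centres on the value of a comprehensive database of breached passwords, prompted by the recent "16 billion passwords" story (dubbed "Data Troll"). Users point out that without a centralised, searchable record of past breaches, it is hard to tell whether a new data dump contains *new* passwords or is merely a reshuffle of previously leaked data. Troy Hunt's "Pwned Passwords" ([https://haveibeenpwned.com/passwords](https://haveibeenpwned.com/passwords)) is one such resource, but it is not a complete solution: it provides hashed passwords but cannot directly identify a repackaged breach. Breaches also often contain more than passwords, such as phone numbers and addresses, and a comprehensive database could help track that data too. The discussion highlights the need for better tools to analyse and understand the scope of data breaches beyond passwords alone.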

Original text

Spoiler: I have data from the story in the title of this post, it's mostly what I expected it to be, I've just added it to HIBP where I've called it "Data Troll", and I'm going to give everyone a lot more context below. Here goes:

Headlines one-upping each other on the number of passwords exposed in a data breach have become somewhat of a sport in recent years. Each new story wants to present a number that surpasses the previous story, and the clickbait cycle continues. You can see it coming a mile away, and you just know the reality is somewhat less than the headline, but how much less?

And so it was in June when a story with this title hit the headlines: 16 billion passwords exposed in record-breaking data breach. I thought this would be another standard run-of-the-mill sensational headline that would catch a few eyeballs for a couple of days then be forgotten, but no, apparently not. It started with a huge volume of interest in Have I Been Pwned:

That's Google searches for my "little" project, which I found odd, because we hadn't put any data in HIBP! But that initial story gained so much traction and entered the mainstream media to the extent that many publications directed people to HIBP, and inevitably, there was a bunch of searching done to figure out what the service actually was. And the news is still coming out - this story landed on AOL just last week:

You know it's serious because of all the red and exclamation marks... but per the article, "you don't need to panic" 🤷‍♂️

Enough speculating, let's get into what's actually in here, and for that, I went straight to the source:

Bob is a quality researcher who has been very successful over the years at sniffing out breached data, some of which had previously ended up in HIBP as a result of his good work. So we had a chat about this trove, and the first thing he made clear was that this isn't a single source of exposure, but rather different infostealer data sets that have been publicly exposed this year. The headlines implying this was a massive breach are misleading; stealer logs are produced from individually compromised machines and occasionally bundled up and redistributed. Bob also pointed out that many of the data sets were no longer exposed, and he didn't have a copy of all of them. But he did have a subset of the data he was happy to send over for HIBP, so let's analyse that.

All told, the data Bob sent contained 10 JSON files totalling 775GB across 2.7B rows. An initial cursory check against HIBP showed more than 90% of the email addresses were already in there, and of those that were in previous stealer logs, there was a high correlation of matching website domains. What I mean by this is that if the data Bob sent had someone's email address and password captured when logging into Netflix and Spotify, that person was probably already in HIBP's stealer logs against Netflix and Spotify. In other words, there's a lot of data we've seen before.

So, what do we make of all this, especially since the corpus Bob sent is about 17% of the reported 16B headline? Let me speak generally about how these data sets tend to have hyperbolic headlines, and the numbers of actual impact are way smaller:

- There's usually duplication across files, as the same data appears multiple times
- There's also often duplication within the same file, again, as the same data reappears
- A "row" is an instance of someone's email address and password listed next to a website they're logging onto, so 100 distinct rows may all be one person
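The three points above can be illustrated with a toy example. This is only a sketch under stated assumptions: the `Row` type, the `summarise` helper and the sample data are hypothetical and are not taken from the actual corpus or from HIBP's processing code.

```python
from typing import Iterable, NamedTuple


class Row(NamedTuple):
    """One hypothetical stealer log row: a website plus the credentials used on it."""
    website: str
    email: str
    password: str


def summarise(files: Iterable[Iterable[Row]]) -> dict[str, int]:
    """Count raw rows, unique rows, and unique email addresses across all files."""
    total_rows = 0
    unique_rows: set[Row] = set()
    for rows in files:
        for row in rows:
            total_rows += 1
            # Lower-case the address so "A@x.com" and "a@x.com" collapse together.
            unique_rows.add(row._replace(email=row.email.lower()))
    unique_emails = {row.email for row in unique_rows}
    return {
        "total_rows": total_rows,
        "unique_rows": len(unique_rows),
        "unique_emails": len(unique_emails),
    }


# Made-up data: two files, one person, two websites.
file_a = [
    Row("netflix.com", "alice@example.com", "hunter2"),
    Row("netflix.com", "alice@example.com", "hunter2"),  # duplicate within a file
    Row("spotify.com", "alice@example.com", "hunter2"),
]
file_b = [
    Row("netflix.com", "Alice@example.com", "hunter2"),  # duplicate across files
]

print(summarise([file_a, file_b]))
```

Here four raw rows collapse to two unique rows and a single person, which is the same shape of reduction, at a much smaller scale, that the headline numbers go through.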

The corpus of data I received contained 2.7B rows, of which I was able to extract 325M unique stealer log entries. That's the number of rows I could successfully parse out website, email address and password values from. In my earlier example with the one person's credentials captured for both Netflix and Spotify, that would mean two unique stealer log records. All of this then distilled down to 109M unique email addresses across all the files, and that's the number you'll now see in HIBP. In other words, 2.7B -> 109M is a 96% reduction from headline to people. Could we apply the same maths to the 16B headline? We'll never know for sure, but I betcha the decrease is even greater; I doubt additional corpuses to the tune of that many billion would continue to add new email addresses, and the duplication ratio would increase.
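For anyone who wants to check the arithmetic, here is a quick sketch using only the rounded figures quoted in this post; the `reduction` helper is a hypothetical name introduced for illustration.

```python
def reduction(before: float, after: float) -> float:
    """Fraction by which `after` is smaller than `before`."""
    return 1 - after / before


# Rounded figures as quoted in the post.
headline = 16e9          # passwords claimed in the headline
rows_received = 2.7e9    # rows in the subset Bob sent
unique_entries = 325e6   # rows with a parseable website, email and password
unique_emails = 109e6    # unique email addresses

print(f"share of headline received: {rows_received / headline:.0%}")
print(f"rows -> unique entries:     {reduction(rows_received, unique_entries):.0%}")
print(f"rows -> unique emails:      {reduction(rows_received, unique_emails):.0%}")
```

With these inputs the subset works out to about 17% of the headline figure, and going from rows to unique email addresses is a reduction of about 96%.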

Because it always comes up after loading stealer logs, a quick caveat:

Not all email addresses loaded into this breach will contain corresponding stealer log entries. This is because we have one process to regex out all the addresses (the code is open source), and another process that pulls rows with email addresses against valid websites and passwords.
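A rough sketch of why the two processes produce different sets follows. Everything in it is an assumption made for illustration: the regular expression is a deliberately simple pattern and not the one in HIBP's open source code, and the `url|email|password` line format is hypothetical (the real files were JSON).

```python
import re
from urllib.parse import urlparse

# Simplified pattern for illustration only; not HIBP's actual regex.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_addresses(lines):
    """Process 1: pull every email-looking string out of the raw text."""
    found = set()
    for line in lines:
        found.update(match.lower() for match in EMAIL_RE.findall(line))
    return found


def parse_entries(lines):
    """Process 2: keep only rows with a valid website, email address and password."""
    entries = set()
    for line in lines:
        parts = line.strip().split("|")
        if len(parts) != 3:
            continue  # not in the assumed url|email|password shape
        url, email, password = (part.strip() for part in parts)
        host = urlparse(url).hostname
        if host and EMAIL_RE.fullmatch(email) and password:
            entries.add((host, email.lower(), password))
    return entries


# Made-up input lines.
raw = [
    "https://netflix.com/login|alice@example.com|hunter2",
    "contact bob@example.com for details",      # address only, no website or password
    "https://spotify.com|carol@example.com|",   # empty password
]

addresses = extract_addresses(raw)
entries = parse_entries(raw)
without_entries = addresses - {email for _, email, _ in entries}
print(sorted(without_entries))
```

In this toy input, three addresses are extracted but only one row survives as a full entry, so two addresses would be searchable without any corresponding stealer log entry.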

And because I'll end up copying and pasting this over and over again in responses to queries, another caveat:

Presence in a stealer log is often an indicator of an infected device, but we have no data to indicate when it was infected. There will be a lot of old data in here, just as there's a lot of repackaged data.

Of the passwords in valid stealer log entries, there were 231M unique ones, and we'd seen 96% of them before. Those are now all in Pwned Passwords with updated prevalence counts and are searchable via the website and, of course, via the API. Speaking of which, those passwords are presently being searched a lot:
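As a reminder of how an API search works without sending the password itself, here is a minimal sketch of a client for the Pwned Passwords range endpoint. The endpoint takes the first five characters of the password's SHA-1 hash and returns matching hash suffixes with prevalence counts, one `SUFFIX:COUNT` pair per line. The function names and the `User-Agent` string are hypothetical choices, and error handling is omitted.

```python
import hashlib
from urllib.request import Request, urlopen


def split_hash(password: str) -> tuple[str, str]:
    """Return the 5-character prefix and 35-character suffix of the SHA-1 hex digest."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]


def count_in_range(suffix: str, body: str) -> int:
    """Find a hash suffix in a range response made of `SUFFIX:COUNT` lines."""
    for line in body.splitlines():
        candidate, _, count = line.partition(":")
        if candidate.strip().upper() == suffix:
            return int(count)
    return 0  # suffix not present in this range


def pwned_count(password: str) -> int:
    """Query the live API; only the five-character hash prefix is sent."""
    prefix, suffix = split_hash(password)
    request = Request(
        f"https://api.pwnedpasswords.com/range/{prefix}",
        headers={"User-Agent": "pwned-passwords-sketch"},  # hypothetical client name
    )
    with urlopen(request, timeout=10) as response:
        return count_in_range(suffix, response.read().decode("utf-8"))
```

Calling `pwned_count("password")` would make a network request for the `5BAA6` range and return the prevalence count listed against that password's suffix, or 0 if it is not in the range.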

Of the 109M email addresses we could parse out of the corpus, 96% of them were already in HIBP (that number coincidentally matches the percentage of existing passwords we track). They weren't all from previous stealer logs, of course, but anecdotally, during my testing, I found a lot of crossover between this one and the ALIEN TXTBASE logs from earlier this year. Regardless, we added 4.4M new addresses from Data Troll that we'd never seen before, so that alone is significant. Not significant enough to justify hyperbolic headlines to the effect of "biggest ever", but still sizeable.

To summarise:

- The 16B headline distils down to a much smaller number of unique values of actual impact
- The data is largely from stealer logs that have been circulating for some time now
- It's certainly not fresh and doesn't pose any new risks that weren't already present

And lastly, there's that "Data Troll" title. When I first saw this story getting so much traction, the image I had in my mind was of a troll sitting on stashes of data. The mass media then picked this up and turned it into deliberately provocative headlines, manipulating the narrative to seek attention. Hopefully, this post tempers all that a little bit and brings some sanity back into the discussion. We need to take data exposures like this seriously, but it certainly didn't deserve the attention it got.
