Stop Publishing Garbage Data, It's Embarrassing

Original link: https://successfulsoftware.net/2026/03/29/stop-publishing-garbage-data-its-embarrassing/

Recent data-quality problems have raised concerns about trust in institutions, and about the "slop-apocalypse" that could follow from training large language models on unchecked data. The author found glaring errors in two UK datasets: the government's fuel price data, which contains locations in the middle of the ocean and wildly inaccurate price ratios, and a report on electric vehicles from the motoring organization the RAC, whose chart of vehicle numbers is badly wrong. Although the author reported the fuel data problem on 22 March, the erroneous data was still publicly available a week later. The errors likely stem from unverified user submissions and a lack of basic checks, and highlight a worrying trend. The author stresses the need for rigorous proofreading, code testing, and data validation to maintain data integrity and prevent the spread of misinformation. Ultimately, in an increasingly data-driven world, taking pride in one's work and ensuring accuracy is essential.

## Bad Data and the Value of Publishing It

A recent Hacker News discussion centered on the public release of obviously flawed data (for example, the fuel price data and a chart with the decimal point in the wrong place). The core debate was whether institutions should guarantee accuracy *before* publishing, or whether releasing imperfect data is still valuable.

Many commenters argued that while clean data is the ideal, demanding perfection often means *no* data gets published at all. They stressed the importance of providing methodology and caveats even for incomplete data, and highlighted the benefit of letting users clean and validate the data themselves. Others argued that basic sanity checks should be in place to catch obvious errors.

A key point was the cost of truly "clean" data: it requires substantial human labour, and filtering can itself introduce bias. Several users shared experiences of spending large amounts of time cleaning data only for the project to be abandoned. Ultimately, the consensus leaned toward publishing data with known issues, cultivating the skills to work with *imperfect* information, and prioritizing transparency over unattainable perfection. The original post's title was even changed from its harsher stance ("stop publishing garbage data") to a milder one.

## Original article

Twice this week, I have come across embarrassingly bad data.

The first instance is the UK government's fuel finder data. This is a downloadable CSV file of fuel station locations and prices from around the UK. A potentially very useful database, especially during the current conflict in the Middle East. A customer suggested it as a possible practice dataset for my data wrangling and visualization software, Easy Data Transform. So I had a quick look and spotted some glaring errors within a few minutes.

A quick plot of the latitude and longitude shows some clear outliers:

On further investigation, some of these UK fuel stations are apparently located in the Indian and South Atlantic oceans. In at least one case, it looks like they got the latitude and longitude the wrong way around.
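The kind of check that would catch these outliers is trivial to write. The sketch below is a hypothetical example (the real CSV's column layout is not shown in the article): it tests each station's coordinates against a rough UK bounding box, and flags the swapped-axes case the author describes by testing whether the point only fits with latitude and longitude exchanged.

```python
# Hypothetical coordinate sanity check; the bounding box is an
# approximation of the UK's extent, not an official definition.
UK_LAT = (49.8, 60.9)   # approximate UK latitude range, degrees north
UK_LON = (-8.7, 1.8)    # approximate UK longitude range, degrees east

def _in_range(value, bounds):
    lo, hi = bounds
    return lo <= value <= hi

def check_station(lat, lon):
    """Return a status string for one station's coordinates."""
    if _in_range(lat, UK_LAT) and _in_range(lon, UK_LON):
        return "ok"
    # If the point only makes sense with the axes exchanged, the
    # submitter probably put latitude in the longitude field.
    if _in_range(lon, UK_LAT) and _in_range(lat, UK_LON):
        return "lat/lon probably swapped"
    return "outside the UK"

print(check_station(51.5, -0.1))   # central London -> "ok"
print(check_station(-0.1, 51.5))   # swapped -> flagged
print(check_station(-30.0, 60.0))  # South Atlantic -> "outside the UK"
```

A check like this, run before publication, would have caught both the mid-ocean stations and the swapped-coordinate case in seconds.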

A quick look at the fuel price columns also shows some major issues:

The ratio between the most expensive and cheapest fuel (per litre) is 1538:1. Clearly wrong.

Shown as a histogram with a logarithmic Y axis:
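A basic range check on the price columns is equally cheap. This is a sketch under assumed column semantics (prices in pence per litre, which the real file may or may not use): it computes the max/min ratio and flags the column if the spread exceeds a plausible threshold.

```python
def price_ratio_check(prices_pence, max_ratio=3.0):
    """Return (ratio, ok) for a column of fuel prices.

    max_ratio=3.0 is an illustrative threshold: real UK pump prices
    vary far less than 3x, so anything above it signals bad data.
    """
    valid = [p for p in prices_pence if p > 0]
    ratio = max(valid) / min(valid)
    return ratio, ratio <= max_ratio

# Mostly ~140p/litre, plus one entry mistakenly in pounds (1.40) and
# one in tenths of a penny (1400) -- the kind of unit slip that could
# produce the 1538:1 spread the article found.
ratio, ok = price_ratio_check([139.9, 142.7, 1.40, 1400.0])
print(round(ratio), ok)  # 1000 False
```

Against the published file, this one-liner-grade check would have reported the 1538:1 ratio immediately and blocked publication until the offending rows were reviewed.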

I am guessing that the reason for this bad data is that the fuel stations are submitting their own data and, humans being humans, they make mistakes. But then the government is publishing the data without even the most basic checks. That just isn’t good enough.

I reported the problem on 22-Mar-2026. They acknowledged my email on 24-Mar-2026 ("Thank you for sharing this, we have passed this on to the technical team to have a look at."). The CSV file published on 29-Mar-2026 still has the garbage data.

The second instance is a report on electric cars from UK motoring organization, the RAC. The first graph in the article is this:

Did the number of Battery Electric Vehicles on the UK's roads suddenly drop from ~1.4 million in 2024 to ~0.0017 million in 2025? What happened to those ~1.4 million vehicles? I'm guessing that someone got their thousands and millions mixed up. But then they published the report with this glaring error. Did anyone mathematically literate even check this graph?

Lousy data undermines trust in institutions and can lead to bad decisions. I fear we are heading for a future where LLMs generate data, which people don't bother to properly check. This data is then used to train LLMs. The error is then much harder to spot once it is served back by LLMs without the original source. A slop-apocalypse.

Authors should have their work proofread, programmers should test their code, and data people should do basic data validation. Let's take some pride in our work.
