Stop publishing garbage data, it's embarrassing
Twice this week, I have come across embarrassingly bad data

Original link: https://successfulsoftware.net/2026/03/29/stop-publishing-garbage-data-its-embarrassing/

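Recent data quality problems raise concerns about trust in institutions, and about a possible "slop-apocalypse" caused by training large language models on unchecked data. The author found obvious errors in two UK datasets: the government's fuel price data, which contains locations in the ocean and a wildly implausible ratio between prices; and a report on electric cars from the motoring organization RAC, whose graph of vehicle numbers is badly wrong. Although the author reported the fuel data problem on 22 March, the bad data was still publicly available a week later. The errors probably stem from unverified user submissions and a lack of basic checks, which points to a worrying trend. The author stresses the need for careful proofreading, code testing, and data validation to preserve data integrity and stop false information from spreading. In an increasingly data-driven world, taking pride in one's work and getting it right matters.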

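A Hacker News discussion highlights the growing problem of poor data quality, with the poster embarrassed by the publication of "garbage data". Many commenters agree, stressing that clean data is expensive to obtain: it requires thorough manual verification (for example, source data verification in clinical trials), not just spotting outliers. The discussion emphasizes how much data matters to an organization's image; credible metrics are essential for maintaining trust and securing business. The conversation extends to projects shown on Hacker News ("Show HN"), where people raise concerns about inaccurate documentation and even about creators who do not fully understand their own code. The prevailing view favors accountability: publishing false information, even unintentionally, is unacceptable. A final comment notes that low-quality, AI-driven projects are flooding the site for self-promotion.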

Original text

Twice this week, I have come across embarrassingly bad data.

The first instance is the UK government’s fuel finder data. This is a downloadable CSV file of fuel station locations and prices from around the UK. A potentially very useful database, especially during the current conflict in the Middle East. A customer suggested it as a possible practice dataset for my data wrangling and visualization software, Easy Data Transform. So I had a quick look and spotted some glaring errors within a few minutes.

A quick plot of the latitude and longitude shows some clear outliers:

On further investigation, some of these UK fuel stations are apparently located in the Indian and South Atlantic oceans. In at least one case, it looks like they got the latitude and longitude the wrong way around.
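
A check of this kind takes only a few lines of code. The following is a minimal sketch, not the author's method: the bounding box is a deliberately generous approximation of the UK, and the sample coordinates are hypothetical rather than taken from the real file. A pair that only falls inside the box once its two values are exchanged is labelled as probably swapped.

```python
# Approximate bounding box for the UK (an assumption, deliberately generous).
UK_LAT = (49.0, 61.0)
UK_LON = (-9.0, 2.0)


def in_uk(lat: float, lon: float) -> bool:
    """Return True if the point falls inside the rough UK bounding box."""
    return UK_LAT[0] <= lat <= UK_LAT[1] and UK_LON[0] <= lon <= UK_LON[1]


def classify_location(lat: float, lon: float) -> str:
    """Label a coordinate pair as 'ok', 'swapped' or 'out_of_bounds'."""
    if in_uk(lat, lon):
        return "ok"
    # If exchanging the two values lands inside the box, the pair was
    # probably entered the wrong way around.
    if in_uk(lon, lat):
        return "swapped"
    return "out_of_bounds"


# Hypothetical sample rows, not taken from the real file.
samples = [
    (51.5, -0.12),   # roughly central London
    (-0.12, 51.5),   # the same point with latitude and longitude exchanged
    (-40.0, -20.0),  # somewhere in the South Atlantic
]
labels = [classify_location(lat, lon) for lat, lon in samples]
print(labels)
```

A bounding box will not catch a station placed in the wrong UK town, but it is enough to catch stations placed in the wrong ocean.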

A quick look at the fuel price columns also shows some major issues:

The ratio between the most expensive and cheapest fuel (per litre) is 1538:1. Clearly wrong.
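
A range check on the price column is similarly cheap. The sketch below is illustrative only: the function name, the `max_factor` threshold, and the sample prices (assumed here to be in pence per litre) are all hypothetical and not taken from the real file. It reports the max/min ratio and flags any value that is far from the median.

```python
from statistics import median


def price_report(prices, max_factor=3.0):
    """Return the max/min ratio and the values far from the median.

    max_factor is an arbitrary illustrative threshold: a price is flagged
    when it is more than max_factor times the median, or less than the
    median divided by max_factor. Non-positive prices are always flagged.
    """
    positive = [p for p in prices if p > 0]
    mid = median(positive)
    ratio = max(positive) / min(positive)
    suspects = [
        p for p in prices
        if p <= 0 or p > mid * max_factor or p < mid / max_factor
    ]
    return ratio, suspects


# Hypothetical prices in pence per litre, not taken from the real file.
prices = [139.9, 142.9, 145.9, 151.9, 1.45, 1499.0]
ratio, suspects = price_report(prices)
print(f"max/min ratio: {ratio:.0f}:1")
print("suspect values:", suspects)
```

Comparing against the median rather than the mean keeps the check usable even when a few extreme values are present, since the median barely moves.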

Shown as a histogram with a logarithmic Y axis:

I am guessing that the reason for this bad data is that the fuel stations are submitting their own data and, humans being humans, they make mistakes. But then the government is publishing the data without even the most basic checks. That just isn’t good enough.

I reported the problem on 22-Mar-2026. They acknowledged my email on 24-Mar-2026 (“Thank you for sharing this, we have passed this on to the technical team to have a look at.”). The CSV file published on 29-Mar-2026 still has the garbage data.

The second instance is a report on electric cars from the UK motoring organization, the RAC. The first graph in the article is this:

Did the number of Battery Electric Vehicles on the UK’s roads suddenly drop from ~1.4 million in 2024 to ~0.0017 million in 2025? What happened to those ~1.4 million vehicles? I’m guessing that someone got their thousands and millions mixed up. But then they published the report with this glaring error. Did anyone mathematically literate even check this graph?
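
A simple year-on-year sanity check would flag a jump like this before publication. The sketch below is an illustration under stated assumptions: the function name and the `max_factor` threshold are hypothetical, and the two values are the approximate figures quoted above, in millions of vehicles.

```python
def implausible_changes(series, max_factor=10.0):
    """Return (previous_year, year, factor) for suspicious consecutive values.

    A pair is flagged when the larger value is more than max_factor times
    the smaller one. max_factor is an arbitrary illustrative threshold.
    Pairs containing a non-positive value are flagged with factor None.
    """
    years = sorted(series)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        a, b = series[prev], series[cur]
        if a <= 0 or b <= 0:
            flagged.append((prev, cur, None))
            continue
        factor = max(a, b) / min(a, b)
        if factor > max_factor:
            flagged.append((prev, cur, factor))
    return flagged


# Approximate values quoted in the text, in millions of vehicles.
bev_millions = {2024: 1.4, 2025: 0.0017}
flagged = implausible_changes(bev_millions)
for prev, cur, factor in flagged:
    print(f"{prev} -> {cur}: values differ by a factor of about {factor:.0f}")
```

A factor in the high hundreds, close to 1000, is consistent with the guess that thousands and millions were mixed up, although the check itself cannot confirm the cause.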

Lousy data undermines trust in institutions and can lead to bad decisions. I fear we are heading for a future where LLMs generate data, which people don’t bother to properly check. This data is then used to train LLMs. The error is then much harder to spot once it is served back without the original source by LLMs. A slop-apocalypse.

Authors should have their work proofread, programmers should test their code, and data people should do basic data validation. Let’s take some pride in our work.
