You Wouldn't Download a Hacker News

原始链接: https://www.jasonthorsness.com/25

The author successfully downloaded the complete Hacker News archive (a 20 GiB JSON file) using a custom Go API client. The client, originally written for hn.unlurker.com, grew into a full data dump. Once the download finished, they analyzed it with DuckDB, a fast, embeddable analytics engine. DuckDB let them query the data efficiently, from counting occurrences of "correct horse battery staple" to producing time-series visualizations. Specifically, they computed 12-week moving averages of the fraction of comments and stories mentioning programming languages (Python, JavaScript, Java, Ruby, Rust) and databases (Postgres, MySQL, Mongo, Redis, SQLite), revealing trends over time. They found DuckDB an excellent tool for a dataset of this size. The author closes with a joke about training LLM bots on the archive and invites readers to explore their other work.

The Hacker News thread discusses the author's download of the entire HN archive and its potential use for training AI models. This sparked debate about what it means to feed human-written data to AI models, with some worrying it will breed spam bots. Several commenters suggested analyzing the publicly available HN datasets on BigQuery or ClickHouse instead of scraping the API. The conversation then turned to managing online trust in the AI era, including novel ways to verify that contributors are human. Others debated the value and influence of upvotes and downvotes, and whether AI-generated content will eventually dominate online discussion, blurring the line between human and machine interaction. Finally, commenters dug into the analysis itself and what it reveals, offering suggestions for improving the data and methodology in future posts.

Original Text

TLDR: I Did Download It

And now I can analyze it with DuckDB. Behold the fraction of total comments and stories referencing key topics over time!

[Two charts of 12-week moving averages, weekly from 5/14/2007 to 5/14/2024. "The Rise Of Rust": avg_python_12w, avg_javascript_12w, avg_java_12w, avg_ruby_12w, avg_rust_12w. "The Progression of Postgres": avg_mysql_12w, avg_postgres_12w, avg_mongo_12w, avg_redis_12w, avg_sqlite_12w.]

Part 1: The Mods Are Asleep, Download It All

As part of building hn.unlurker.com, I wrote an HN API client. There are already a bunch of other clients, but I wanted to try the latest Go features and linters on a new project. I’m glad I did; it was a lot of fun.

The client can retrieve active items, lists of items, etc. (comments and stories are called “items” in the HN API). Although I only really needed recent items for my project, for completeness I added “scan” which downloads all the items, in order, from zero to the latest or the other way around.
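
The full scan is conceptually tiny: the official API exposes maxitem.json and item/{id}.json endpoints, so downloading everything amounts to walking sequential IDs. Here is a minimal sketch of that loop in Go (my illustration, not the actual client; real code would add concurrency, retries, and resumption):

package main

import (
	"encoding/json"
	"fmt"
	"net/http"
)

const base = "https://hacker-news.firebaseio.com/v0"

// maxItem fetches the current highest item ID from the API.
func maxItem() (int, error) {
	resp, err := http.Get(base + "/maxitem.json")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	var id int
	err = json.NewDecoder(resp.Body).Decode(&id)
	return id, err
}

func main() {
	top, err := maxItem()
	if err != nil {
		panic(err)
	}
	// Walk every item ID in ascending order. Each response is a single
	// JSON object, or "null" for missing/deleted items.
	for id := 1; id <= top; id++ {
		resp, err := http.Get(fmt.Sprintf("%s/item/%d.json", base, id))
		if err != nil {
			panic(err)
		}
		var raw json.RawMessage
		err = json.NewDecoder(resp.Body).Decode(&raw)
		resp.Body.Close()
		if err == nil && string(raw) != "null" {
			fmt.Println(string(raw)) // one item per line: ND-JSON
		}
	}
}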

I wondered — could I just download the whole thing? Extrapolating from a few thousand items, it would only be tens of GiB of JSON. I thought I’d give it a try.
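
A quick sanity check on that extrapolation (my arithmetic, not the author's): item IDs run into the tens of millions, so at roughly 500 bytes of JSON per item, 40 million items works out to about 20 GB, consistent with what the download actually produced.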

hn scan --no-cache --asc -c- -o full.json

I had to CTRL-C a stalled download a few times, but scan is resumable so after a few hours I was done. I had a 20 GiB JSON file of everything that has ever happened on Hacker News, and I can just re-run the command above to “top it off” any time I need the latest. But what could I do with it?
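
Because the output is ordered ND-JSON, resuming is cheap: all a restarted scan needs is the last item ID already on disk. One plausible way to recover it (an assumption on my part, not necessarily how the hn client does it):

package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"os"
)

// lastID returns the "id" field of the final line of an ND-JSON dump,
// so a scan can pick up again at lastID+1.
func lastID(path string) (int, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, err
	}
	defer f.Close()
	var last []byte
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // some items are very long lines
	for sc.Scan() {
		last = append(last[:0], sc.Bytes()...) // keep a copy of the latest line
	}
	if err := sc.Err(); err != nil {
		return 0, err
	}
	var item struct {
		ID int `json:"id"`
	}
	if err := json.Unmarshal(last, &item); err != nil {
		return 0, err
	}
	return item.ID, nil
}

func main() {
	id, err := lastID("full.json")
	if err != nil {
		panic(err)
	}
	fmt.Println("resume from", id+1)
}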

Part 2: Feed The Ducks

First I just grepped for things. How many times has the phrase “correct horse battery staple” appeared on the site? Quite a few: 231 times (the last one just today). But grepping stuff is old news, so I thought I’d try out DuckDB.
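
For the curious, that grep-style count needs nothing fancier than a streaming pass over the 20 GiB file. A Go equivalent (my sketch; the path and phrase come from the article, and like grep it counts matching lines, i.e. items):

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/home/jason/full.json")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	count := 0
	sc := bufio.NewScanner(f)
	sc.Buffer(make([]byte, 0, 1<<20), 1<<20) // some items are very long lines
	for sc.Scan() {
		// case-insensitive substring match over the raw JSON line
		if strings.Contains(strings.ToLower(sc.Text()), "correct horse battery staple") {
			count++
		}
	}
	if err := sc.Err(); err != nil {
		panic(err)
	}
	fmt.Println(count)
}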

In the database world, DuckDB is unique: a super-fast embeddable analytics execution engine also available as a command-line tool. I spend most of my day wrangling a different database (there’s the plug my coworkers will be looking for) but I’ve been meaning to try DuckDB and it seemed perfect for this one-off task.

As it turns out, with their new UI for novices like me, it’s a breeze to use. AND LLMs are pretty good at helping craft the SQL queries. I just had to import the data:

CREATE TABLE items AS
SELECT *
FROM read_json_auto('/home/jason/full.json',
                    format='nd',      -- newline-delimited: one JSON item per line
                    sample_size=-1);  -- infer the schema from every row, not a sample

Then query it. Here’s a 12-week moving average of the fraction of total items containing the terms I am interested in:

-- Weekly share of items mentioning each term, followed by a 12-week
-- trailing moving average of each share.
WITH weekly AS (
  SELECT
    DATE_TRUNC('week', TO_TIMESTAMP(time)) AS week_start,  -- 'time' is a Unix epoch
    -- fraction of the week's items whose text matches each pattern
    -- (note that '%java%' also matches "javascript")
    COUNT(*) FILTER (WHERE text ILIKE '%python%')::float     / NULLIF(COUNT(*),0)
      AS python_prop,
    COUNT(*) FILTER (WHERE text ILIKE '%javascript%')::float / NULLIF(COUNT(*),0)
      AS javascript_prop,
    COUNT(*) FILTER (WHERE text ILIKE '%java%')::float       / NULLIF(COUNT(*),0)
      AS java_prop,
    COUNT(*) FILTER (WHERE text ILIKE '%ruby%')::float       / NULLIF(COUNT(*),0)
      AS ruby_prop,
    COUNT(*) FILTER (WHERE text ILIKE '%rust%')::float       / NULLIF(COUNT(*),0)
      AS rust_prop
  FROM items
  GROUP BY week_start
)
SELECT
  week_start,
  AVG(python_prop)     OVER (
    ORDER BY week_start
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW  -- this week plus the previous 11
  ) AS avg_python_12w,
  AVG(javascript_prop) OVER (
    ORDER BY week_start
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS avg_javascript_12w,
  AVG(java_prop)       OVER (
    ORDER BY week_start
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS avg_java_12w,
  AVG(ruby_prop)       OVER (
    ORDER BY week_start
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS avg_ruby_12w,
  AVG(rust_prop)       OVER (
    ORDER BY week_start
    ROWS BETWEEN 11 PRECEDING AND CURRENT ROW
  ) AS avg_rust_12w
FROM weekly
ORDER BY week_start;

Overall DuckDB seems really great for analyzing data sets of this size.

Next Steps

Now that I have a local download of all Hacker News content, I can train hundreds of LLM-based bots on it and run them as contributors, slowly and inevitably replacing all human text with the output of a Chinese room oscillator perpetually echoing and recycling the past.

Or alternatively, I think for this project I am done. Someone else will have to take it to the next logical step.

Thanks for reading! Please check out hn.unlurker.com, take a look at my other articles, or find me on X.
