Mr. Chatterbox is a Victorian-era ethically trained model

Original link: https://simonwillison.net/2026/Mar/30/mr-chatterbox/

In March 2026, Trip Venturella released "Mr. Chatterbox", a language model trained exclusively on more than 28,000 public-domain Victorian-era (1837-1899) books from the British Library. The model has 340 million parameters (comparable to GPT-2 Medium) and aims to demonstrate the potential of building LLMs without relying on scraped, unlicensed data.

For now, Mr. Chatterbox's replies, though distinctively Victorian in flavor, are fairly basic: more like a Markov chain than a sophisticated LLM. The developer acknowledges that more training data (an estimated 7+ billion tokens) would be needed for genuine conversational ability. Despite those limitations, the project is seen as a promising step toward building LLMs entirely on public-domain resources.

A plugin called "llm-mrchatterbox" lets users run the 2.05GB model locally with the LLM framework, and its creation was largely automated with the help of Claude Code. You can try the demo [here](link to the HuggingFace Spaces demo).

## Mr. Chatterbox and LLM training ethics - summary

"Mr. Chatterbox", a new ethically trained LLM covered on simonwillison.net, sparked discussion on Hacker News. The model's distinguishing feature is that it was trained only on public-domain works: specifically, books published before 1899, to avoid copyright issues.

The conversation quickly broadened to the challenge of sourcing training data. Users highlighted the potential of using LLMs to digitize vast archives such as the UK National Archives, and pointed to existing projects like TimeCapsuleLLM. Concerns were raised that copyright restrictions can persist even within a pre-1900 window (UK copyright lasts for 70 years after the author's death).

A key debate centered on what "ethical" training means. While the approach is legally compliant, some argued that using data without the original authors' explicit consent remains ethically questionable. Others countered that copyright itself is problematic, and that authors could hardly have anticipated controlling their works centuries later. Early testing showed the model, at only 340 million parameters, struggles to hold a coherent conversation, suggesting that data volume and model size are critical.

Original article

30th March 2026

Trip Venturella released Mr. Chatterbox, a language model trained entirely on out-of-copyright text from the British Library. Here’s how he describes it:

Mr. Chatterbox is a language model trained entirely from scratch on a corpus of over 28,000 Victorian-era British texts published between 1837 and 1899, drawn from a dataset made available by the British Library. The model has absolutely no training inputs from after 1899 — the vocabulary and ideas are formed exclusively from nineteenth-century literature.

Mr. Chatterbox’s training corpus was 28,035 books, with an estimated 2.93 billion input tokens after filtering. The model has roughly 340 million parameters, about the same size as GPT-2-Medium. The difference is, of course, that unlike GPT-2, Mr. Chatterbox is trained entirely on historical data.

Given how hard it is to train a useful LLM without using vast amounts of scraped, unlicensed data I’ve been dreaming of a model like this for a couple of years now. What would a model trained on out-of-copyright text be like to chat with?

Thanks to Trip we can now find out for ourselves!

The model itself is tiny, at least by Large Language Model standards—just 2.05GB on disk. You can try it out using Trip’s HuggingFace Spaces demo:

Screenshot of a Victorian-themed chatbot interface titled "🎩 Mr. Chatterbox (Beta)" with subtitle "The Victorian Gentleman Chatbot". The conversation shows a user asking "How should I behave at dinner?" with the bot replying "My good fellow, one might presume that such trivialities could not engage your attention during an evening's discourse!" The user then asks "What are good topics?" and the bot responds "The most pressing subjects of our society— Indeed, a gentleman must endeavor to engage the conversation with grace and vivacity. Such pursuits serve as vital antidotes against ennui when engaged in agreeable company." A text input field at the bottom reads "Say hello..." with a send button. The interface uses a dark maroon and cream color scheme.

Honestly, it’s pretty terrible. Talking with it feels more like chatting with a Markov chain than an LLM—the responses may have a delightfully Victorian flavor to them but it’s hard to get a response that usefully answers a question.

The 2022 Chinchilla paper suggests a compute-optimal ratio of roughly 20 training tokens per parameter. For a 340m-parameter model that works out to around 7 billion tokens, more than twice the British Library corpus used here. The smallest Qwen 3.5 model is 600m parameters and that model family starts to get interesting at 2b, so my hunch is we would need 4x or more the training data to get something that starts to feel like a useful conversational partner.
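
That back-of-the-envelope estimate is easy to check in a few lines of Python. The 20-tokens-per-parameter ratio is the Chinchilla rule of thumb, not an exact law:

```python
# Chinchilla rule of thumb: compute-optimal training uses ~20 tokens per parameter.
def chinchilla_optimal_tokens(params: int, ratio: int = 20) -> int:
    return params * ratio

params = 340_000_000    # Mr. Chatterbox's parameter count
corpus = 2_930_000_000  # British Library corpus after filtering

optimal = chinchilla_optimal_tokens(params)
print(f"Chinchilla-optimal tokens: {optimal / 1e9:.1f}B")   # 6.8B
print(f"Corpus covers {corpus / optimal:.0%} of that")      # 43%
```

So the 2.93 billion token corpus is a little under half of what the rule of thumb calls for at this model size.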

But what a fun project!

## Running it locally with LLM

I decided to see if I could run the model on my own machine using my LLM framework.

I got Claude Code to do most of the work—here’s the transcript.

Trip trained the model using Andrej Karpathy’s nanochat, so I cloned that project, pulled the model weights and told Claude to build a Python script to run the model. Once we had that working (which ended up needing some extra details from the Space demo source code) I had Claude read the LLM plugin tutorial and build the rest of the plugin.

llm-mrchatterbox is the result. Install the plugin like this:

```
llm install llm-mrchatterbox
```

The first time you run a prompt it will fetch the 2.05GB model file from Hugging Face. Try that like this:

```
llm -m mrchatterbox "Good day, sir"
```

Or start an ongoing chat session like this:

```
llm chat -m mrchatterbox
```

If you don’t have LLM installed you can still get a chat session started from scratch using uvx like this:

```
uvx --with llm-mrchatterbox llm chat -m mrchatterbox
```

When you are finished with the model you can delete the cached file using:

```
llm mrchatterbox delete-model
```

This is the first time I’ve had Claude Code build a full LLM model plugin from scratch and it worked really well. I expect I’ll be using this method again in the future.

I continue to hope we can get a useful model from entirely public domain data. The fact that Trip was able to get this far using nanochat and 2.93 billion training tokens is a promising start.
