Why do AI models use so many em-dashes?

Original link: https://www.seangoedecke.com/em-dashes/

AI-generated content is noticeably riddled with em-dashes, so much so that human writers now avoid them for fear of being mistaken for AI. Despite many attempts, it is surprisingly hard to prompt AI models *not* to use them, which raises a puzzle: why the overuse? The obvious theories (that em-dashes are common in English, that they keep next-token prediction flexible, or that they save tokens) are unconvincing. A compelling explanation centers on a change in training data. Early models such as GPT-3.5 trained mainly on public internet data and pirated e-books; later models such as GPT-4o benefited from the large-scale digitization of *print* books. Crucially, the author argues these digitized books skew toward late-19th- and early-20th-century literature, which used em-dashes far more often than modern writing. That influx of older text likely "infected" the models with the punctuation habit. Other factors, such as human feedback (RLHF), may play a role, but the shift in training data seems the most likely cause. Ultimately, confirmation would require more insight from inside the AI labs.

## AI and Em-Dashes: Summary

A recent Hacker News discussion explored why AI models use the em-dash (—) so often in generated text. The leading theory is that models learned this stylistic quirk from their training data, particularly high-prestige publications such as The Atlantic and The New Yorker, which use em-dashes frequently. Some speculate that a preference for polished writing, reinforced through reinforcement learning from human feedback (RLHF), also contributes. Interestingly, the overuse is backfiring: readers now associate em-dashes with AI-generated content, leading some writers to avoid them entirely. There was also discussion of potential "watermarking", i.e. deliberately including patterns like em-dashes to identify AI-generated text. Other contributing factors mentioned include the em-dash's historical use in literature, the influence of specific authors (such as Paul Graham), and even cross-language differences (Russian uses dashes for dialogue attribution). Ultimately, the discussion highlights how AI's imitation of human writing styles is, ironically, changing those styles themselves.

Original text

If you asked most people to name a defining feature of AI-generated writing, they’d probably say the em-dash — like this. Language models use em-dashes so much that real humans who like em-dashes have stopped using them out of fear of being confused with AI. It’s also surprisingly hard to prompt models to avoid em-dashes: take this thread from the OpenAI forums where users share their unsuccessful attempts. Given all that, it’s kind of weird that we don’t really know why language models use the em-dash so much.

Explanations I don’t find convincing

One common explanation is that normal English text contains a lot of em-dashes, so it’s just learned behavior from the training data. I find this fairly unconvincing, for the reason that everyone thinks AI uses a lot of em-dashes. If em-dashes were as common in AI prose as human prose, they would be as unremarkable as the use of other punctuation marks.

Another explanation I’m not convinced by is that AI models like em-dashes because they’re so versatile. When the model is trying to predict the next token, an em-dash keeps its options open: it could either continue on the same point or make a brand new point. Since models are just trying to pick the next most likely token, could they just be “playing it safe” by using em-dashes? I don’t think so. First, other punctuation marks are similarly flexible. Second, I’m not sure that “playing it safe” is a good idiom for thinking about how models generate text.

Other people have argued that AI models use em-dashes because model training explicitly biases for brevity, and em-dashes are very token-efficient. From what I can tell by playing with the OpenAI tokenizer, the em-dash itself isn’t inherently more efficient, but plausibly without it you’d have to write some connective tissue like ”, therefore”. Still, I don’t buy this. Many em-dashes (e.g. the common “it’s not X - it’s Y” pattern) could simply be replaced with a comma, which is equally brief. I also don’t think GPT-4o is so brevity-focused that it’s doing micro-optimizations around punctuation like this: if it wanted to use fewer tokens, it could simply waffle less.
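The comma substitution described above can be sketched mechanically. This is a hypothetical helper, not anything from the post; verifying actual token counts would require a real tokenizer (e.g. OpenAI's tiktoken), which this sketch deliberately avoids:

```python
import re

# Illustrative rewrite of the common "it's not X - it's Y" pattern:
# a comma does the same connective work as the em-dash at similar length.
def dash_to_comma(text: str) -> str:
    """Replace each em-dash (and any surrounding spaces) with ', '."""
    return re.sub(r"\s*\u2014\s*", ", ", text)

print(dash_to_comma("It's not a bug\u2014it's a feature."))
# → It's not a bug, it's a feature.
```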

Could em-dash use be RLHF-ed in from African English?

One theory I spent more time looking into was that em-dash use could reflect the local English dialect of the RLHF workers. The final stage of training a language model involves RLHF: reinforcement learning with human feedback. Essentially, hundreds of human testers are paid to interact with the model and grade model outputs, which are then fed back into the model to make it more helpful and friendly.

The AI company paying for this work is incentivized to do it in countries that are low cost-of-living but have many fluent English speakers. For OpenAI, this meant African countries like Kenya and Nigeria. But one interesting consequence of this decision is that African English is subtly different from American or British English. For instance, African English uses the word “delve” more liberally, which is the explanation for why GPT-4o really likes the word “delve” (and other flowery words like “explore” and “tapestry”).

Does African English use a lot of em-dashes, causing African RLHF workers to rate responses with em-dashes highly? This would be a neat explanation, but I don’t think it’s true. I pulled a dataset of Nigerian English text and measured the frequency of em-dashes per-word. Out of all words in the dataset, 0.022% of them were em-dashes. This paper about the frequency of punctuation marks in English text in general estimates general em-dash rates as between 0.25% and 0.275%:

The use of the dash increased after 1750, then reached its peak (about 0.35%) in 1860, but afterwards continued to drop up until the 1950s before starting to fluctuate between 0.25% and 0.275%. The frequency of punctuation marks calculated in the current study is relative to word count in corpora.

Remember that point about em-dash rates peaking in 1860 for later. But for now, it seems like Nigerian English, which is a good-enough stand-in for punctuation rates in African English, is actually less prone to use em-dashes. For that reason, I don’t think the overuse of em-dashes and “delve” are caused by the same mechanism.
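The per-word measurement described above is easy to reproduce. A minimal sketch, using the same relative-to-word-count convention as the punctuation study (the `corpus` string here is a stand-in; the actual Nigerian English dataset is not reproduced):

```python
def em_dash_rate(text: str) -> float:
    """Em-dashes as a percentage of the corpus word count."""
    dashes = text.count("\u2014")  # U+2014 EM DASH
    words = len(text.split())
    return 100.0 * dashes / words if words else 0.0

corpus = "It was a dark night\u2014stormy, too\u2014and very quiet."
rate = em_dash_rate(corpus)  # percentage of words that are em-dashes
```

Running this over a large corpus and comparing against the 0.25–0.275% general-English baseline is essentially the experiment described above.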

Digitization of print media

One interesting observation about em-dashes is that GPT-3.5 did not use them. GPT-4o used ~10x more em-dashes than its predecessor, and GPT-4.1 was even worse. However, Anthropic and Google’s models do use em-dashes. Even the open-source Chinese models use em-dashes. What changed between November 2022 and July 2024?

One thing that changed was the makeup of the training data. In 2022, OpenAI was almost certainly training on a mix of public internet data and pirated books from sites like LibGen. However, once the power of language models became apparent, AI labs quickly realized that they needed more high-quality training data, which meant scanning a lot of print books. Only OpenAI employees know when or if OpenAI started scanning books, but court filings have revealed that Anthropic started their process in February 2024. I think it’s reasonable to assume that OpenAI did something similar. In other words, between 2022 and 2024 the training data changed to include a lot of print books.

Remember the punctuation rates study above that showed em-dash rates peaking in 1860? I think it’s a plausible theory that the books AI labs digitized skewed closer to 1860 than the pirated books. Intuitively, pirated content biases towards contemporary and popular literature, because that’s what people want to download. If AI labs wanted to go beyond that, they’d have to go and buy older books, which would probably have more em-dashes. We now arrive at what I think is the most plausible explanation for why AI models include so many em-dashes:

State-of-the-art models rely on late-1800s and early-1900s print books for high-quality training data, and those books use ~30% more em-dashes than contemporary English prose. That’s why it’s so hard to get models to stop using em-dashes: because they learned English from texts that were full of them.
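The "~30% more" figure follows directly from the study's numbers: the 1860 peak of about 0.35% against the modern 0.25–0.275% band. A quick back-of-the-envelope check:

```python
# Rough arithmetic behind the "~30% more" claim, using the rates from the
# punctuation study quoted earlier.
peak_1860 = 0.35               # % of words at the 1860 peak
modern = (0.25 + 0.275) / 2    # midpoint of the modern band
relative_increase = peak_1860 / modern - 1
print(f"{relative_increase:.0%}")  # → 33%
```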

I want to thank this blog from Maria Sukhareva for putting me onto this point. I disagree with her that em-dashes are structurally preferred, for reasons I’ve briefly covered above, but I think it’s very plausible that she’s correct about digitization driving em-dash usage. For some more specific examples and a similar point, you can also check out this post, which shows just how many em-dashes some classic works have. My favorite book, Moby-Dick, has a staggering 1728 em-dashes!

Summary

There are three broad categories of possible explanation for why models use em-dashes so much.

The first category is structural explanations, which argue that em-dashes are somehow inherently preferred by autoregressive models because they save tokens, or preserve optionality, or something else. I don’t find this convincing because GPT-3.5 didn’t overuse em-dashes, and it just doesn’t match my intuition about how inference works.

The second category is RLHF explanations, which argue that human raters prefer em-dashes because they’re more conversational or more common in the particular variant of English the RLHF workers speak. I think there’s no support for the variant-of-English argument, but the “it’s more conversational” argument could be right. It’s hard to say what evidence could confirm or deny it.

The third category is training data explanations, which argue that em-dashes are simply abundant in the training data. I don’t buy this as a general explanation, but it does seem likely to me that they might be overrepresented in some high-quality training data: in particular, early-1900s print books. Overall, I think that’s the strongest explanation.

Final thoughts

This is still largely based on speculation. Maybe I’m wrong about when OpenAI started digitizing written text. If they did it before GPT-3.5, then it couldn’t be the cause of em-dashes. Certainly models trained today are at least in part infected with em-dashes by training on the output of other AI models. Either they’re deliberately trained on synthetic data, or they just can’t avoid vacuuming in a host of AI-generated content along with other internet texts.

One thing I’m still a bit confused about: if em-dashes are common because they’re a feature of late-1800s/early-1900s writing, why doesn’t AI prose read more like Moby-Dick? Is it plausible that the models are picking up fragments of older English prose stylings, like punctuation, but are still producing contemporary-sounding text?

I also might be wrong that newly-digitized content would have older publication dates. It’s plausible to me that pirated books would skew more contemporary, but could that be outweighed by the number of older books that are in the public domain?

There also might be a simpler explanation for em-dash prevalence: for instance, maybe em-dashes just read more conversational, so they were preferred by RLHF-ers, and this created a vicious cycle that biased towards more and more em-dashes. This would kind of line up with a Sam Altman interview clip where he says they added more em-dashes because people liked them. I don’t know how you’d go about proving or disproving this.

In general, I’m still surprised that there’s no widespread consensus about the cause of one of the most identifiable features of AI prose. I do think I’m probably right that digitizing late-1800s/early-1900s works is the cause - but it would be really nice if someone who was at OpenAI between GPT-3.5 and GPT-4o (or who’s in a position to know for some other reason) could confirm that this is what happened.

If you liked this post, consider subscribing to email updates about my new posts, or sharing it on Hacker News.
