Can LLMs do randomness?

Original link: https://rnikhil.com/2025/04/26/llm-coin-toss-odd-even

This post takes a light-hearted look at randomness in large language models (from OpenAI and Anthropic) through coin-toss and number-generation experiments. Using prompts designed to elicit unbiased responses, it finds that most models exhibit statistically significant bias. In the coin-toss experiment (100 tosses per model), every model showed a bias toward "heads", with Claude 3.7 Sonnet the least biased (58% heads) and GPT-o1 the most biased (99% heads). Only Claude's bias was not statistically significant. In the number-generation experiment (0–10), most models showed a preference for odd numbers. Surprisingly, Claude, which was unbiased in the coin toss, showed a strong odd-number preference here (97%). By contrast, GPT-4.5-preview produced a perfectly balanced odd/even split, and GPT-4.1 showed no statistically significant bias. The post emphasizes that although LLMs understand randomness in theory, the influence of their training data can introduce unexpected, statistically significant biases, underscoring the importance of understanding these tendencies.

A Hacker News thread discusses whether large language models (LLMs) can generate randomness. The original post describes an experiment testing LLMs on tasks such as tossing a coin and picking a number. Commenters point out that LLMs themselves are deterministic and that randomness comes from post-processing, such as temperature settings and potential server-side injection. Some suggest that a more rigorous analysis would test randomness on a local model with a fixed seed and temperature. The discussion covers factors that influence LLM output, including floating-point errors, hardware differences, and the effect of reinforcement learning on token-prediction variance. Some also note that biases may shape LLM responses, mirroring human biases in random number generation. The thread stresses the importance of defining randomness and exploring its various aspects in LLMs. Ultimately, most agree that LLMs rely on an external source of randomness (a PRNG) rather than generating it intrinsically.

Original article

Are LLMs random?

While LLMs theoretically understand “randomness,” their training data distributions may create unexpected patterns. In this article we will test different LLMs from OpenAI and Anthropic to see if they produce unbiased results. In the first experiment we have each model toss a fair coin; in the second, we have it guess a number between 0 and 10 and check whether the results are evenly split between even and odd. I know the sample sizes are small and probably not very statistically significant. This whole thing is just for fun.

Experiment 1: Tossing a fair coin

Prompt used: Toss a fair coin. Just say “heads” or “tails”. Just output the result. Don’t say anything else. Don’t write code. Don’t use any tools.
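The post doesn't include its collection script, but a minimal sketch of such a tallying loop might look like the following (the OpenAI Python SDK, the model name, and the 100-trial count are assumptions based on the setup described above, not the author's actual code):

    # Minimal sketch: tally 100 coin-toss responses from one model.
    # Assumes the OpenAI Python SDK (openai>=1.0) and OPENAI_API_KEY in the
    # environment; this is illustrative, not the author's actual script.
    from collections import Counter
    from openai import OpenAI

    client = OpenAI()
    PROMPT = ('Toss a fair coin. Just say "heads" or "tails". Just output the result. '
              "Don't say anything else. Don't write code. Don't use any tools.")

    counts = Counter()
    for _ in range(100):
        resp = client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": PROMPT}],
        )
        counts[resp.choices[0].message.content.strip().lower()] += 1

    print(counts)  # e.g. Counter({'heads': 96, 'tails': 4})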

Before we plot the results, we calculate deviation. Deviation simply measures how far each model’s heads probability strays from the ideal unbiased value (0.5 or 50%). It’s calculated as:

Deviation = P(Heads) - 0.5

For example, Claude 3.7 Sonnet has P(Heads) = 0.58, so its deviation is 0.58 - 0.5 = 0.08 (or 8%). This directly quantifies bias magnitude and direction, with positive values indicating heads bias and negative values indicating tails bias. The first graph shows raw proportions of heads vs tails, while the second graph visualizes these deviations.
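As a worked example, the deviation is a one-line computation; the sketch below plugs in two of the heads probabilities reported later in the post:

    # Deviation from the unbiased 0.5, using two of the reported heads probabilities.
    p_heads = {"claude-3.7-sonnet": 0.58, "gpt-o1": 0.99}
    for model, p in p_heads.items():
        print(f"{model}: deviation = {p - 0.5:+.2f}")
    # claude-3.7-sonnet: deviation = +0.08
    # gpt-o1: deviation = +0.49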

Next we also do a chi-squared test to determine whether the bias is statistically significant or could reasonably occur by chance. I know we don’t have a big enough sample size but I am just doing this for fun. For each model, it’s calculated as:

χ² = Σ (Observed - Expected)²/Expected

With 100 tosses per model and an expected 50/50 split:

χ² = (Observed_Heads - 50)²/50 + (Observed_Tails - 50)²/50

For Claude 3.7 Sonnet:

χ² = (58 - 50)²/50 + (42 - 50)²/50 = 2.56

A χ² value greater than 3.84 (critical value for df=1, p=0.05) indicates statistical significance. Models with statistically significant bias are shown in red in the deviation graph, indicating their bias likely reflects an inherent trait rather than random chance. Claude’s χ² = 2.56 falls below this threshold, suggesting its observed bias could reasonably occur by random variation.
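The same test can be reproduced in a few lines; the sketch below uses SciPy's chisquare (the post doesn't say how the test was actually computed, so the library choice is an assumption), with Claude 3.7 Sonnet's reported counts:

    # Chi-squared goodness-of-fit test against a fair 50/50 split.
    # Counts are Claude 3.7 Sonnet's reported 58 heads / 42 tails out of 100 tosses.
    from scipy.stats import chisquare

    chi2, p_value = chisquare([58, 42], f_exp=[50, 50])
    print(f"chi2 = {chi2:.2f}, p = {p_value:.3f}")
    # chi2 = 2.56, p = 0.110 -> below the 3.84 critical value, so not significant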

Key Findings:

  • All models show “heads” bias - every LLM tested produced more heads than tails
  • Bias severity varies significantly - ranging from 8% (Claude) to 49% (GPT-o1)
  • Statistical significance - Claude is the only model whose bias isn’t statistically significant
  • OpenAI models show substantially stronger heads bias than Claude

Analysis Details:

  • Most biased: o1 (99% heads) and GPT-4.1 (96% heads)
  • Least biased: Claude 3.7 Sonnet (58% heads)
  • Average bias: 30.7% deviation from perfect balance
  • Chi-square tests confirm statistical significance for all models except Claude

Experiment 2: Odd vs even

Prompt used: Generate a random number between 0 and 10 (both inclusive). Just output the number. Don’t say anything else. Don’t write code. Don’t use any tools. Don’t explain. Don’t output anything except the number.

Now we repeat the same analysis as above and plot the numbers.
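For this experiment the only new step is classifying each response by parity; a minimal sketch (the responses list here is placeholder data for illustration, not the collected outputs):

    # Parity tally for the number-generation experiment.
    # The responses list is placeholder data, not the actual collected outputs.
    from collections import Counter

    responses = ["7", "3", "9", "4", "7"]   # raw model outputs as strings
    parity = Counter("odd" if int(r) % 2 else "even" for r in responses)
    print(parity)  # e.g. Counter({'odd': 4, 'even': 1})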

Key Findings:

  • Strong odd number bias in most models - 4 out of 6 models show statistically significant preference for odd numbers
  • Claude shows extreme bias - With 97% odd numbers, Claude 3.7 Sonnet has the strongest bias (47% deviation from expected)
  • GPT-4.5 shows perfect balance - Exactly 50/50 distribution between odd and even
  • Two unbiased models - GPT-4.5-preview and GPT-4.1 show no statistically significant bias

Statistical Analysis:

  • Most biased: Claude 3.7 Sonnet (χ² = 88.36, p < 0.05)
  • Perfectly balanced: GPT-4.5-preview (χ² = 0.00)
  • Average bias magnitude: 18.0% deviation from expected 50/50 split
  • Direction of bias: Most models favor odd numbers, while GPT-4.1 slightly favors even numbers

It's interesting to see Claude being unbiased while tossing coins but being super biased when predicting odd/even numbers.

Raw data

Coin toss

Odd vs Even
