Humans are not good random number generators.
If you ask a person to "pick a random number between 1 and 100", they are
remarkably predictable. Answers cluster on 37 and 73, on "messy" numbers, and
on memes like 42 and 69, while round numbers are quietly avoided. A true random
generator would instead produce a flat, uniform distribution.
This project asks gpt-4.1 the same question 10,000 times and
characterizes the distribution it produces, measured against a uniform baseline.
Does an LLM, which is trained on human text, behave like a fair die, or does it inherit
the lumpy human pattern?
Full design and methodology: docs/LLM Random Bias Experiment SDD.md.
This experiment is an LLM-focused follow-up to two well-known explorations of human number-picking bias.
Full experimental design is in the SDD; the essentials:
- Model. gpt-4.1 (OpenAI), called via the Responses API. It is a non-reasoning model. It emits a direct answer rather than deliberating; what we're measuring is its raw output distribution, not a reasoning strategy. The exact model string is recorded in every raw-CSV row (Model column) and in data/raw/run_metadata.json, so the dataset is self-describing.
- Sample size. N = 10,000 independent calls, enough for a chi-square goodness-of-fit test and per-number proportions stable to ~±0.5 pp.
- Sampling. temperature = 1.0, so the model exercises its full sampling distribution. This is the experiment: at low temperature it would just repeat one number.
- Prompt. A fixed system prompt instructs the model to output only one integer between 1 and 100; the user prompt requests the number and carries a unique uuid4. (The UUID is request-tracing hygiene, not cache-busting; at temperature 1.0 every call should sample independently regardless.)
- Baseline. The result is compared against a uniform distribution (what a fair generator would produce), not against human data (see Assumptions).
- Pipeline. Four stages: collect → clean → transform → stats, detailed below. Cleaning validates that every answer is an integer in [1, 100] and reports the rejection rate.
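As a rough illustration of the Model, Sampling, and Prompt settings above, a single call could be assembled as in the sketch below. The prompt wording, the choice to pass the system prompt through the `instructions` parameter, and the helper names `build_request` and `ask_once` are all assumptions made for this example; the project's actual collector may differ.

```python
import uuid
from typing import Any

MODEL = "gpt-4.1"
TEMPERATURE = 1.0

# Hypothetical wording: the project's real prompts are not reproduced here.
SYSTEM_PROMPT = "Reply with exactly one integer between 1 and 100 and nothing else."


def build_request() -> dict[str, Any]:
    """Build keyword arguments for one independent Responses API call."""
    request_id = str(uuid.uuid4())  # unique per call, used for request tracing
    return {
        "model": MODEL,
        "temperature": TEMPERATURE,
        "instructions": SYSTEM_PROMPT,  # assumption: system prompt sent this way
        "input": f"Pick a random number between 1 and 100. Request ID: {request_id}",
    }


def ask_once(client: Any) -> str:
    """Send one request and return the raw text answer with whitespace trimmed."""
    response = client.responses.create(**build_request())
    return response.output_text.strip()
```

With the official SDK installed, `client` would typically be `openai.OpenAI()`, which reads `OPENAI_API_KEY` from the environment. Passing the client in as an argument keeps the sketch testable without network access.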
This is an illustrative probe, not a definitive study. Key caveats — see the SDD's Limitations section for the formal treatment:
- Single model. Results describe gpt-4.1 only and do not generalize to other models or providers.
- "Randomness" is a sampling artifact. The model is not a random number generator; it samples a learned token distribution. We characterize that distribution; we do not claim the model is trying to be random.
- Prompt- and temperature-dependent. A different prompt wording or sampling temperature could shift the distribution. Both are fixed and documented.
- Not "ChatGPT the product." This tests a model through the API at a fixed temperature — not the consumer ChatGPT app, which adds routing, tools, and a system prompt outside our control.
gpt-4.1 is emphatically not a uniform random generator. A chi-square goodness-of-fit test against a uniform distribution (N = 10,000, df = 99) returns χ² = 15,604, p ≈ 0: the deviation is so large that the p-value falls far below any conventional significance threshold. Asked for a random number, the model produces a lumpy, distinctly human-shaped distribution.
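For readers who want to see what the goodness-of-fit statistic measures, here is a minimal pure-Python sketch. The function name `chi_square_uniform` is hypothetical, and the sketch assumes the picks have already been cleaned to integers in [1, 100]; it is not the project's stats module.

```python
from collections import Counter


def chi_square_uniform(
    picks: list[int], low: int = 1, high: int = 100
) -> tuple[float, int]:
    """Return (chi-square statistic, degrees of freedom) against a uniform baseline."""
    categories = high - low + 1
    expected = len(picks) / categories  # same expected count for every number
    counts = Counter(picks)
    statistic = sum(
        (counts.get(n, 0) - expected) ** 2 / expected
        for n in range(low, high + 1)
    )
    return statistic, categories - 1
```

With N = 10,000 the expected count is 100 per number, so a number that is never picked contributes (0 − 100)² / 100 = 100 to the statistic on its own. If SciPy is available, `scipy.stats.chisquare` applied to the 100 observed counts computes the same statistic plus a p-value, since its default expectation is equal frequencies.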
| Number | Picked vs. uniform chance | Human reputation |
|---|---|---|
| 37 | 4.0× | "the most random number" |
| 42 | 4.0× | Hitchhiker's Guide meme |
| 73 | 3.4× | the other well-known spike |
The five most-picked numbers overall — 47, 57, 72, 37, 42 — lean heavily on
numbers ending in 7 (three of the five), the same "number that feels random" pull seen in
humans.
All multiples of 10, except for 10 itself, were picked exactly 0 times in 10,000 calls. 10 was picked exactly once. Humans avoid round numbers — gpt-4.1 essentially refuses them.
One number breaks the human pattern. 69 is a meme number humans over-pick. gpt-4.1 under-picks it (0.29× expected: ~29 occurrences against ~100). The model inherited the "smart" meme (42) and not the crude one. Our working explanation is that this is a product of safety guardrails during pre-training and post-training. This is the most interesting finding in the dataset: the model's bias is not a raw copy of human bias but a moderated version of it.
The hypothesis holds. An LLM trained on human text, asked to be random, reproduces human random-number bias: the pull toward 37 and 73, the meme spike at 42, and the aversion to round numbers, with one exception that is likely guardrail-driven. The interactive distribution chart shows the full 1–100 shape.
All figures from data/processed/stats_summary.csv.
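The "picked vs. uniform chance" multipliers quoted above are an observed count divided by the count a uniform generator would be expected to produce. A minimal sketch of that arithmetic follows; the function name `ratio_to_uniform` is hypothetical and not taken from the project.

```python
def ratio_to_uniform(count: int, total: int, categories: int = 100) -> float:
    """How many times more (or less) often a number was picked than chance predicts."""
    expected = total / categories  # 100 picks per number when total is 10,000
    return count / expected
```

For example, `ratio_to_uniform(29, 10_000)` gives 0.29, matching the 69 figure above: about 29 occurrences against an expected 100.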
collect → clean → transform → stats. Each stage reads the previous stage's
committed CSV, so any stage can be re-run on its own.
| Stage | Module | Output |
|---|---|---|
| Collect | llm_random_bias.collect | data/raw/chatgpt_random_results.csv |
| Clean | llm_random_bias.clean | data/processed/chatgpt_random_clean.csv |
| Transform | llm_random_bias.transform | data/processed/distribution.csv |
| Stats | llm_random_bias.stats | data/processed/stats_summary.csv |
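To make the clean and transform stages concrete, the sketch below validates raw answers and tallies them into a 1–100 distribution. The function names and the strict parsing rule (digits only, no surrounding prose) are assumptions for illustration; the real modules may parse answers more leniently and read from and write to the CSV files listed above.

```python
from collections import Counter


def clean_answers(raw_answers: list[str]) -> tuple[list[int], float]:
    """Keep answers that are integers in [1, 100]; return them with the rejection rate."""
    valid: list[int] = []
    for answer in raw_answers:
        text = answer.strip()
        # isdecimal() rejects signs, decimals, and prose such as "37." or "about 42".
        if text.isascii() and text.isdecimal() and 1 <= int(text) <= 100:
            valid.append(int(text))
    rejected = len(raw_answers) - len(valid)
    rate = rejected / len(raw_answers) if raw_answers else 0.0
    return valid, rate


def distribution(valid: list[int]) -> dict[int, int]:
    """Count picks for every number 1..100, keeping zero-count numbers explicit."""
    counts = Counter(valid)
    return {n: counts.get(n, 0) for n in range(1, 101)}
```

Keeping zero-count numbers in the output matters for this dataset, because numbers that were never picked still need a row for charting and for the chi-square test.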
This project uses uv for everything.
Path 1: reproduce the analysis from the committed data. The raw dataset is committed to this repo, so you can reproduce the entire analysis without spending a cent:

    uv run python -m llm_random_bias.clean
    uv run python -m llm_random_bias.transform
    uv run python -m llm_random_bias.stats

Path 2: collect a fresh dataset from the API:

    cp .env.example .env   # then edit .env and add your OPENAI_API_KEY
    uv run python -m llm_random_bias.collect
    # then run clean / transform / stats as in Path 1

Cost & runtime: ~10,000 short calls to gpt-4.1 cost roughly US$2 and finish in a few minutes at the default concurrency. The collector refuses to overwrite an existing raw CSV; delete it first to re-collect.
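One simple way to get the "refuse to overwrite" behaviour is Python's exclusive-creation file mode. This is a sketch of the general technique, not necessarily how the collector implements it; the helper name `open_new_raw_csv` and the `Answer` column are hypothetical.

```python
import csv
from pathlib import Path


def open_new_raw_csv(path: Path, header: list[str]):
    """Create the raw CSV and write its header; raise FileExistsError if it exists."""
    # Mode "x" is exclusive creation: open() fails instead of truncating existing data.
    handle = path.open("x", newline="", encoding="utf-8")
    csv.writer(handle).writerow(header)
    return handle
```

The caller is responsible for closing the returned handle, for instance by wrapping it in a `with` statement. A second call on the same path raises `FileExistsError`, so previously collected data is never silently replaced.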
The distribution bar chart is built in Exmergo Viz (our AI dashboard agent) directly from
data/processed/distribution.csv. The fully interactive data viz can be viewed here.

    uv run ruff check .
    uv run ruff format .
    uv run mypy src
    uv run pytest

See CONTRIBUTING.md.
MIT — see LICENSE.