# GPT-5-Codex is a better AI researcher than me

Original link: https://www.seangoedecke.com/ai-research-with-codex/

## AI research: a little help from a (coding) friend

This experiment explores how far AI research can go on limited resources: specifically, training a model on a laptop in five minutes. The author first tried it on his own and got a 1.8M-parameter transformer that could generate simple stories. The real breakthrough, however, came from using OpenAI's GPT-5-Codex, essentially using AI to *do* AI research. Codex generated ideas, ran experiments, and iterated on the training script on its own, substantially outperforming the author's solo effort. The process ran as a feedback loop: Codex would modify the script, run tests, and propose improvements, and the author chose the next step.

Early attempts with n-gram models were fast but lacked coherence. Transformers showed promise, but optimizing purely for perplexity led to repetitive, nonsensical output. The most successful approach was "distillation": pre-training a transformer to imitate the grammar of a quickly trained n-gram model, then refining it on the TinyStories dataset. This produced surprisingly coherent and engaging stories. The author calls this collaborative approach "vibe research", acknowledging that it is not equivalent to professional AI research but offers a surprisingly accessible and effective way to explore the field.

## AI and the shifting landscape of research

A recent blog post sparked debate on Hacker News about GPT-5-Codex's capabilities and whether it might surpass human "AI researchers". The author, a GitHub engineer, described using AI to successfully replicate a research project, which led to the striking title. Many commenters downplayed its significance, arguing that the work was at an undergraduate level and the results were not groundbreaking. Some raised concerns about reliance on an expensive OpenAI subscription and the lack of originality, comparing the effort to simply downloading an existing codebase. A recurring theme was that AI "raises the floor", letting hobbyists enter the field, while whether it can push the frontier remains an open question. Many users pointed to problems with LLM-generated research: inaccuracies, reliance on dubious sources, and a lack of genuine understanding. Others worried about the devaluation of skills as AI automates work that previously required expertise. Ultimately, the discussion suggests that AI is changing *how* research is done, but whether it will actually replace researchers remains contested.

## Original article

In *What’s the strongest AI model you can train on a laptop in five minutes?*, I tried my hand at answering a silly AI-research question. You can probably guess what it was.

I chatted with GPT-5 to help me get started with the Python scripts and to bounce ideas off, but it was still me doing the research. I was coming up with the ideas, running the experiments, and deciding what to do next based on the data. The best model I could train was a 1.8M param transformer which produced output like this:

Once upon a time, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.

Since then, OpenAI has released GPT-5-Codex, and supposedly uses it (plus Codex, their CLI coding tool) to automate a lot of their product development and AI research. I wanted to try the same thing. Codex-plus-me did a much better job than me alone. Here’s an example of the best output I got from the model I trained with Codex:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

What was the process like to get there?

What vibe research looks like

I want to call it “vibe research”. Like “vibe coding”, it’s performing a difficult technical task by relying on the model. I have a broad intuitive sense of what approaches are being tried, but I definitely don’t have a deep enough understanding to do this research unassisted. A real AI researcher would get a lot more out of the tool.

Still, it was very easy to get started. I gave Codex the path to my scratch directory, told it “continue the research”, and it immediately began coming up with ideas and running experiments on its own. In a way, the “train in five minutes” challenge is a perfect fit, because the feedback loop is so short.

The basic loop of doing AI research with Codex (at least as an enthusiastic amateur) looks something like this:

  1. Codex makes a change to the training script and does three or four runs (this takes ~20 minutes overall)
  2. Based on the results, Codex suggests two or three things that you could try next
  3. I pick one of them (or very occasionally suggest my own idea) and return to (1).

After two days I did paste the current research notes into GPT-5-Pro, which helped a bit, but the vast majority of my time was spent in this loop. As we’ll see, the best ideas were ones Codex already came up with.

I chewed through a lot of tokens doing this. That’s OK with me, since I paid for the $200-per-month plan, but if you don’t want to do that you’ll have to pace your research a bit more slowly. I restarted my Codex process every million tokens or so. It didn’t have any issue continuing where it left off from its previous notes, which was nice.

I ran Codex with --sandbox danger-full-access. By default it didn’t have access to MPS, which meant it could only train models on the CPU. There’s probably some more principled way of sandboxing it, but I didn’t bother to figure it out. I didn’t run into any runaway-agent problems, unless you count crashing my laptop a few times by using up too much memory.

How did the research go?

Here’s a brief summary of how the research went over the four or five days I spent poking at it. I stayed with the TinyStories dataset for all of this, partially because I think it’s the best choice and partially because I wanted a 1:1 comparison between Codex and my own efforts.

N-gram models

Codex and I started with a series of n-gram models: instead of training a neural network, an n-gram model just stores the conditional probability of the next token given the n-1 tokens that precede it. These models are very quick to produce (seconds, not minutes) but aren’t very good. The main reason is that even a 5-gram model cannot include context from more than four tokens ago, so it struggles to produce coherent text across an entire sentence. Here’s an example:

Once upon a time , in a small school . ” they are friends . they saw a big pond . he pulled and pulled , but the table was still no attention to grow even more . she quickly ran to the house . she says , ” sara said . ” you made him ! ” the smooth more it said , for helping me decorate the cake .

It’s not terrible! There are short segments that are entirely coherent. But it’s kind of like what AI skeptics think LLMs are like: just fragments of the original source, remixed without any unifying through-line. The perplexity is 18.5, worse than basically any of the transformers I trained in my last attempt.

Codex trained 19 different n-gram models, of which the above example (a 4-gram model) was the best. In my view, this is one of the strengths of LLM-based AI research: it is trivial to tell the model “go and sweep a bunch of different values for the hyperparameters”. Of course, you can do this yourself. But it’s a lot easier to just tell the model to do it.
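
To make the earlier description concrete, here’s a minimal sketch of a count-based n-gram model (illustrative code, not the actual script from this project):

```python
import random
from collections import defaultdict, Counter

def train_ngram(tokens, n=4):
    """Count how often each token follows each (n-1)-token context."""
    counts = defaultdict(Counter)
    for i in range(len(tokens) - n + 1):
        context = tuple(tokens[i : i + n - 1])
        counts[context][tokens[i + n - 1]] += 1
    return counts

def sample(counts, prompt, n=4, length=50):
    """Extend the prompt by sampling from the stored conditional probabilities."""
    out = list(prompt)
    for _ in range(length):
        context = tuple(out[-(n - 1):])
        if context not in counts:
            break  # unseen context; a real model would back off to a shorter n
        nexts = counts[context]
        out.append(random.choices(list(nexts), weights=list(nexts.values()))[0])
    return " ".join(out)

# Toy usage with whitespace-split "tokens"; the real runs used a proper tokenizer.
tokens = "once upon a time there was a little boy named tim .".split()
model = train_ngram(tokens, n=3)
print(sample(model, ["once", "upon"], n=3, length=20))
```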

Back to transformers

After this, Codex spent a lot of time working on transformers. It trained ~50 normal transformers with different sizes, numbers of heads, layers, and so on. Most of this wasn’t particularly fruitful. I was surprised that my hand-picked hyperparameters from my previous attempt were quite competitive - though maybe it shouldn’t have been a shock, since they matched the lower end of the Chinchilla scaling laws.

Still, eventually Codex hit on an 8.53-perplexity model (3 layers, 4 heads, and a dimension of 144), which was a strict improvement over my last attempt. I’m not really convinced this was an architectural improvement. One lesson from training fifty different models is that there’s quite a lot of variance between different seeds. A perplexity improvement of just over 1 is more or less what I was seeing on a “lucky seed”.

This was an interesting approach for the challenge: going for pure volume and hoping for a lucky training run. You can’t do this with a larger model, since it takes so long to train, but the five-minute limit makes it possible.
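
As a flavour of what that sweep looks like in code, here’s a rough sketch. The `train_and_eval` helper is hypothetical (assumed to train one small transformer within the five-minute budget and return validation perplexity), and the grid values are loosely based on the configurations mentioned above:

```python
import itertools
import random

def sweep(train_and_eval):
    """Brute-force a small grid of architectures, one five-minute run each."""
    grid = {
        "n_layer": [2, 3, 4],
        "n_head": [2, 4],
        "d_model": [96, 144, 192],
    }
    results = []
    for n_layer, n_head, d_model in itertools.product(*grid.values()):
        seed = random.randrange(10_000)  # seed variance matters almost as much as architecture
        ppl = train_and_eval(n_layer=n_layer, n_head=n_head, d_model=d_model, seed=seed)
        results.append((ppl, n_layer, n_head, d_model, seed))
    results.sort()       # lowest perplexity first
    return results[:5]   # keep the few "lucky" runs worth re-checking
```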

Minimizing perplexity was a mistake

The next thing Codex tried - based on some feedback I pasted in from GPT-5-Pro - was “shallow fusion”: instead of training a new model, updating the generation logic to blend the transformer’s predicted token probabilities with an n-gram model, a “kNN head” (which looks up hidden states that are “nearby” the current hidden state of the transformer and predicts their tokens), and a “cache head” that makes the model more likely to repeat words that are already in the context.
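
As a rough sketch, the fusion step at sampling time might look something like this (the weights and the exact helper distributions are my own illustrative assumptions, not the values Codex used):

```python
import torch
import torch.nn.functional as F

def fused_next_token_probs(tf_logits, ngram_probs, knn_probs, context_ids, vocab_size,
                           w_tf=0.7, w_ngram=0.1, w_knn=0.1, w_cache=0.1):
    """Blend several next-token distributions at generation time (shallow fusion).

    tf_logits:   [vocab] logits from the transformer for the next position
    ngram_probs: [vocab] probabilities from the count-based n-gram model
    knn_probs:   [vocab] probabilities from a kNN lookup over stored hidden states
    context_ids: token ids generated so far (used by the cache head)
    """
    tf_probs = F.softmax(tf_logits, dim=-1)

    # Cache head: a uniform bump on tokens that already appear in the context.
    cache_probs = torch.zeros(vocab_size)
    if context_ids:
        cache_probs[torch.tensor(context_ids)] = 1.0
        cache_probs /= cache_probs.sum()

    mixed = (w_tf * tf_probs + w_ngram * ngram_probs
             + w_knn * knn_probs + w_cache * cache_probs)
    return mixed / mixed.sum()  # renormalise before sampling
```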

Shallow fusion immediately dropped perplexity down to 7.38: a whole point lower than our best transformer. I was excited about that, but the generated content was really bad:

Once upon a time,, in a small house, there lived a boy named Tim. Tim loved to play outside with his ball. One Mr. Skip had a lot of fun. He ran everywhere every day. One One day, Tim was playing with his ball new ball near his house. Tim was playing with his his ball and had a lot of fun. But then, he saw a big tree and decided to climb it. Tim tried to climb the tree, but he was too big. He was too small to reach the top of the tree. But the tree was too high. The little tree was too high for him. Soon, Tim was near the tree. He was brave and climbed the tree. But when he got got to the top, he was sad. Tim saw a bird on

What happened? I over-optimized for perplexity. As it turns out, the pure transformers that were higher-perplexity were better at writing stories. They had more coherence over the entire length of the story, they avoided generating weird repetition artifacts (like ”,,”), and they weren’t as mindlessly repetitive.

I went down a bit of a rabbit hole trying to think of how to score my models without just relying on perplexity. I came up with some candidate rubrics, like grammatical coherence, patterns of repetition, and so on, before giving up and just using LLM-as-a-judge. To my shame, I even generated a new API key for the LLM before realizing that I was talking to a strong LLM already via Codex, and I could just ask Codex to rate the model outputs directly.

Distillation from n-grams

The final and most successful idea I tried was distilling a transformer from an n-gram teacher model. First, we train an n-gram model, which only takes ~10 seconds. Then we train a transformer - but for the first 200 training steps, we push the transformer towards predicting the tokens that the n-gram model would predict. After that, the transformer continues to train on the TinyStories data as usual.
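
Here’s a minimal sketch of what that warm-start could look like inside a PyTorch training loop. The post only specifies the 200-step imitation phase; the KL-divergence loss against the n-gram distribution and the `ngram_teacher` interface are my assumptions:

```python
import torch
import torch.nn.functional as F

N_WARMUP_STEPS = 200  # from the post: imitate the n-gram teacher for the first 200 steps

def training_step(model, optimizer, batch, step, ngram_teacher):
    """One step of the warm-start distillation schedule (illustrative sketch).

    batch["input_ids"], batch["targets"]: [B, T] token ids (targets are inputs shifted by one).
    ngram_teacher(input_ids) is assumed to return [B, T, vocab] next-token probabilities.
    """
    input_ids, targets = batch["input_ids"], batch["targets"]
    logits = model(input_ids)  # [B, T, vocab]

    if step < N_WARMUP_STEPS:
        # Push the transformer towards the n-gram model's next-token distribution.
        with torch.no_grad():
            teacher_probs = ngram_teacher(input_ids)  # [B, T, vocab]
        loss = F.kl_div(F.log_softmax(logits, dim=-1),
                        teacher_probs, reduction="batchmean")
    else:
        # Ordinary next-token cross-entropy on TinyStories for the rest of the run.
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                               targets.reshape(-1))

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```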

Here’s an example of some output:

Once upon a time, in a big forest, there lived a little bunny named Ben. Ben loved to play with his friends in the forest. One day, Ben’s mom saw him and was sad because he couldn’t find his friend. She asked, “Why are you sad, Ben?” Ben said, “I lost my toy. I can’t find it.” Ben wanted to help Ben find his toy. He knew they could fix the toy. He went to Sam’s house and found the toy under a tree. Sam was so happy and said, “Thank you, Ben! You are a very pretty toy!” Ben smiled and said, “Yes, I would love to help you.” They played together all day long. The moral of the story is to help others when they needed it.

I think this is pretty good! It has characters that continue throughout the story. It has a throughline - Ben’s lost toy - though it confuses “toy” and “friend” a bit. It’s a coherent story, with a setup, problem, solution and moral. This is much better than anything else I’ve been able to train in five minutes.

Why is it better? I think the right intuition here is that transformers need to spend a lot of initial compute (say, two minutes) learning how to construct grammatically-correct English sentences. If you begin the training by spending ten seconds training an n-gram model that can already produce sort-of-correct grammar, you can speedrun your way to learning grammar and spend an extra one minute and fifty seconds learning content.

I really like this approach. It’s exactly what I was looking for from the start: a cool architectural trick that genuinely helps, but only really makes sense for this weird challenge.

Final thoughts

I don’t have any illusions about this making me a real AI researcher, any more than a “vibe coder” is a software engineer. Still, I’m surprised that it actually worked. And it was a lot of fun!

I’ve pushed up the code here if you want to pick up from where I left off, but you may be better off just starting from scratch with Codex or your preferred coding agent.
