# Train Your Own LLM from Scratch

Original link: https://github.com/angelos-p/llm-from-scratch

## Build Your Own GPT: Hands-On Workshop Summary

This workshop guides you through building a GPT language model from *scratch*, with no pretrained models or black-box libraries allowed. Inspired by Andrej Karpathy's nanoGPT, it aims to give you a deep understanding of how LLMs work by having you implement every component yourself.

You will create a model of roughly 10 million parameters, capable of generating Shakespeare-like text, that trains on a standard laptop in under an hour. The workshop covers building a character-level tokenizer, designing the transformer architecture (embedding layers, attention mechanism, feed-forward layers), implementing the training loop (loss function, optimizer), and generating text by sampling.

The project is broken into manageable parts with clear explanations, and you end up with working `model.py`, `train.py`, and `generate.py` files that you wrote yourself. It supports Apple Silicon (MPS), NVIDIA (CUDA), CPU, and Google Colab. It uses character-level tokenization for the small dataset and explains how to transition to BPE for larger ones. The workshop aims to demystify AI and take you beyond merely *using* LLMs to actually *understanding* them.


## Original Text

A hands-on workshop where you write every piece of a GPT training pipeline yourself, understanding what each component does and why.

Andrej Karpathy's nanoGPT was my first real exposure to LLMs and transformers. Seeing how a working language model could be built in a few hundred lines of PyTorch completely changed how I thought about AI and inspired me to go deeper into the space.

This workshop is my attempt to give others that same experience. nanoGPT targets reproducing GPT-2 (124M params) and covers a lot of ground. This project strips it down to the essentials and scales it down to a ~10M-param model that trains on a laptop in under an hour — designed to be completed in a single workshop session.

No black-box libraries. No `model = AutoModel.from_pretrained()`. You build it all.

What you'll build: a working GPT model trained from scratch on your MacBook, capable of generating Shakespeare-like text. You'll write:

  • Tokenizer — turning text into numbers the model can process
  • Model architecture — the transformer: embeddings, attention, feed-forward layers
  • Training loop — forward pass, loss, backprop, optimizer, learning rate scheduling
  • Text generation — sampling from your trained model
Requirements:

  • Any laptop or desktop (Mac, Linux, or Windows)
  • Python 3.12+
  • Comfort reading Python code (you don't need ML experience)

Training uses Apple Silicon GPU (MPS), NVIDIA GPU (CUDA), or CPU automatically. Also works on Google Colab — upload the files and run with `!python train.py`.
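For reference, automatic device selection in PyTorch typically looks like the minimal sketch below; the function name `pick_device` is illustrative, and the workshop's actual `train.py` may structure this differently.

```python
import torch

def pick_device() -> torch.device:
    """Prefer CUDA, then Apple Silicon (MPS), then fall back to CPU."""
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

device = pick_device()
print(f"training on {device}")
```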

Install uv if you don't have it:

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

# Windows
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

Then set up the project:

uv sync
mkdir scratchpad && cd scratchpad

If you don't have a local setup, upload the repo to Colab and install dependencies:

!pip install torch numpy tqdm tiktoken

Upload `data/shakespeare.txt` to your Colab files, then write your code in notebook cells or upload `.py` files and run them with `!python train.py`.


Work through the docs in order. Each part walks you through writing a piece of the pipeline, explaining what each component does and why. By the end, you'll have a working `model.py`, `train.py`, and `generate.py` that you wrote yourself.

| Part | What You'll Write | Concepts |
|---|---|---|
| Part 1: Tokenization | Character-level tokenizer | Character encoding, vocabulary size, why BPE fails on small data |
| Part 2: The Transformer | Full GPT model architecture | Embeddings, self-attention, layer norm, MLP blocks |
| Part 3: The Training Loop | Complete training pipeline | Loss functions, AdamW, gradient clipping, LR scheduling (sketched below) |
| Part 4: Text Generation | Inference and sampling | Temperature, top-k, autoregressive decoding |
| Part 5: Putting It All Together | Train on real data, experiment | Loss curves, scaling experiments, next steps |
| Part 6: Competition | Train the best AI poet | Find datasets, scale up, submit your best poem |
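As a preview of the Part 3 ingredients (cross-entropy loss, AdamW, gradient clipping, LR scheduling), a compact training loop might look like the sketch below. The `train` function, its hyperparameter defaults, and the `get_batch` helper are illustrative assumptions, not the workshop's actual `train.py`.

```python
# Hedged sketch of a minimal training loop; names and values are illustrative.
import torch
import torch.nn.functional as F

def train(model, get_batch, max_steps=5000, lr=3e-4, grad_clip=1.0, device="cpu"):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=0.1)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=max_steps)
    model.to(device).train()
    for step in range(max_steps):
        xb, yb = get_batch("train")                     # (B, T) input and target token IDs
        xb, yb = xb.to(device), yb.to(device)
        logits = model(xb)                              # assumed shape (B, T, vocab_size)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), yb.view(-1))
        optimizer.zero_grad(set_to_none=True)
        loss.backward()                                 # backprop
        torch.nn.utils.clip_grad_norm_(model.parameters(), grad_clip)
        optimizer.step()
        scheduler.step()                                # learning-rate schedule
        if step % 500 == 0:
            print(f"step {step}: loss {loss.item():.3f}")
    return model
```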

## Architecture: GPT at a Glance

Input Text
    │
    ▼
┌─────────────────┐
│   Tokenizer     │  "hello" → [20, 43, 50, 50, 53]  (character-level)
└────────┬────────┘
         ▼
┌─────────────────┐
│  Token Embed +  │  token IDs → vectors (n_embd dimensions)
│  Position Embed │  + positional information
└────────┬────────┘
         ▼
┌─────────────────┐
│  Transformer    │  × n_layer
│  Block:         │
│  ┌────────────┐ │
│  │ LayerNorm  │ │
│  │ Self-Attn  │ │  n_head parallel attention heads
│  │ + Residual │ │
│  ├────────────┤ │
│  │ LayerNorm  │ │
│  │ MLP (FFN)  │ │  expand 4x, GELU, project back
│  │ + Residual │ │
│  └────────────┘ │
└────────┬────────┘
         ▼
┌─────────────────┐
│   LayerNorm     │
│  Linear → logits│  vocab_size outputs (probability over next token)
└─────────────────┘
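The transformer block in the diagram (pre-LayerNorm, causal self-attention, a 4x MLP with GELU, and residual connections) can be sketched roughly as follows. This sketch leans on `nn.MultiheadAttention` for brevity, whereas the workshop has you write the attention math yourself, so treat it as an illustration of the structure only.

```python
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.mha = nn.MultiheadAttention(n_embd, n_head, batch_first=True)
        # Boolean causal mask: True = "may not attend" (future positions).
        mask = torch.triu(torch.ones(block_size, block_size, dtype=torch.bool), diagonal=1)
        self.register_buffer("mask", mask)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        out, _ = self.mha(x, x, x, attn_mask=self.mask[:T, :T], need_weights=False)
        return out

class Block(nn.Module):
    def __init__(self, n_embd: int, n_head: int, block_size: int):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.attn = CausalSelfAttention(n_embd, n_head, block_size)
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),   # expand 4x
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),   # project back
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.attn(self.ln1(x))   # self-attention + residual
        x = x + self.mlp(self.ln2(x))    # feed-forward + residual
        return x
```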

## Model Configs for This Workshop

| Config | Params | n_layer | n_head | n_embd | Train Time (M3 Pro) |
|---|---|---|---|---|---|
| Tiny | ~0.5M | 2 | 2 | 128 | ~5 min |
| Small | ~4M | 4 | 4 | 256 | ~20 min |
| Medium (default) | ~10M | 6 | 6 | 384 | ~45 min |

All configs use character-level tokenization (`vocab_size=65`) and `block_size=256`.
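For concreteness, the default Medium configuration above could be captured in a small config object like the sketch below; the field names mirror the table, but the workshop's actual config may be organized differently.

```python
from dataclasses import dataclass

@dataclass
class GPTConfig:
    vocab_size: int = 65    # character-level vocabulary
    block_size: int = 256   # context length (tokens per training example)
    n_layer: int = 6        # transformer blocks
    n_head: int = 6         # attention heads per block
    n_embd: int = 384       # embedding / hidden dimension

config = GPTConfig()        # Medium (~10M params); 2/2/128 for Tiny, 4/4/256 for Small
```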

## Tokenization: Characters vs BPE

This workshop uses character-level tokenization on Shakespeare. BPE tokenization (GPT-2's 50k vocab) doesn't work on small datasets — most token bigrams are too rare for the model to learn patterns from.
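A character-level tokenizer is only a few lines: build a vocabulary from the unique characters in the corpus and map each character to an integer. The sketch below is illustrative, not the workshop's exact Part 1 code.

```python
# Illustrative character-level tokenizer: the vocabulary is the set of
# unique characters in the training corpus (~65 for Shakespeare).
text = open("data/shakespeare.txt", encoding="utf-8").read()
chars = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(chars)}   # char -> integer ID
itos = {i: ch for ch, i in stoi.items()}       # integer ID -> char

def encode(s: str) -> list[int]:
    return [stoi[c] for c in s]

def decode(ids: list[int]) -> str:
    return "".join(itos[i] for i in ids)

print(len(chars))        # vocab_size
print(encode("hello"))   # e.g. [20, 43, 50, 50, 53], as in the diagram above
assert decode(encode("hello")) == "hello"
```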

| Tokenizer | Vocab Size | Dataset Size Needed |
|---|---|---|
| Character-level | ~65 | Small (Shakespeare, ~1MB) |
| BPE (tiktoken) | 50,257 | Large (TinyStories+, 100MB+) |

Part 5 covers switching to BPE for larger datasets.
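For larger datasets, the switch is to a BPE tokenizer such as GPT-2's via `tiktoken`. A minimal usage sketch (assuming `tiktoken` is installed, as in the dependency list above); this is illustration, not the workshop's Part 5 code.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")      # GPT-2's BPE vocabulary
ids = enc.encode("To be, or not to be")  # list of integer token IDs
assert enc.decode(ids) == "To be, or not to be"
print(enc.n_vocab)                       # 50257
```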
