Tiny-diffusion: A minimal implementation of probabilistic diffusion models

原始链接: https://github.com/tanelp/tiny-diffusion

This project implements probabilistic diffusion models for 2D datasets in PyTorch, using a dinosaur-shaped point cloud to visualize the data. Hyperparameter ablation studies show the model is sensitive to the learning rate: results were initially poor and were fixed by adjusting it. On the "line" dataset the model struggles with simple line-shaped data (the corners come out fuzzy), while lengthening the diffusion process (more timesteps) markedly improves output quality and yields more complete results. An experiment with a quadratic diffusion schedule was not successful, suggesting that other schedules such as cosine or sigmoid functions are worth exploring. Model capacity (hidden layer size) does not appear to be the limiting factor. The model benefits from timestep information, but the specific encoding method matters less. In addition, applying sinusoidal embeddings to the input coordinates improves learning of high-frequency features, echoing findings in other low-dimensional applications such as mapping pixel coordinates to colors.

Hacker News users are discussing "Tiny-diffusion", a minimal implementation of probabilistic diffusion models on GitHub. The post drew considerable interest, with users highlighting its potential impact. One user shared a link to their own more elaborate implementation, which includes class guidance. Another appreciated that the hyperparameter search outputs are included, calling them a valuable tool for avoiding tuning pitfalls. A recurring view is that minimal AI implementations like this represent the future of edge AI, enabling specialized applications on embedded systems. Commenters argue that this approach, rather than large models, is essential for the broad practical adoption of AI; the discussion emphasizes the value of lightweight, purpose-built AI models for real-world applications.

Original Article

A minimal PyTorch implementation of probabilistic diffusion models for 2D datasets. Get started by running python ddpm.py -h to explore the available options for training.

A visualization of the forward diffusion process being applied to a dataset of one thousand 2D points. Note that the dinosaur is not a single training example, it represents each 2D point in the dataset.
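As a point of reference, the closed-form forward step can be written in a few lines. This is a minimal sketch in standard DDPM notation, not the repository's exact code; the schedule values below are assumptions chosen for illustration.

```python
import torch

# Closed-form forward diffusion: sample x_t directly from x_0.
# q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I)
def add_noise(x0, t, alphas_cumprod):
    """x0: (N, 2) points, t: (N,) integer timesteps, alphas_cumprod: (T,) tensor."""
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].unsqueeze(-1)                  # (N, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise                                        # the noise is the training target

# Example: a linear beta schedule over T = 50 timesteps (illustrative values).
T = 50
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
```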

This illustration shows how the reverse process recovers the distribution of the training data.
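The reverse process is the standard DDPM ancestral sampling loop. The sketch below assumes a trained noise-prediction model called as model(x_t, t); it follows the textbook update rather than the repository's exact implementation.

```python
import torch

@torch.no_grad()
def sample(model, n_points, betas):
    """Generate n_points 2D samples by running the learned reverse process."""
    alphas = 1.0 - betas
    alphas_cumprod = torch.cumprod(alphas, dim=0)
    x = torch.randn(n_points, 2)                       # start from pure Gaussian noise
    for t in reversed(range(len(betas))):
        t_batch = torch.full((n_points,), t, dtype=torch.long)
        eps = model(x, t_batch)                        # predicted noise at step t
        coef = (1.0 - alphas[t]) / (1.0 - alphas_cumprod[t]).sqrt()
        mean = (x - coef * eps) / alphas[t].sqrt()
        if t > 0:
            x = mean + betas[t].sqrt() * torch.randn_like(x)
        else:
            x = mean
    return x
```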

I have run a series of ablation experiments on hyperparameters such as learning rate and model size, and visualized the learning process. The columns in the graphs represent the checkpoint epoch, and the rows indicate the hyperparameter values. Each cell displays one thousand generated 2D points.
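A grid of that kind can be reproduced with a few lines of matplotlib. The helper below is purely illustrative and assumes samples[i][j] holds the (1000, 2) array generated for hyperparameter value i at checkpoint j; it is not the repository's plotting code.

```python
import matplotlib.pyplot as plt

def plot_ablation_grid(samples, row_labels, col_labels):
    """samples[i][j]: (N, 2) array of generated points for row i, checkpoint column j."""
    n_rows, n_cols = len(row_labels), len(col_labels)
    fig, axes = plt.subplots(n_rows, n_cols, figsize=(2 * n_cols, 2 * n_rows), squeeze=False)
    for i in range(n_rows):
        for j in range(n_cols):
            ax = axes[i][j]
            pts = samples[i][j]
            ax.scatter(pts[:, 0], pts[:, 1], s=1)
            ax.set_xticks([]); ax.set_yticks([])
            if i == 0:
                ax.set_title(f"epoch {col_labels[j]}", fontsize=8)   # columns: checkpoint epochs
            if j == 0:
                ax.set_ylabel(str(row_labels[i]), fontsize=8)        # rows: hyperparameter values
    return fig
```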

The learning process is sensitive to the learning rate. At first, the model's output was poor, causing me to suspect a bug. However, simply changing the learning rate value resolved the problem.

The current model configuration doesn't work well on the line dataset, which I consider the most basic among them. The corners should be clear and sharp, but they are fuzzy.

A longer diffusion process results in a better output. With fewer timesteps, the dinosaur is incomplete, missing points from the top and bottom.

The quadratic schedule does not yield better results. Other schedules like cosine or sigmoid should also be considered.
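For reference, the schedules mentioned here are commonly defined as below. The linear and quadratic forms follow the usual DDPM parameterization and the cosine variant follows Nichol & Dhariwal (2021); the exact constants used in the repository may differ.

```python
import math
import torch

def make_beta_schedule(kind, T, beta_start=1e-4, beta_end=0.02):
    if kind == "linear":
        return torch.linspace(beta_start, beta_end, T)
    if kind == "quadratic":
        # Interpolate in sqrt(beta) space, then square.
        return torch.linspace(beta_start ** 0.5, beta_end ** 0.5, T) ** 2
    if kind == "cosine":
        # Derive betas from a cosine-shaped alpha_bar curve.
        s = 0.008
        steps = torch.arange(T + 1, dtype=torch.float64)
        alpha_bar = torch.cos((steps / T + s) / (1 + s) * math.pi / 2) ** 2
        betas = 1.0 - alpha_bar[1:] / alpha_bar[:-1]
        return betas.clamp(max=0.999).float()
    raise ValueError(f"unknown schedule: {kind}")
```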

The capacity of the model doesn't seem to be a bottleneck, as similar results are obtained across various hidden layer sizes.

As in the hidden-size ablation, the model's capacity (here, the number of hidden layers) does not seem to be a limiting factor.
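The two capacity knobs above (hidden size and number of hidden layers) map onto a noise-prediction MLP roughly like the following. This is a simplified sketch with assumed defaults, not the repository's exact architecture; an instance of this class could stand in for the model argument in the sampling sketch earlier.

```python
import torch
from torch import nn

class NoiseMLP(nn.Module):
    """Predict the noise added to a 2D point, conditioned on the timestep.
    hidden_size and hidden_layers are the capacity knobs varied in the ablations."""
    def __init__(self, hidden_size=128, hidden_layers=3, emb_dim=32, num_timesteps=50):
        super().__init__()
        self.t_embed = nn.Embedding(num_timesteps, emb_dim)   # simple learned timestep embedding
        layers = [nn.Linear(2 + emb_dim, hidden_size), nn.GELU()]
        for _ in range(hidden_layers - 1):
            layers += [nn.Linear(hidden_size, hidden_size), nn.GELU()]
        layers += [nn.Linear(hidden_size, 2)]
        self.net = nn.Sequential(*layers)

    def forward(self, x, t):
        h = torch.cat([x, self.t_embed(t)], dim=-1)
        return self.net(h)
```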

positional embedding (timestep)

The model benefits from the timestep information, but the specific method of encoding the timestep is not important.
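One common choice is the transformer-style sinusoidal embedding sketched below; the repository compares several encodings, and the dimension and frequency range here are assumptions.

```python
import math
import torch

def sinusoidal_embedding(t, dim=32, max_period=10_000):
    """Map integer timesteps of shape (N,) to (N, dim) sinusoidal features."""
    half = dim // 2
    freqs = torch.exp(-math.log(max_period) * torch.arange(half) / half)
    args = t.float().unsqueeze(-1) * freqs                # (N, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
```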

positional embedding (inputs)

The use of sinusoidal embeddings for the inputs helps with learning high-frequency functions in low-dimensional problem domains, such as mapping each (x, y) pixel coordinate to (r, g, b) color, as demonstrated in this study. The same holds true in the current scenario.
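Applied to the inputs, the same idea becomes a Fourier-feature encoding of each coordinate, which makes it easier for the MLP to represent high-frequency structure. This is a rough sketch; the number of frequencies and the scale are arbitrary choices, not the repository's values.

```python
import torch

def fourier_features(xy, n_freqs=8, scale=25.0):
    """Encode (N, 2) coordinates with sines and cosines at geometrically spaced frequencies."""
    freqs = scale ** (torch.arange(n_freqs) / n_freqs)     # (n_freqs,)
    args = xy.unsqueeze(-1) * freqs                        # (N, 2, n_freqs)
    feats = torch.cat([torch.sin(args), torch.cos(args)], dim=-1)
    return feats.flatten(start_dim=1)                      # (N, 4 * n_freqs)
```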
