Understanding R1-Zero-Like Training: A Critical Perspective

Original link: https://github.com/sail-sg/understand-r1-zero

On 21 March 2025 we released our paper, models, and codebase on R1-Zero training, all implemented with Oat, a modular and efficient reinforcement-learning framework for large language models. Our study critically examines the base models and the reinforcement learning used in R1-Zero-like training, showing that models such as DeepSeek-V3-Base and Qwen2.5 exhibit strong reasoning capabilities even without elaborate prompting. We also identify biases in GRPO and propose Dr. GRPO, a simple fix that improves token efficiency.

Our analysis highlights how the template and the question set jointly shape the RL dynamics. We find that a mismatched template initially hinders reasoning, while even a small question set can induce reasoning ability if it is aligned with the pretraining distribution. In addition, domain-specific pretraining raises RL performance, and Dr. GRPO mitigates the length bias in GRPO.

Our minimalist R1-Zero recipe RL-tunes Qwen2.5-Math-7B with Dr. GRPO on MATH level 3-5 questions and achieves state-of-the-art performance with very modest compute. We provide installation instructions and training scripts for reproduction, as well as scripts for evaluating the released models and the baselines.


Original text
  • 21/03/2025: 🎉 We release our paper, models and codebase. Our R1-Zero training is implemented with 🌾 Oat, a highly modular, research-friendly and efficient LLM RL framework.
  • Understanding R1-Zero-Like Training

  • There May Not Be Aha Moment in R1-Zero-like Training — A Pilot Study

  • OAT: A research-friendly framework for LLM online alignment

To understand R1-Zero-like training, we critically examine two core components: base models and reinforcement learning. We highlight our findings below.

On base models:

  1. DeepSeek-V3-Base already exhibits an "Aha moment".

  2. As the popular choice for R1-Zero-like training, Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates: the average benchmark scores improve by ~60% compared with traditional 4-shot prompting! (A quick probe of this behavior is sketched below.)
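
For intuition, here is how one might probe a base model with and without an answer-inducing template. This is an illustrative sketch, not the paper's evaluation pipeline: the question, the prompt variants, and the decoding settings are assumptions made purely for the example.

from vllm import LLM, SamplingParams

# Illustrative probe (not the paper's evaluation suite): query a Qwen2.5 base
# model with and without a minimal template and inspect whether it already
# produces step-by-step reasoning.
llm = LLM(model="Qwen/Qwen2.5-Math-7B")
question = "If 3x + 7 = 22, what is x?"  # made-up example question

prompts = [
    question,                          # raw question, no template at all
    f"Question: {question}\nAnswer:",  # minimal QA-style template
]
params = SamplingParams(temperature=0.0, max_tokens=512)
for out in llm.generate(prompts, params):
    print("---\n" + out.outputs[0].text.strip())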

On reinforcement learning:

  1. GRPO leads to biased optimization! We propose a simple fix that improves token efficiency while maintaining reasoning performance, termed Dr. GRPO (GRPO Done Right); see the sketch after this list.

  2. In R1-Zero-like training, the template and the question set perform a duet that shapes the RL dynamics.
    • (Left Plot) For Qwen2.5-Math-1.5B, a mismatched template (e.g., the R1 template) in fact destroys the reasoning capabilities before RL reconstructs them. This makes the improvement look impressive on the surface.
    • (Middle Plot) However, if a template does not deviate too far from the pretraining distribution, even a small and completely o.o.d. question set (e.g., GSM8K) can induce the reasoning ability equally well, by reinforcing correct reasoning behaviors rather than infusing new knowledge.

  3. Beyond Qwen, Llama can also be RL-tuned from base models. In this case, domain-specific pretraining improves the RL ceiling.
    • (Right Plot) GRPO can even make Llama with math knowledge "Aha" by increasing the output length; however, this is likely due to its length bias, which can be removed by Dr. GRPO.
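
To make the GRPO biases and the Dr. GRPO fix concrete, here is a minimal, schematic sketch in PyTorch. It is not the repository's implementation: the REINFORCE-style loss is a simplification of the PPO-style clipped objective actually used, and the tensor names are illustrative assumptions.

import torch

def group_advantages(rewards: torch.Tensor, use_std: bool) -> torch.Tensor:
    # rewards: (G,) scalar rewards for G sampled responses to one question.
    centered = rewards - rewards.mean()
    if use_std:
        # GRPO: divide by the group std, which biases updates toward questions
        # whose rewards have low variance (i.e., too easy or too hard).
        return centered / (rewards.std() + 1e-8)
    return centered  # Dr. GRPO: mean-centering only

def policy_gradient_loss(logps: torch.Tensor, mask: torch.Tensor,
                         adv: torch.Tensor, per_response_norm: bool) -> torch.Tensor:
    # logps, mask: (G, T) token log-probs and padding mask; adv: (G,).
    per_token = -logps * adv.unsqueeze(1) * mask
    if per_response_norm:
        # GRPO: divide each response's loss by its own length |o_i|, which
        # under-penalizes long wrong answers and encourages length growth.
        return (per_token.sum(dim=1) / mask.sum(dim=1)).mean()
    # Dr. GRPO: divide by a constant (here the generation budget T), so the
    # gradient scale does not depend on how long each sample happens to be.
    return per_token.sum(dim=1).mean() / mask.shape[1]

Dropping the std term and the per-response length normalization is the whole change; everything else in the training loop stays the same.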

Our minimalist R1-Zero recipe:

Our analysis suggests a minimalist recipe for R1-Zero-like training:

We RL-tune Qwen2.5-Math-7B using the (unbiased) Dr. GRPO algorithm on MATH level 3-5 questions with the Qwen-Math template, and achieve state-of-the-art performance with only 27 hours of compute on 8× A100 GPUs.

If you are interested in more details, please check out our paper!

We recommend a clean python==3.10 environment for development.

# Install vllm & oat, the LLM RL framework on which we developed R1-Zero training.
pip install vllm==0.7.2 && pip install oat-llm==0.0.9

# Install this package locally to use the math grader.
git clone git@github.com:sail-sg/understand-r1-zero.git && cd understand-r1-zero
pip install -e .

We implement R1-Zero training by extending Oat's Learner and Actor components. Please see train_zero_math.py for a step-by-step guide.
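
For intuition about the rule-based reward that drives training (the `--oracle math` option below), here is a toy sketch. The function names and the exact-match rule are illustrative assumptions; the repository's math grader additionally handles mathematically equivalent forms (fractions, LaTeX formatting, and so on).

import re

def extract_boxed(completion: str) -> str | None:
    # Treat the last \boxed{...} span in the model output as its final answer.
    matches = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return matches[-1].strip() if matches else None

def math_reward(completion: str, reference: str) -> float:
    # Binary verifiable reward: 1.0 iff the extracted answer matches the
    # reference answer (exact string match in this toy version).
    predicted = extract_boxed(completion)
    return 1.0 if predicted is not None and predicted == reference.strip() else 0.0

# Example: math_reward("... so the answer is \\boxed{5}.", "5") returns 1.0.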

# Patch LD_LIBRARY_PATH to avoid dependency errors:
export LD_LIBRARY_PATH=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))"):$LD_LIBRARY_PATH

# Run the experiment (tested on 8 x A100-40G) with Dr. GRPO:
# (change to `--critic_type grpo` for running GRPO)
python train_zero_math.py \
    --critic_type drgrpo \
    --gpus 8 \
    --enable_prefix_caching \
    --collocate \
    --vllm_sleep \
    --vllm_gpu_ratio 0.35 \
    --gradient-checkpointing \
    --flash-attn \
    --bf16 \
    --rnd-seed \
    --learning_rate 0.000001 \
    --lr_scheduler constant \
    --num_ppo_epochs 1 \
    --beta 0 \
    --oracle_type reward \
    --oracle math \
    --pretrain Qwen/Qwen2.5-Math-1.5B \
    --prompt_template r1 \
    --zero-stage 2 \
    --ref_offload \
    --prompt_data ./datasets/train/math_12k \
    --train_split train \
    --input_key problem \
    --output_key answer \
    --max-train 9999999 \
    --num_prompt_epoch 20 \
    --prompt_max_length 1024 \
    --num_samples 8 \
    --temperature 1 \
    --top_p 1 \
    --generate_max_length 3000 \
    --save_steps -1 \
    --train_batch_size 128 \
    --rollout_batch_size 128 \
    --rollout_batch_size_per_device 16 \
    --pi_buffer_maxlen_per_device 128 \
    --eval_batch_size 200 \
    --eval_steps 16 \
    --eval_temperature 0 \
    --eval_generate_max_length 3000 \
    --eval_data ./datasets/evaluation_suite \
    --eval_input_key input \
    --use-wb \
    --wb-run-name qwen2.5-Math-1.5b-r1-zero \
    --wb_project oat-zero

More example scripts are provided in the repository.

# Evaluate our models:
python evaluate_model.py --model_name sail/Qwen2.5-Math-7B-Oat-Zero
python evaluate_model.py --model_name sail/Qwen2.5-Math-1.5B-Oat-Zero
python evaluate_model.py --model_name sail/Llama-3.2-3B-Oat-Zero --template r1

# Evaluate baseline models:
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-1.5B
python evaluate_model.py --model_name Qwen/Qwen2.5-Math-7B
python evaluate_model.py --model_name hkust-nlp/Qwen-2.5-Math-7B-SimpleRL-Zero
python evaluate_model.py --model_name PRIME-RL/Eurus-2-7B-PRIME-Zero
python evaluate_model.py --model_name Open-Reasoner-Zero/Open-Reasoner-Zero-7B

If you find our work useful for your research, please consider citing:

@misc{liu2025understanding,
  title={Understanding R1-Zero-Like Training: A Critical Perspective},
  author={Zichen Liu and Changyu Chen and Wenjun Li and Penghui Qi and Tianyu Pang and Chao Du and Wee Sun Lee and Min Lin},
  year={2025},
  howpublished={\url{https://github.com/sail-sg/understand-r1-zero}},
}