Does RL Incentivize Reasoning in LLMs Beyond the Base Model?

Original link: https://limit-of-rlvr.github.io/

Experiments in math, coding, and visual reasoning evaluate the effect of RLVR on the reasoning ability of large language models. In math, RLVR improves initial accuracy (at low k) on benchmarks such as GSM8K, but shrinks the overall range of problems the model can solve (at high k), pointing to a trade-off between focused accuracy and broad problem coverage. The same pattern appears in coding: applying RLVR to CodeR1-Zero-Qwen2.5-7B raises single-sample accuracy (pass@1) on LiveCodeBench but caps the gains from additional attempts (k = 128), indicating reduced exploration diversity. Visual reasoning experiments with Qwen-2.5-VL-7B show consistent results, implying that RLVR amplifies existing reasoning abilities without fundamentally changing problem-solving strategies. Across all domains, RLVR improves immediate performance, yet it also appears to limit the model's capacity to explore diverse solutions and to tackle a wider range of challenging problems.

A research paper titled "Does RL Incentivize Reasoning in LLMs Beyond the Base Model?" is being discussed on Hacker News. The original poster dislikes question-style titles and summarizes the paper's finding: reinforcement learning (RL) improves an LLM's sampling efficiency at the cost of its ultimate reasoning reach. RL-trained models outperform their base counterparts when the number of attempts is limited, but with enough attempts the base models eventually overtake them. Another commenter points out a flaw in the paper's methodology, specifically questioning the chain-of-thought validity check. They cite an example in which the model reaches the correct answer despite several intermediate arithmetic errors that happen to cancel out. The commenter suggests investigating whether RL's sampling-efficiency gains stem from improved arithmetic (perhaps via tools) or from better strategic problem solving, which calls into question the true source of RL's advantage on reasoning tasks.

Original text

We conducted experiments across three representative domains to evaluate the effect of RLVR on the reasoning ability boundaries of base and RLVR models.

Math

In the math experiments, we evaluate multiple LLM families (Qwen-2.5 and LLaMA-3.1) and their RL-trained variants on benchmarks like GSM8K, MATH500, and AIME24. We analyze pass@k curves to compare base and RL-trained models, observing that RL improves low-k performance but reduces problem coverage at high k. We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses. Additionally, we examine Oat-Zero-trained models and filter guessable problems to focus on challenging cases. The results show base models maintain broader reasoning coverage despite RL's initial accuracy gains.
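Pass@k curves of this kind are typically computed with the unbiased estimator from the Codex paper (Chen et al., 2021): draw n samples per problem, count the c correct ones, and estimate the probability that at least one of k random draws succeeds. The paper does not reproduce its evaluation code here, so the following is a minimal sketch of that standard estimator, not the authors' exact pipeline.

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: total samples drawn for a problem
    c: number of those samples that were correct
    k: attempt budget being evaluated (k <= n)
    """
    if n - c < k:
        # Fewer than k incorrect samples exist, so any size-k subset
        # is guaranteed to contain at least one correct sample.
        return 1.0
    # 1 - C(n-c, k) / C(n, k), computed stably as a running product.
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Illustrative numbers (not from the paper): a base model that solves a
# problem only 12 times out of 256 samples has pass@1 under 5%, yet its
# pass@128 is already near 1.0 -- which is why coverage diverges at high k.
print(pass_at_k(n=256, c=12, k=1))    # ~0.047
print(pass_at_k(n=256, c=12, k=128))  # ~0.9998
```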

Coding

In the coding experiments, we evaluate the RLVR-trained model CodeR1-Zero-Qwen2.5-7B, derived from Qwen2.5-7B-Instruct-1M, on benchmarks like LiveCodeBench, HumanEval+, and MBPP+. We assess performance using pass@k metrics, measuring correctness based on predefined test cases. The results show RLVR improves single-sample pass@1 scores but reduces coverage at higher sampling counts (k = 128). The original model exhibits continued potential for improvement with larger k, while RLVR's performance plateaus. This indicates RLVR enhances deterministic accuracy but limits exploration diversity.
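Correctness here is binary per sample: a completion counts only if it passes every predefined test case. The benchmarks' real harnesses are more elaborate (sandboxing, per-test timeouts, resource limits), but a minimal sketch of the grading step, assuming a hypothetical assert-based test script similar to what HumanEval+ and MBPP+ ship, looks like this.

```python
import subprocess
import sys
import tempfile

def sample_passes(candidate: str, tests: str, timeout: float = 10.0) -> bool:
    """Return True iff one sampled completion passes all its test cases.

    `candidate` is the model-generated solution and `tests` an assert-based
    test script (hypothetical format; the actual benchmark harness differs).
    WARNING: this executes untrusted model output -- a real harness must
    sandbox the subprocess rather than run it directly.
    """
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate + "\n\n" + tests)
        path = f.name
    try:
        proc = subprocess.run([sys.executable, path],
                              capture_output=True, timeout=timeout)
        return proc.returncode == 0  # any failed assert -> nonzero exit code
    except subprocess.TimeoutExpired:
        return False

# Per problem: draw n samples, count the correct ones as c, then feed
# (n, c, k) into the pass@k estimator above for k up to 128.
```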

Visual Reasoning

In the experiments on visual reasoning, we evaluate Qwen-2.5-VL-7B on filtered visual reasoning benchmarks (MathVista and MathVision), removing multiple-choice questions to focus on robust problem-solving. The improvements from RLVR in visual reasoning align with those seen in math and coding benchmarks, indicating that the original model already covers a broad range of solvable problems, even in multimodal tasks. The consistency across domains suggests that RLVR enhances reasoning capabilities without fundamentally altering the model's problem-solving approach.
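Multiple-choice items are removed because k independent guesses over a handful of options can hit the right choice without any valid reasoning, which inflates pass@k at large k. Below is a sketch of that filtering step; it assumes MathVista's Hugging Face release and its `question_type` metadata field, so the dataset id, split name, and field values should be verified per benchmark.

```python
from datasets import load_dataset  # pip install datasets

def is_free_form(example: dict) -> bool:
    """Keep free-form questions; drop multiple-choice items, whose answers
    are guessable at high k without genuine problem-solving."""
    # Field name and value assumed from MathVista's metadata; adjust if
    # the benchmark uses a different schema.
    return example.get("question_type") != "multi_choice"

testmini = load_dataset("AI4Math/MathVista", split="testmini")
filtered = testmini.filter(is_free_form)
print(f"Kept {len(filtered)} of {len(testmini)} problems after filtering.")
```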
