Can reinforcement learning for LLMs scale beyond math and coding tasks? Probably

Original link: https://arxiv.org/abs/2503.23829

This paper investigates reinforcement learning with verifiable rewards (RLVR) in diverse real-world domains such as medicine, chemistry, and economics, where structured reference answers are typically unavailable. Existing RLVR methods rely mainly on answers that are easy to verify, which limits their applicability. The authors find that even in these less structured domains, large language models (LLMs) deliver consistent binary verification judgments when given expert-written reference material. To overcome the limitations of binary rewards, they introduce a generative scoring technique that produces "soft" reward signals suitable for free-form answers. They show that a relatively small (7B) LLM can be trained as an effective cross-domain generative reward model without extensive domain-specific annotation. Using these model-based rewards, their RLVR framework clearly outperforms state-of-the-art open-source aligned LLMs such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B in free-form answer settings across domains. This work substantially improves the robustness, flexibility, and scalability of RLVR, paving the way for practical applications in complex scenarios with noisy or unstructured data.
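The summary above describes a generative scoring technique in which a small judge model produces soft rewards for free-form answers. Below is a minimal, illustrative sketch of one way such a judge could work: the probability the judge assigns to "Yes" when asked whether a candidate answer matches an expert-written reference is used as a soft reward in [0, 1]. The model name, prompt template, and scoring rule are assumptions for illustration, not the paper's exact recipe.

```python
# Illustrative sketch only: a small generative judge scores a free-form answer
# against an expert-written reference; P("Yes") serves as a soft reward in [0, 1].
# Model name and prompt format are assumptions, not taken from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

JUDGE_NAME = "Qwen/Qwen2.5-7B-Instruct"  # stand-in for a 7B generative reward model
tokenizer = AutoTokenizer.from_pretrained(JUDGE_NAME)
judge = AutoModelForCausalLM.from_pretrained(JUDGE_NAME, torch_dtype=torch.bfloat16)

PROMPT = (
    "You are grading a free-form answer against an expert reference.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Candidate answer: {candidate}\n"
    "Is the candidate answer correct? Reply with Yes or No.\nAnswer:"
)

def soft_reward(question: str, reference: str, candidate: str) -> float:
    """Return the judge's probability of 'Yes' as a soft reward instead of a hard 0/1."""
    text = PROMPT.format(question=question, reference=reference, candidate=candidate)
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        logits = judge(**inputs).logits[0, -1]          # next-token logits
    yes_id = tokenizer.encode(" Yes", add_special_tokens=False)[0]
    no_id = tokenizer.encode(" No", add_special_tokens=False)[0]
    pair = torch.softmax(logits[[yes_id, no_id]], dim=-1)
    return pair[0].item()                               # probability mass on "Yes"
```

Compared with a hard 0/1 verdict, this soft score still rewards partially correct free-form answers, which is the motivation the summary gives for moving beyond binary verification.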

A Hacker News thread discusses the paper, which asks whether reinforcement learning (RL) for large language models (LLMs) can extend beyond math and coding to broader applications. The top comment references the DeepSeek paper and points out a challenge: RL models can "game" the reward system, making it difficult to train them effectively for wider use. The commenter notes that the DeepSeek team found neural reward models vulnerable to reward hacking, which complicates the training pipeline. Follow-up comments express frustration with the issue, observing that reward hacking would be easier to detect if the intermediate steps inside an LLM could be tracked. Commenters ask whether there are alternatives that avoid the problem without having to build a "perfect judge" for every scenario. The overall tone suggests that while extending RL to more diverse LLM tasks is desirable, preventing reward hacking remains a major obstacle.

Original text

Crossing the Reward Bridge: Expanding RL with Verifiable Rewards Across Diverse Domains, by Yi Su and 7 other authors

Abstract: Reinforcement learning with verifiable rewards (RLVR) has demonstrated significant success in enhancing mathematical reasoning and coding performance of large language models (LLMs), especially when structured reference answers are accessible for verification. However, its extension to broader, less structured domains remains unexplored. In this work, we investigate the effectiveness and scalability of RLVR across diverse real-world domains including medicine, chemistry, psychology, economics, and education, where structured reference answers are typically unavailable. We reveal that binary verification judgments on broad-domain tasks exhibit high consistency across various LLMs provided expert-written reference answers exist. Motivated by this finding, we utilize a generative scoring technique that yields soft, model-based reward signals to overcome limitations posed by binary verifications, especially in free-form, unstructured answer scenarios. We further demonstrate the feasibility of training cross-domain generative reward models using relatively small (7B) LLMs without the need for extensive domain-specific annotation. Through comprehensive experiments, our RLVR framework establishes clear performance gains, significantly outperforming state-of-the-art open-source aligned models such as Qwen2.5-72B and DeepSeek-R1-Distill-Qwen-32B across domains in free-form settings. Our approach notably enhances the robustness, flexibility, and scalability of RLVR, representing a substantial step towards practical reinforcement learning applications in complex, noisy-label scenarios.
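The abstract does not specify which policy-optimization algorithm consumes the soft rewards. Purely as an illustration, the sketch below shows one common way soft, model-based rewards could drive a group-relative (GRPO-style) policy-gradient update, with soft_reward() standing in for the generative judge sketched earlier; the function names, tensor shapes, and group-normalization scheme are assumptions, not the paper's stated method.

```python
# Illustrative sketch only: soft rewards from a generative judge feeding a
# group-relative, REINFORCE-style update. Shapes and normalization are assumptions.
import torch

def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize soft rewards within a group of samples drawn for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-6)

def policy_gradient_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """REINFORCE-style loss: -(advantage * summed token log-probs) averaged over samples."""
    adv = group_relative_advantages(rewards)
    return -(adv * logprobs.sum(dim=-1)).mean()

# Example: 4 sampled answers to one prompt, each judged with a soft reward in [0, 1].
rewards = torch.tensor([0.9, 0.2, 0.7, 0.4])   # e.g. outputs of soft_reward(...)
logprobs = torch.randn(4, 128)                 # per-token log-probs of the samples (placeholder)
loss = policy_gradient_loss(logprobs, rewards)
```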
From: Yi Su
[v1] Mon, 31 Mar 2025 08:22:49 UTC (385 KB)
[v2] Tue, 1 Apr 2025 14:48:02 UTC (537 KB)