(Comments)

Original link: https://news.ycombinator.com/item?id=43760625

Hacker News is discussing a research paper titled "Does RL Incentivize Reasoning in LLMs Beyond the Base Model?". The original poster dislikes question-style titles and summarizes the paper's finding: reinforcement learning (RL) boosts the sampling efficiency of large language models (LLMs), but at the cost of a reduced reasoning capacity boundary. RL-trained models outperform non-RL models when given only a few attempts, while non-RL models eventually surpass them when given many attempts.

Another commenter points out a flaw in the paper's methodology, specifically questioning the "chain-of-thought validity" check. They give an example in which the model reaches the correct final answer despite intermediate arithmetic errors that happen to cancel out. The commenter suggests investigating whether RL's sampling-efficiency gains come from improved arithmetic (something that could also be achieved by giving the model a calculator tool) or from more often choosing the right approach to the problem, which calls into question the true source of RL's advantage on reasoning tasks.


Original text
Does RL Incentivize Reasoning in LLMs Beyond the Base Model? (limit-of-rlvr.github.io)
4 points by leodriesch 2 hours ago | 2 comments

I don't like papers that ask a question in the title, so here's the answer:

"RL boosts sampling efficiency but reduces the reasoning capacity boundary."

Perhaps better to put it like this: given one or a few attempts, RL-trained models beat non-RL models; given many attempts, non-RL models come up with better answers.
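To make that crossover concrete, here is a minimal sketch. It assumes the comparison uses the standard unbiased pass@k estimator, and the per-problem success counts are invented purely for illustration; neither the estimator choice nor the numbers come from the thread or the paper.

    # Minimal sketch of the pass@k crossover described above. Assumptions (not
    # from the thread or the paper): "attempts" corresponds to the standard
    # unbiased pass@k estimator, and the per-problem counts are invented.
    from math import comb

    def pass_at_k(n: int, c: int, k: int) -> float:
        """Chance that at least one of k samples, drawn without replacement
        from n samples of which c are correct, is correct."""
        if n - c < k:
            return 1.0
        return 1.0 - comb(n - c, k) / comb(n, k)

    n = 64  # samples drawn per problem
    # Hypothetical per-problem correct counts on a 4-problem benchmark:
    rl_counts = [40, 35, 0, 0]    # RL model: reliable on problems it solves, blind to the rest
    base_counts = [10, 8, 2, 1]   # base model: noisier, but covers every problem

    for k in (1, 64):
        rl = sum(pass_at_k(n, c, k) for c in rl_counts) / len(rl_counts)
        base = sum(pass_at_k(n, c, k) for c in base_counts) / len(base_counts)
        print(f"pass@{k}: RL={rl:.2f}  base={base:.2f}")
    # pass@1:  RL wins (better sampling efficiency)
    # pass@64: base wins (larger coverage / reasoning capacity boundary)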



They write "We manually inspect CoT validity to ensure correct answers stem from valid reasoning, not lucky guesses." but the example answer they show at the end only gets the correct number due to two errors canceling out. The model calculates 195+367+562+900 and gets 1924 instead of 2024, and also turns -437 - 2*234 into -805 instead of -905, but in total 1924-805 = 2024-905 = 1119 and from there the remaining steps are correct again.

It would be interesting to know how much of the sampling efficiency improvement from reinforcement learning is due to being better at basic arithmetic (something which could also be achieved by giving the model access to a calculator tool) and how much is due to choosing the correct approach for solving the problem more often.
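For reference, the two errors the commenter describes do cancel exactly; a few lines of Python confirm it (all values taken directly from the comment):

    # Numeric check of the canceling errors described in the comment above.
    correct_sum = 195 + 367 + 562 + 900   # = 2024; the model got 1924 (off by -100)
    model_sum = 1924
    correct_sub = -437 - 2 * 234          # = -905; the model got -805 (off by +100)
    model_sub = -805

    # The two errors are off by 100 in opposite directions, so the totals agree:
    assert model_sum + model_sub == correct_sum + correct_sub == 1119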






