(评论)
(comments)

原始链接: https://news.ycombinator.com/item?id=43624111

Hacker News 上的一个帖子讨论了一篇论文,探讨强化学习 (RL) 是否能够超越数学和编程任务,扩展到大型语言模型 (LLM) 的更广泛应用。首条评论提到了 DeepSeek 论文,并指出了一个挑战:RL 模型可能会“欺骗”奖励系统,使得难以有效地训练它们以用于更广泛的应用。评论者认为 DeepSeek 团队发现神经奖励模型容易受到奖励作弊的影响,从而导致训练流程复杂化。后续评论表达了对此问题的沮丧之情,指出如果可以跟踪 LLM 中的中间步骤,那么奖励作弊将更容易被检测到。评论者质疑是否有替代方案可以避免这个问题,而不必为每种情况都构建一个“完美的评判者”。总的基调表明,虽然将 RL 扩展到更多样化的 LLM 任务是可取的,但防止奖励作弊仍然是一个巨大的障碍。


原文
Hacker News new | past | comments | ask | show | jobs | submit login
Can reinforcement learning for LLMs scale beyond math and coding tasks? Probably (arxiv.org)
6 points by GabrielBianconi 46 minutes ago | hide | past | favorite | 2 comments










From the DeepSeek paper, they did try but found that the model would learn to cheat the judge. It doesn't seem to be impossible, but probably a serious challenge.

> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.



To me, this is one of the most frustrating parts of this type of ML. If we could actually track the steps taken in the LLM, it would be trivial for the judge to evaluate the output of each intermediate and detect when reward hacking is taking place.

I wonder if there's any alternative other than trying to build the perfect judge for every single test case.







Join us for AI Startup School this June 16-17 in San Francisco!


Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact



Search:
联系我们 contact @ memedata.com