Can reinforcement learning for LLMs scale beyond math and coding tasks? Probably

oersted · 2025-04-08T17:39:46 1744133986

From the DeepSeek paper, they did try but found that the model would learn to cheat the judge. It doesn't seem to be impossible, but probably a serious challenge.

> We do not apply the outcome or process neural reward model in developing DeepSeek-R1-Zero, because we find that the neural reward model may suffer from reward hacking in the large-scale reinforcement learning process, and retraining the reward model needs additional training resources and it complicates the whole training pipeline.

devmor · 2025-04-08T17:58:04 1744135084

To me, this is one of the most frustrating parts of this type of ML. If we could actually track the steps taken in the LLM, it would be trivial for the judge to evaluate the output of each intermediate and detect when reward hacking is taking place.

I wonder if there's any alternative other than trying to build the perfect judge for every single test case.

（评论） (comments)

（评论）
(comments)