Reinforcement Learning from Human Feedback

Original link: https://rlhfbook.com/

Nathan Lambert's book *Reinforcement Learning from Human Feedback* (RLHF) is an ongoing project, with major updates throughout 2025 and into early 2026. The book was initially completed in April 2025 (v0) and has since been substantially revised in response to editorial feedback; as of January 2026, its chapters have been reorganized to follow the structure of a Manning book. Major additions include chapters on tool use and Direct Preference Optimization (DPO), along with improvements to the sections on reasoning, policy gradients, and Proximal Policy Optimization (PPO). The book also discusses applications of RLHF in product development and incorporates recent research. Lambert thanks Costa Huang, Claude, and others for their contributions, along with the broader community of researchers and GitHub contributors. The book is available to read online at [https://rlhfbook.com](https://rlhfbook.com) and is cited as Lambert, 2025.


Original Text
RLHF Book by Nathan Lambert

Changelog

Last built: 07 February 2026

January 2026: Major chapter reorganization to match Manning book structure. Old URLs redirect to new locations.

December 2025: Working on v2 of the book based on editors' feedback! Do check back for updates!

2 July 2025: Add tool use chapter (see PR)

6 June 2025: v1.1. Lots of RLVR/reasoning improvements (see PR)

14-16 Apr. 2025: Finish v0. Overoptimization, open questions, etc.

6-12 Apr. 2025: Evaluation section

28 Mar. - 5 Apr. 2025: Research on RLHF x Product, cleaning, improving website, reasoning section

17-27 Mar. 2025: Improving policy gradient section, minor changes

6-16 Mar. 2025: Finish DPO, major cleaning

26 Feb. - 5 Mar. 2025: Start DPO chapter, improve intro

20-25 Feb. 2025: Improve SEO, add IFT chapter, minor edits

10-15 Feb. 2025: RM additions, preference data, cleaning, policy gradient finalization

8 Feb. 2025: RM additions, editing, cleaning

4 Feb. 2025: PPO and GAE

2 Feb. 2025: Added changelog, revamped introduction

Acknowledgements

I would like to thank the following people who helped me directly with this project: Costa Huang (and, of course, Claude). Indirect shout-outs go to Ross Taylor, Hamish Ivison, John Schulman, Valentina Pyatkin, Daniel Han, Shane Gu, Joanne Jang, LJ Miranda, and others in my RL sphere.

Additionally, thank you to the contributors on GitHub who helped improve this project.


Citation

@book{rlhf2024,
  author = {Nathan Lambert},
  title = {Reinforcement Learning from Human Feedback},
  year = {2025},
  publisher = {Online},
  url = {https://rlhfbook.com}
}

© 2024-2025 Nathan Lambert
