Original text
Original link: https://news.ycombinator.com/item?id=43627354
Hacker News users are discussing Google's new "Deep Research" feature in Gemini 2.5 Pro Experimental. User doctoboggan is skeptical of the human preference ratings used to train large language models, worried that they encourage sycophantic responses rather than factual accuracy. Other users compared Gemini's research capabilities with ChatGPT's. User infecto found that ChatGPT "sounds more erudite" and is better at playing a role. Another user, nico, ran a research test on distant relatives and found ChatGPT noticeably better at finding sources, connecting information, and structuring the research. In contrast, Jeffbee described a positive experience: Gemini initially gave an unhelpful answer but then produced an impressive research report, with 120 sources, on a peculiar topological feature of Azusa, though the text was somewhat verbose. Overall, the discussion shows mixed experiences with different AI research tools and highlights the ongoing debate about their strengths and weaknesses.
Are these raters experts in the field the report was written on? Did they rate the reports on factuality, broadness, and insights?
These sorts of tests (and RLHF in general) are the reason that LLMs often respond with "Great question, you are exactly right to wonder..." or "Interesting insight, I agree that...". I do not want this obsequious behavior, I want "correct answers"[0]. We need some better benchmarks when it comes to human preference.
[0]: I know there are no objectively correct answers for some questions.