The Illusion of Thinking: Strengths and Limitations of Reasoning Models

Original link: https://machinelearning.apple.com/research/illusion-of-thinking

This study uses controllable puzzle environments to compare the reasoning capabilities of Large Reasoning Models (LRMs) and standard Large Language Models (LLMs). Rather than looking only at final-answer accuracy, the analysis examines the models' internal reasoning traces. Key findings reveal a "complexity collapse," in which LRM accuracy drops sharply beyond a certain level, and a counterintuitive scaling limit: reasoning effort declines as problem complexity increases, even when the token budget is ample. Comparing LRMs and LLMs at similar compute reveals three performance regimes: LLMs outperform LRMs on simple puzzles; LRMs show an advantage on medium-complexity puzzles; and both fail on high-complexity puzzles. The study also highlights LRMs' limitations in exact computation, inconsistent reasoning, and failure to execute explicit algorithms. By analyzing reasoning traces and solution-exploration patterns, this work offers valuable insight into the strengths and weaknesses of LRMs, calls their genuine reasoning abilities into question, and raises important considerations for future development.

A Hacker News thread discusses this Apple research paper on the limitations of AI reasoning models. Users share experiences of models failing at simple reasoning tasks, such as understanding pickleball strategy, despite having access to the relevant information. Some suggest incorporating earlier symbolic-reasoning work (such as Prolog) to improve comprehension; others point to a previous related Hacker News discussion. The paper shows that current models struggle with complex logic even with substantial compute, suggesting that we are further from artificial general intelligence (AGI) than the hype implies. The controllable puzzle environments expose these limitations: models can memorize solutions to simple puzzles but fail on more complex ones. This has led to speculation that large language models (LLMs) may not be broadly applicable beyond certain niches such as coding assistance, and that Apple's apparent lag in generative AI may stem from its focus on quality and user experience, standards that current LLMs often fail to meet.
Related Articles

Original Text

Recent generations of frontier language models have introduced Large Reasoning Models (LRMs) that generate detailed thinking processes before providing answers. While these models demonstrate improved performance on reasoning benchmarks, their fundamental capabilities, scaling properties, and limitations remain insufficiently understood. Current evaluations primarily focus on established mathematical and coding benchmarks, emphasizing final answer accuracy. However, this evaluation paradigm often suffers from data contamination and does not provide insights into the reasoning traces' structure and quality. In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs "think". Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget. By comparing LRMs with their standard LLM counterparts under equivalent inference compute, we identify three performance regimes: (1) low-complexity tasks where standard models surprisingly outperform LRMs, (2) medium-complexity tasks where additional thinking in LRMs demonstrates advantage, and (3) high-complexity tasks where both models experience complete collapse. We found that LRMs have limitations in exact computation: they fail to use explicit algorithms and reason inconsistently across puzzles. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions and analyzing the models' computational behavior, shedding light on their strengths, limitations, and ultimately raising crucial questions about their true reasoning capabilities.
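The abstract's central methodological idea is a puzzle environment whose compositional complexity can be increased precisely while its logical rules stay fixed. The page does not say which puzzles the paper uses, so the sketch below is purely illustrative: it assumes Tower of Hanoi as an example environment, with the number of disks as the single complexity knob and a verifier that scores any model-proposed move sequence.

```python
# Minimal sketch of a "controllable puzzle environment" in the spirit the
# abstract describes. Tower of Hanoi is an assumed example; the page does
# not name the puzzles actually used in the paper.

from typing import List, Tuple

Move = Tuple[int, int]  # (from_peg, to_peg), pegs indexed 0..2


def initial_state(n_disks: int) -> List[List[int]]:
    """All disks start on peg 0, largest at the bottom. n_disks is the
    single complexity knob: the rules never change, only problem size."""
    return [list(range(n_disks, 0, -1)), [], []]


def is_valid(state: List[List[int]], move: Move) -> bool:
    src, dst = move
    if not state[src]:
        return False  # nothing to move from the source peg
    if state[dst] and state[dst][-1] < state[src][-1]:
        return False  # cannot place a larger disk on a smaller one
    return True


def verify(n_disks: int, moves: List[Move]) -> bool:
    """Check whether a proposed move sequence solves the puzzle, so both
    final answers and intermediate steps can be scored mechanically."""
    state = initial_state(n_disks)
    for move in moves:
        if not is_valid(state, move):
            return False
        state[move[1]].append(state[move[0]].pop())
    return state[2] == list(range(n_disks, 0, -1))


def reference_solution(n_disks: int, src=0, aux=1, dst=2) -> List[Move]:
    """Optimal recursive solution (2^n - 1 moves), usable as ground truth."""
    if n_disks == 0:
        return []
    return (reference_solution(n_disks - 1, src, dst, aux)
            + [(src, dst)]
            + reference_solution(n_disks - 1, aux, src, dst))


if __name__ == "__main__":
    for n in (3, 7, 12):  # increasing compositional complexity
        sol = reference_solution(n)
        print(n, len(sol), verify(n, sol))  # e.g. "3 7 True"
```

Sweeping the disk count while keeping the rules unchanged is what allows accuracy and reasoning-trace length to be plotted against a single, unambiguous complexity axis, the property the evaluation described in the abstract depends on.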

