Reasoning models reason well, until they don't

Original link: https://arxiv.org/abs/2510.22371

## Reasoning Limits of Large Language Models

Recent research shows that while large language models (LLMs), and in particular models fine-tuned for reasoning (LRMs), perform well on complex reasoning tasks, their capabilities are more limited than expected. Despite claims of generalized reasoning ability, this work shows that LRM performance degrades sharply once reasoning problems exceed a certain complexity threshold.

The researchers created a new dataset, DeepRD, designed to scale reasoning difficulty, and found that LRM performance drops abruptly as complexity increases in domains such as graph connectivity and proof planning. This suggests that current benchmarks underestimate these models' true limitations.

The study also connects these findings to real-world data, showing that while LRMs can handle *most* examples drawn from existing knowledge graphs and proof datasets, the more challenging long-tail cases expose substantial failure potential.

Ultimately, the work highlights the near-term utility of LRMs while underscoring the urgent need for new methods for building reasoning models that can genuinely generalize beyond the complexity of their training data.
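The paper's own DeepRD generator is not reproduced here, but the core idea — producing connectivity problems with an unbounded, controllable complexity knob — can be sketched as follows. This is a minimal illustration under stated assumptions: the function names and the single-chain construction are hypothetical, not the paper's method.

```python
import random
from collections import defaultdict, deque

def make_connectivity_problem(path_len, n_distractors, seed=None):
    """Generate a directed graph containing exactly one route from
    node 0 to node `path_len`, plus distractor edges.  `path_len` is
    the complexity knob: longer chains require deeper reasoning."""
    rng = random.Random(seed)
    path_nodes = list(range(path_len + 1))
    distractors = list(range(path_len + 1, path_len + 1 + n_distractors))
    # Guaranteed chain 0 -> 1 -> ... -> path_len.
    edges = [(i, i + 1) for i in range(path_len)]
    # Distractor edges originate only at distractor nodes, so they can
    # never shorten (or bypass) the 0 -> path_len route.
    for u in distractors:
        for _ in range(2):
            v = rng.choice(path_nodes + distractors)
            if v != u:
                edges.append((u, v))
    rng.shuffle(edges)  # hide the chain's ordering in the prompt
    question = (f"In the directed graph with edges {edges}, "
                f"is node {path_len} reachable from node 0?")
    return edges, question

def shortest_path_len(edges, src, dst):
    """BFS ground truth for grading a model's answer."""
    adj = defaultdict(list)
    for u, v in edges:
        adj[u].append(v)
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            return dist[u]
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return None  # unreachable
```

Because `path_len` can be increased without bound while the generator stays the same, a sweep over it traces out exactly the kind of complexity-vs-accuracy curve the paper uses to locate the abrupt performance drop.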



[Submitted on 25 Oct 2025]

By Revanth Rameshkumar and 4 other authors

Abstract: Large language models (LLMs) have shown significant progress in reasoning tasks. However, recent studies show that transformers and LLMs fail catastrophically once reasoning problems exceed modest complexity. We revisit these findings through the lens of large reasoning models (LRMs) -- LLMs fine-tuned with incentives for step-by-step argumentation and self-verification. LRM performance on graph and reasoning benchmarks such as NLGraph seems extraordinary, with some even claiming LRMs are capable of generalized reasoning and innovation in reasoning-intensive fields such as mathematics, physics, medicine, and law. However, by more carefully scaling the complexity of reasoning problems, we show that existing benchmarks actually have limited complexity. We develop a new dataset, the Deep Reasoning Dataset (DeepRD), along with a generative process for producing unlimited examples of scalable complexity. We use this dataset to evaluate model performance on graph connectivity and natural language proof planning. We find that the performance of LRMs drops abruptly at sufficient complexity and does not generalize. We also relate our LRM results to the distributions of the complexities of large, real-world knowledge graphs, interaction graphs, and proof datasets. We find that the majority of real-world examples fall inside the LRMs' success regime, yet the long tails expose substantial failure potential. Our analysis highlights the near-term utility of LRMs while underscoring the need for new methods that generalize beyond the complexity of examples in the training distribution.
From: Revanth Rameshkumar
[v1] Sat, 25 Oct 2025 17:28:38 UTC (7,546 KB)