Show HN: TabPFN-2.5 – SOTA foundation model for tabular data

Original link: https://priorlabs.ai/technical-reports/tabpfn-2-5-model-report

TabPFN-2.5 is the latest generation of tabular foundation model. Building on TabPFN and TabPFNv2, it makes significant advances and substantially pushes forward the field of tabular AI. The model handles datasets with up to 50,000 data points and 2,000 features, a 20× increase in data capacity over its predecessors. TabPFN-2.5 clearly outperforms traditional, tuned tree-based models and matches the accuracy of AutoGluon 1.4 (a complex ensemble tuned for four hours) *without* extensive tuning. A key innovation is the new knowledge-distillation engine, which converts TabPFN-2.5 into faster, easier-to-deploy models such as MLPs or tree ensembles while preserving high accuracy at significantly lower latency. This release strengthens existing applications within the TabPFN ecosystem and offers a powerful, generalizable solution for tabular-data challenges across industries, especially where data is scarce or reliable uncertainty estimates are required.


Original Article

Abstract

The first tabular foundation model, TabPFN, and its successor TabPFNv2 have impacted tabular AI substantially, with dozens of methods building on them and hundreds of applications across different use cases.

This report introduces TabPFN-2.5, the next generation of our tabular foundation model, scaling to 20× the number of data cells handled by TabPFNv2. On industry-standard benchmarks with up to 50,000 data points and 2,000 features, TabPFN-2.5 substantially outperforms tuned tree-based models and matches the accuracy of AutoGluon 1.4, a complex four-hour tuned ensemble that even includes the previous TabPFNv2.

For production use cases, we introduce a new distillation engine that converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment.

This new release will substantially strengthen the performance of the many applications and methods already built on TabPFN.

Figure: TabPFN-2.5 performance on the standard TabArena-lite benchmark, TabPFNv2 classification subset. In a single forward pass, TabPFN-2.5 outperforms every other model and marks a strong leap from TabPFNv2. When fine-tuned on real data, Real-TabPFN-2.5 shows even stronger performance. The horizontal dotted line represents AutoGluon 1.4 in extreme mode tuned for 4 hours, an ensemble of models that includes TabPFNv2.

Introduction

Tabular data is ubiquitous, forming the backbone of decision-making in countless domains, from finance to healthcare. For decades, traditional tabular machine learning—built on gradient-boosted trees, random forests, and linear or additive models—has been the workhorse of applied data science. Yet these methods remain limited: they require extensive dataset-specific tuning, often provide uncalibrated or unreliable uncertainty estimates without significant modification, and lack the generalization and transferability of modern foundation models.

Tabular foundation models (TFMs) offer a new paradigm. They address these limitations by pretraining on large synthetic distributions of tabular tasks and performing inference via in-context learning instead of gradient descent. They are training-free predictors, meta-trained to yield strong calibration without the time-consuming, labor-intensive hyperparameter tuning that gradient-boosted trees require. Their strong generalization makes them particularly attractive for data-scarce domains.
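To make the in-context-learning workflow concrete, here is a minimal sketch that assumes the scikit-learn-style `TabPFNClassifier` interface of the existing `tabpfn` package carries over to TabPFN-2.5: `fit()` only stores the training set as context, and prediction is a single forward pass with no gradient-based training or hyperparameter search.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

from tabpfn import TabPFNClassifier  # pip install tabpfn

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# No hyperparameter tuning: fit() stores the training set as the context,
# and prediction is a single forward pass of the pretrained transformer.
clf = TabPFNClassifier()
clf.fit(X_train, y_train)

proba = clf.predict_proba(X_test)[:, 1]
print("ROC AUC:", roc_auc_score(y_test, proba))
```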

Our initial release, TabPFNv1, served as a proof-of-concept that a transformer could learn a Bayesian-like inference algorithm, though it was limited to small (up to 1,000 samples), clean, numerical-only data. Its successor, TabPFNv2, scaled this idea into a practical model for datasets of up to 10,000 samples. TabPFNv2 handles the messy and heterogeneous data seen in the real world, including categorical features, missing values, and outliers.

What's New in TabPFN-2.5

State-of-the-Art Performance

In a forward pass, TabPFN-2.5 outperforms tuned tree-based models (like XGBoost and CatBoost) and matches the accuracy of AutoGluon 1.4 tuned for 4 hours—a complex ensemble that includes all previous methods, even TabPFNv2.
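As an illustration of what "in a forward pass" means in practice (this is not a reproduction of the benchmark above), the sketch below cross-validates an untuned TabPFN classifier against an untuned gradient-boosted baseline on a small public OpenML dataset; the `tabpfn` class name and the scikit-learn utilities are assumed to apply unchanged to TabPFN-2.5.

```python
from sklearn.datasets import fetch_openml
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.model_selection import cross_val_score

from tabpfn import TabPFNClassifier

# Small binary-classification task from OpenML (requires network access).
X, y = fetch_openml("blood-transfusion-service-center", version=1,
                    return_X_y=True, as_frame=False)
y = (y == y[0]).astype(int)  # encode the string labels as 0/1

for name, model in [("TabPFN (forward pass)", TabPFNClassifier()),
                    ("HistGradientBoosting (untuned)", HistGradientBoostingClassifier())]:
    scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"{name}: mean ROC AUC = {scores.mean():.3f}")
```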

Improved Scalability

We scale the power of in-context learning to datasets of up to 50,000 samples (5× increase over TabPFNv2) and 2,000 features (4× increase), making TFMs viable for a much wider range of real-world problems.

Fast Inference

We've dramatically improved inference latency. Our proprietary distillation engine converts TabPFN-2.5 into a compact MLP or tree ensemble, preserving most of its accuracy while delivering orders-of-magnitude lower latency and plug-and-play deployment.
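The distillation engine itself is proprietary, but the underlying idea can be sketched generically: use the TabPFN teacher's predicted probabilities as soft targets for a compact student model that is cheap to serve. The sketch below uses a scikit-learn MLP as the student purely for illustration; it is not Prior Labs' engine.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from tabpfn import TabPFNClassifier  # teacher

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Teacher: TabPFN prediction is a forward pass over the training context.
teacher = TabPFNClassifier()
teacher.fit(X_train, y_train)
soft_targets = teacher.predict_proba(X_train)[:, 1]

# Student: a compact MLP regressed onto the teacher's class probabilities;
# it is cheap to serve and needs no transformer at inference time.
student = make_pipeline(
    StandardScaler(),
    MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=500, random_state=0),
)
student.fit(X_train, soft_targets)

# Threshold the student's predicted probabilities to get hard labels.
student_pred = (student.predict(X_test) > 0.5).astype(int)
print("Student accuracy:", np.mean(student_pred == y_test))
```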

Read the Full Technical Report ->