arXiv: 2506.16791 · v4 · pith:OHUAB6VQ · submitted 2025-06-20 · cs.LG · cs.AI

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, Frank Hutter

classification cs.LG cs.AI

keywords models learning tabarena tabular benchmark datasets deep living

Abstract

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection
cs.LG 2026-02 accept novelty 8.0

MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.
MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
cs.LG 2026-05 unverdicted novelty 7.0

MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 7.0

TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy
cs.LG 2026-05 unverdicted novelty 7.0

RamanBench unifies 74 datasets into the first large-scale reproducible benchmark for ML on Raman spectra, finding tabular foundation models outperform baselines but no method generalizes across datasets.
Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models
cs.LG 2026-04 unverdicted novelty 7.0

TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.
OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale
cs.LG 2026-04 unverdicted novelty 7.0

OmniTabBench shows no single model family dominates tabular tasks and maps performance advantages to specific dataset properties like size and skewness.
TS-Arena -- A Live Forecast Pre-Registration Platform
cs.LG 2025-12 conditional novelty 7.0

TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.
TabPFN-3: Technical Report
cs.LG 2026-05 unverdicted novelty 6.0

TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
cs.LG 2026-05 unverdicted novelty 6.0

BoostLLM trains sequential PEFT adapters as weak learners in a residual process, using decision-tree paths as a second input view, to improve few-shot tabular classification over standard LLM fine-tuning and match or ...
BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification
cs.LG 2026-05 unverdicted novelty 6.0

BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.
TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models
cs.LG 2026-05 unverdicted novelty 6.0

TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.
Tabular foundation models for in-context prediction of molecular properties
cs.LG 2026-04 unverdicted novelty 6.0

Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.
Benchmarking Optimizers for MLPs in Tabular Deep Learning
cs.LG 2026-04 unverdicted novelty 6.0

Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.
TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
cs.LG 2025-11 unverdicted novelty 6.0

TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...
xRFM: Accurate, scalable, and interpretable feature learning models for tabular data
cs.LG 2025-08 unverdicted novelty 6.0

xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression and competitive results on 200 classification datasets versus 3...
TabCF: Distributional Control Function Estimation with Tabular Foundation Models
stat.ML 2026-05 unverdicted novelty 5.0

TabCF is a tuning-light method using tabular foundation models for control function regression to estimate distributional causal effects such as interventional means and quantiles.
Heterogeneous Scientific Foundation Model Collaboration
cs.AI 2026-04 unverdicted novelty 5.0

Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.
Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms
cs.LG 2026-04 unverdicted novelty 4.0

TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear correlations, and mislabeled targets in binary classification.