arXiv: 2506.16791 · v4 · pith:OHUAB6VQ · submitted 2025-06-20 · 💻 cs.LG · cs.AI

TabArena: A Living Benchmark for Machine Learning on Tabular Data

Pith reviewed 2026-05-19 08:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: tabular data · machine learning benchmark · living benchmark · gradient boosted trees · deep learning · model ensembling · tabular foundation models · validation overfitting

The pith

TabArena launches as a continuously updated benchmark showing that ensembles across tabular models set new performance records while exposing overfitting in some deep learning approaches.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes TabArena as the first living, continuously maintained benchmarking system for tabular machine learning to replace outdated static benchmarks. Through a large-scale initial study on curated datasets and models, it demonstrates that cross-model ensembles advance the state of the art, with deep learning methods reaching parity with gradient-boosted trees when given larger time budgets and ensembling. Foundation models perform particularly well on smaller datasets. The work also identifies that certain deep learning models appear overrepresented in top ensembles due to validation-set overfitting and provides a public leaderboard plus maintenance protocols for ongoing updates.

Core claim

TabArena is a living tabular benchmarking system that initializes a public leaderboard through manual curation of representative datasets and well-implemented models, followed by large-scale experiments. These experiments show that ensembles across different models improve upon single-model or same-model results to advance the state of the art in tabular machine learning. Gradient-boosted trees remain strong, deep learning catches up under larger time budgets when ensembled, and foundation models excel on smaller datasets, while some deep models overfit validation sets and become overrepresented in cross-model ensembles.

What carries the argument

TabArena, the continuously maintained tabular benchmarking system with curated datasets, models, validation protocols, ensembling procedures, and maintenance team that enables ongoing leaderboard updates.

If this is right

  • Validation method choice and ensembling of hyperparameter configurations are required to measure any model's full potential on tabular tasks.
  • Cross-model ensembles consistently outperform both individual models and within-model ensembles on the curated collection.
  • Deep learning models require additional safeguards against validation-set overfitting to participate reliably in cross-model ensembles.
  • Foundation models offer a practical advantage specifically on smaller tabular datasets.
  • The public leaderboard and maintenance protocols will allow new models and dataset fixes to be incorporated without resetting the benchmark.
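
The weighted ensembling that Figure 1 attributes to reference [1] (Caruana et al., ensemble selection from libraries of models) can be sketched as greedy forward selection with replacement on validation predictions. This is a minimal illustration under stated assumptions, not TabArena's implementation: the function name `greedy_ensemble_selection`, the mean-squared-error loss, and the toy model names and predictions are all invented for the example.

```python
from typing import Callable, Dict, List, Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error; lower is better."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def greedy_ensemble_selection(
    val_preds: Dict[str, List[float]],
    y_val: List[float],
    n_rounds: int = 25,
    loss: Callable[[Sequence[float], Sequence[float]], float] = mse,
) -> Dict[str, float]:
    """Greedy forward selection with replacement over a model library.

    Each round adds the model whose inclusion gives the lowest validation
    loss for the running average; final weights are selection frequencies.
    """
    names = sorted(val_preds)  # fixed iteration order for reproducibility
    counts = {name: 0 for name in names}
    running_sum = [0.0] * len(y_val)
    for round_idx in range(1, n_rounds + 1):
        best_name, best_loss = None, float("inf")
        for name in names:
            candidate = [(s + p) / round_idx
                         for s, p in zip(running_sum, val_preds[name])]
            candidate_loss = loss(candidate, y_val)
            if candidate_loss < best_loss:
                best_name, best_loss = name, candidate_loss
        counts[best_name] += 1
        running_sum = [s + p for s, p in zip(running_sum, val_preds[best_name])]
    return {name: c / n_rounds for name, c in counts.items() if c > 0}


# Toy library: model names and predictions are made up for illustration.
y_val = [0.0, 1.0, 2.0, 3.0]
val_preds = {
    "gbdt": [0.5, 1.5, 2.5, 3.5],   # biased upward
    "mlp": [-0.5, 0.5, 1.5, 2.5],   # biased downward
    "knn": [3.0, 3.0, 3.0, 3.0],    # poor model, should get no weight
}
weights = greedy_ensemble_selection(val_preds, y_val, n_rounds=10)
print(weights)
```

In this toy case the two models with opposite biases cancel each other, so the cross-model ensemble beats either one alone. Because the weights are fitted on validation predictions, a model whose validation scores are optimistic can receive more weight than its test performance justifies, which is the overrepresentation effect described above.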

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Sustained maintenance of TabArena could establish a de facto standard reference for tracking long-term progress in tabular machine learning similar to established benchmarks in other domains.
  • Model developers should prioritize techniques that reduce validation overfitting so their contributions remain effective when combined with other models.
  • The observed performance patterns suggest that future work may benefit from exploring adaptive ensembles that weight tree-based, deep, and foundation models according to dataset size and characteristics.
  • If new data distributions emerge over time, the living nature of the benchmark will make it possible to detect and quantify shifts in which modeling approaches remain effective.
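
One generic way validation scores become optimistic is selection bias: when many configurations are compared on the same validation set, the winner's validation score overstates its test score even if all configurations are equally good. The simulation below is a toy sketch of that effect only. It is not the paper's analysis, and the shared true score, the Gaussian noise model, and the function name `selection_gap` are assumptions made for illustration.

```python
import random
from statistics import mean


def selection_gap(n_configs: int, noise_sd: float, n_trials: int, seed: int = 0) -> float:
    """Average (validation - test) score of the configuration picked by validation score.

    All configurations share the same true score, so any positive gap is
    pure selection bias rather than a real difference between models.
    """
    rng = random.Random(seed)
    true_score = 0.80  # hypothetical accuracy shared by every configuration
    gaps = []
    for _ in range(n_trials):
        val = [true_score + rng.gauss(0.0, noise_sd) for _ in range(n_configs)]
        test = [true_score + rng.gauss(0.0, noise_sd) for _ in range(n_configs)]
        best = max(range(n_configs), key=val.__getitem__)  # pick by validation only
        gaps.append(val[best] - test[best])
    return mean(gaps)


gap_small = selection_gap(n_configs=2, noise_sd=0.02, n_trials=2000)
gap_large = selection_gap(n_configs=200, noise_sd=0.02, n_trials=2000)
print(f"2 configs: {gap_small:.4f}, 200 configs: {gap_large:.4f}")
```

The gap grows with the number of configurations searched, which is one reason a held-out test set, rather than the validation score used for selection, is needed to compare tuned or ensembled models.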

Load-bearing premise

The manually selected collection of datasets accurately represents the practical tabular problems that arise in real applications.

What would settle it

A follow-up study on an independent collection of real-world tabular datasets that produces different performance rankings, particularly reversing the observed benefits of cross-model ensembling or the relative standing of deep learning versus gradient-boosted trees, would falsify the generalizability of TabArena's initial findings.
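
Agreement between the original rankings and those from an independent dataset collection could be quantified with a rank correlation. The sketch below computes Kendall's tau for two tie-free rankings; the model names and ranks are invented for illustration and do not come from the TabArena leaderboard.

```python
from itertools import combinations
from typing import Dict


def kendall_tau(rank_a: Dict[str, int], rank_b: Dict[str, int]) -> float:
    """Kendall rank correlation (tau-a) for two tie-free rankings of the same models."""
    models = sorted(rank_a)
    if set(models) != set(rank_b):
        raise ValueError("both rankings must cover the same models")
    concordant = discordant = 0
    for m1, m2 in combinations(models, 2):
        # Positive product: both rankings order the pair the same way.
        sign = (rank_a[m1] - rank_a[m2]) * (rank_b[m1] - rank_b[m2])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical rankings (1 = best); names and ranks are invented.
original = {"cross_model_ensemble": 1, "gbdt": 2, "deep_model": 3, "linear": 4}
replication = {"cross_model_ensemble": 2, "gbdt": 1, "deep_model": 3, "linear": 4}
tau = kendall_tau(original, replication)
print(round(tau, 3))
```

A value near 1 would indicate that the rankings replicate, while a value near 0 or below would indicate the kind of reversal described above. With tied ranks, a tie-corrected variant would be the safer choice.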

Figures

Figures reproduced from arXiv: 2506.16791 by Andrej Tschalzev, David Holzmüller, David Salinas, Frank Hutter, Lennart Purucker, Nick Erickson, Prateek Mutalik Desai.

Figure 1: TabArena-v0.1 Leaderboard. We evaluate models under default parameters, tuning, and weighted ensembling [1] of hyperparameters. Since TabICL and TabPFNv2 are not applicable to all datasets, we evaluate them on subsets of the benchmark in …
Figure 2: Data Curation Results. The figure shows why and how many datasets we filter based on our criteria. We filter datasets that are duplicates, not from a tabular domain, not a real predictive task, tiny, have quality or license issues, and are not IID.
Figure 3: Characteristics of Datasets in TabArena. On the left, we show the number of datasets per task type, license, source of the dataset, and age group. On the right, we show the number of features (columns) and samples (rows), as well as the percentage of categorical features per dataset.
Figure 4: Leaderboard for TabPFNv2-compatible (left) and TabICL-compatible (right) datasets. For TabPFNv2, we obtain 33 datasets (≤ 10K training samples, ≤ 500 features). For TabICL, we obtain 36 classification datasets (≤ 100K, ≤ 500). Everything but the datasets is identical to …
Figure 5: (Left) Pareto front of improvability and inference time. We report the median inference time per 1000 samples across all datasets. (Right) Improvability tuning trajectories. Time is shown as the tuning time with points from left to right marking ensembles of increasing numbers of random configurations (1, 2, 5, 10, 25, 50, 100, 150, 201). The trajectories are sampled 20 times from all trials and averaged. …
Figure 6: (Left) Model Efficiency. Median training times with cross-validation across TabPFNv2- and TabICL-compatible datasets (see …
Figure 7: (Left) TabArena Ensemble. Simulated performance of an ensemble using all models in TabArena compared to the leaderboard from …
Original abstract

With the growing popularity of deep learning and foundation models for tabular data, the need for standardized and reliable benchmarks is higher than ever. However, current benchmarks are static. Their design is not updated even if flaws are discovered, model versions are updated, or new models are released. To address this, we introduce TabArena, the first continuously maintained living tabular benchmarking system. To launch TabArena, we manually curate a representative collection of datasets and well-implemented models, conduct a large-scale benchmarking study to initialize a public leaderboard, and assemble a team of experienced maintainers. Our results highlight the influence of validation method and ensembling of hyperparameter configurations to benchmark models at their full potential. While gradient-boosted trees are still strong contenders on practical tabular datasets, we observe that deep learning methods have caught up under larger time budgets with ensembling. At the same time, foundation models excel on smaller datasets. Finally, we show that ensembles across models advance the state-of-the-art in tabular machine learning. We observe that some deep learning models are overrepresented in cross-model ensembles due to validation set overfitting, and we encourage model developers to address this issue. We launch TabArena with a public leaderboard, reproducible code, and maintenance protocols to create a living benchmark available at https://tabarena.ai.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces TabArena as the first continuously maintained living benchmark for tabular machine learning. It describes the manual curation of a representative collection of datasets and well-implemented models, a large-scale benchmarking study that initializes a public leaderboard, and the assembly of a maintenance team. Key empirical observations include the continued competitiveness of gradient-boosted trees on practical datasets, the ability of deep learning methods to match or exceed them under larger time budgets with ensembling, the strength of foundation models on smaller datasets, and the finding that cross-model ensembles advance the state-of-the-art while some deep learning models appear overrepresented in such ensembles due to validation-set overfitting.

Significance. A successfully maintained living benchmark would address a genuine gap in tabular ML evaluation, where static benchmarks rapidly obsolesce. The reported findings on ensembling practices and validation overfitting could usefully guide model developers if the underlying dataset collection is shown to be representative; the provision of reproducible code and maintenance protocols is a concrete strength that supports ongoing community use.

major comments (2)
  1. [Dataset curation section] The claim that the manually curated collection is 'representative' of practical tabular problems is not supported by any quantitative comparison (e.g., Kolmogorov-Smirnov or Earth-mover distance on instance count, feature dimensionality, class imbalance, or domain coverage) against reference corpora such as OpenML or Kaggle. This assumption is load-bearing for the generalization of the ensemble SOTA advancement and the overfitting observations.
  2. [Benchmarking study and results sections] The abstract and text reference the influence of validation method and ensembling of hyperparameter configurations, yet the manuscript provides insufficient detail on exact dataset selection criteria, validation protocols, and statistical significance testing for the performance claims (e.g., no reported p-values or confidence intervals for cross-model ensemble gains).
minor comments (2)
  1. [Abstract and maintenance protocols] The abstract states that 'we assemble a team of experienced maintainers' but the main text does not elaborate on their specific roles or expertise, which would strengthen the claim of long-term sustainability.
  2. [Leaderboard presentation] Leaderboard tables or figures should explicitly report the number of runs, random seeds, and any multiple-testing corrections to allow readers to assess the reliability of the reported rankings.
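
The distributional comparison requested in major comment 1 could start from the two-sample Kolmogorov-Smirnov statistic, which is the largest gap between two empirical CDFs. The sketch below is a plain-Python illustration; the row counts are invented and are not taken from TabArena, OpenML, or Kaggle.

```python
from bisect import bisect_right
from typing import Sequence


def ks_statistic(sample_a: Sequence[float], sample_b: Sequence[float]) -> float:
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    for x in a + b:  # the maximum gap is attained at an observed value
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


# Invented row counts for illustration only.
curated_rows = [800, 2_000, 5_000, 12_000, 40_000, 150_000]
reference_rows = [150, 300, 700, 1_000, 3_000, 9_000, 20_000, 60_000]
ks = ks_statistic(curated_rows, reference_rows)
print(ks)
```

The statistic depends only on the ordering of values, so it is unchanged by a log transform of the row counts. Turning it into a p-value needs the sampling distribution as well; in practice a library routine such as `scipy.stats.ks_2samp` would likely be used for that step.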

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript introducing TabArena. The comments highlight opportunities to strengthen the justification for our dataset collection and to improve transparency in the benchmarking methodology. We address each major comment below and will incorporate revisions to enhance the paper accordingly.

Point-by-point responses
  1. Referee: [Dataset curation section] The claim that the manually curated collection is 'representative' of practical tabular problems is not supported by any quantitative comparison (e.g., Kolmogorov-Smirnov or Earth-mover distance on instance count, feature dimensionality, class imbalance, or domain coverage) against reference corpora such as OpenML or Kaggle. This assumption is load-bearing for the generalization of the ensemble SOTA advancement and the overfitting observations.

    Authors: We agree that the current manuscript does not include quantitative statistical comparisons to support the representativeness claim. Our curation process relied on expert-driven selection to achieve diversity across dataset sizes, feature dimensionalities, domains, and imbalance levels, informed by practical tabular ML use cases. To address this point directly, the revised manuscript will add a dedicated analysis (likely in an appendix or new subsection) that compares key dataset statistics from our collection against reference corpora such as OpenML and Kaggle, using metrics including Kolmogorov-Smirnov tests and Earth Mover's Distance on instance counts, feature counts, and class imbalance ratios. This addition will provide quantitative grounding for the generalization of our findings on cross-model ensembles and validation overfitting.
    revision: yes

  2. Referee: [Benchmarking study and results sections] The abstract and text reference the influence of validation method and ensembling of hyperparameter configurations, yet the manuscript provides insufficient detail on exact dataset selection criteria, validation protocols, and statistical significance testing for the performance claims (e.g., no reported p-values or confidence intervals for cross-model ensemble gains).

    Authors: We concur that additional methodological detail is warranted for reproducibility and to support the performance claims. The revised manuscript will expand the relevant sections to explicitly state the dataset selection criteria (including any quantitative thresholds for size, quality, and diversity), provide precise descriptions of the validation protocols (such as the specific train/validation/test splitting strategies and hyperparameter ensembling procedures), and include statistical significance testing. In particular, we will report p-values and confidence intervals for the gains achieved by cross-model ensembles relative to individual models. These clarifications will be added without altering the core empirical results.
    revision: yes
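
A confidence interval for cross-model ensemble gains, as promised in the response above, could be obtained with a paired percentile bootstrap that resamples datasets. This is one possible procedure, sketched under assumptions: the simulated rebuttal does not say which test the authors would use, and the per-dataset gains below are invented.

```python
import random
from statistics import mean
from typing import List, Tuple


def paired_bootstrap_ci(
    gains: List[float], n_resamples: int = 5000, alpha: float = 0.05, seed: int = 0
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the mean per-dataset gain, resampling datasets."""
    rng = random.Random(seed)
    n = len(gains)
    # Each resample draws n datasets with replacement and records the mean gain.
    means = sorted(mean(rng.choices(gains, k=n)) for _ in range(n_resamples))
    lo_idx = int((alpha / 2) * n_resamples)
    hi_idx = int((1 - alpha / 2) * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


# Invented per-dataset gains in percentage points: best single model error
# minus ensemble error, so positive means the ensemble was better.
gains = [0.4, 1.1, -0.2, 0.9, 0.3, 0.7, 1.5, 0.0, 0.6, 0.8]
low, high = paired_bootstrap_ci(gains)
print(f"mean gain {mean(gains):.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Resampling datasets rather than individual predictions treats each dataset as one paired observation, which matches a leaderboard where every model is run on the same collection.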

Circularity Check

0 steps flagged

No significant circularity; empirical results on external datasets

full rationale

The paper reports outcomes from large-scale empirical runs of models on a manually curated collection of tabular datasets. All central claims—ensemble gains advancing SOTA, overrepresentation of certain deep learning models due to validation-set overfitting, and relative strengths of gradient-boosted trees versus deep learning under different budgets—are direct observations from these external benchmark executions rather than quantities derived from the paper's own equations or self-referential definitions. No load-bearing step reduces by construction to fitted parameters, self-citations, or ansatzes introduced in prior work by the same authors. The manual curation and living-benchmark protocols are descriptive and do not enter the performance measurements as circular inputs. This is a standard empirical benchmarking study whose results remain falsifiable against independent dataset collections.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The central claims rest on the representativeness of manually chosen datasets and models plus the assumption that the observed performance patterns generalize beyond the initial study.

free parameters (2)
  • Dataset curation choices
    Manually selected collection of representative datasets used to initialize the leaderboard.
  • Model implementation selections
    Well-implemented models chosen for the large-scale benchmarking study.
axioms (1)
  • domain assumption: The curated datasets represent practical tabular data problems.
    Invoked to support conclusions about model performance and ensembling benefits.
invented entities (1)
  • TabArena (no independent evidence)
    purpose: Continuously maintained living tabular benchmarking system with public leaderboard and maintenance protocols.
    New system introduced to address static benchmark limitations.

pith-pipeline@v0.9.0 · 5783 in / 1389 out tokens · 43741 ms · 2026-05-19T08:38:16.631863+00:00 · methodology


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MacrOData: New Benchmarks of Thousands of Datasets for Tabular Outlier Detection

    cs.LG 2026-02 accept novelty 8.0

    MacrOData supplies three large, curated benchmark suites totaling 2,446 datasets for tabular outlier detection, complete with standardized splits, metadata, and a public leaderboard.

  2. MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image

    cs.LG 2026-05 unverdicted novelty 7.0

    MulTaBench is a new collection of 40 image-tabular and text-tabular datasets designed to test target-aware representation tuning in multimodal tabular models.

  3. TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 7.0

    TFM-Retouche is an architecture-agnostic input-space residual adapter that improves tabular foundation model accuracy on 51 datasets by learning input corrections through the frozen backbone, with an identity guard to...

  4. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  5. RamanBench: A Large-Scale Benchmark for Machine Learning on Raman Spectroscopy

    cs.LG 2026-05 unverdicted novelty 7.0

    RamanBench unifies 74 datasets into the first large-scale reproducible benchmark for ML on Raman spectra, finding tabular foundation models outperform baselines but no method generalizes across datasets.

  6. Selecting Feature Interactions for Generalized Additive Models by Distilling Foundation Models

    cs.LG 2026-04 unverdicted novelty 7.0

    TabDistill distills feature interactions from tabular foundation models via post-hoc attribution and inserts them into GAMs, yielding consistent predictive gains.

  7. OmniTabBench: Mapping the Empirical Frontiers of GBDTs, Neural Networks, and Foundation Models for Tabular Data at Scale

    cs.LG 2026-04 unverdicted novelty 7.0

    OmniTabBench shows no single model family dominates tabular tasks and maps performance advantages to specific dataset properties like size and skewness.

  8. TS-Arena -- A Live Forecast Pre-Registration Platform

    cs.LG 2025-12 conditional novelty 7.0

    TS-Arena is a live pre-registration platform that evaluates time series forecasts on future data streams to eliminate information leakage.

  9. TabPFN-3: Technical Report

    cs.LG 2026-05 unverdicted novelty 6.0

    TabPFN-3 delivers state-of-the-art tabular prediction performance on benchmarks up to 1M rows, is up to 20x faster than prior versions, and introduces test-time scaling that beats non-TabPFN models by hundreds of Elo points.

  10. BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    BoostLLM trains sequential PEFT adapters as weak learners in a residual process, using decision-tree paths as a second input view, to improve few-shot tabular classification over standard LLM fine-tuning and match or ...

  11. BoostLLM: Boosting-inspired LLM Fine-tuning for Few-shot Tabular Classification

    cs.LG 2026-05 unverdicted novelty 6.0

    BoostLLM trains sequential PEFT adapters in a boosting framework with tree path inputs to improve LLM performance on few-shot tabular classification, matching or exceeding XGBoost.

  12. TFM-Retouche: A Lightweight Input-Space Adapter for Tabular Foundation Models

    cs.LG 2026-05 unverdicted novelty 6.0

    TFM-Retouche is an input-space residual adapter that lifts TabICLv2 performance by 56 Elo points on 51 tabular datasets while remaining architecture-agnostic and computationally light.

  13. Tabular foundation models for in-context prediction of molecular properties

    cs.LG 2026-04 unverdicted novelty 6.0

    Tabular foundation models achieve high accuracy in molecular property prediction through in-context learning, with up to 100% win rates on MoleculeACE tasks when paired with CheMeleon embeddings.

  14. Benchmarking Optimizers for MLPs in Tabular Deep Learning

    cs.LG 2026-04 unverdicted novelty 6.0

    Muon optimizer outperforms AdamW across 17 tabular datasets when training MLPs under a shared protocol.

  15. TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

    cs.LG 2025-11 unverdicted novelty 6.0

    TabPFN-2.5 scales tabular foundation models to 20x larger datasets, outperforms tuned tree models on TabArena, achieves near-perfect win rates against default XGBoost, and adds a distillation engine for fast productio...

  16. xRFM: Accurate, scalable, and interpretable feature learning models for tabular data

    cs.LG 2025-08 unverdicted novelty 6.0

    xRFM merges kernel-based feature learning with tree structures for scalable, interpretable tabular modeling and reports top performance on 100 regression and competitive results on 200 classification datasets versus 3...

  17. TabCF: Distributional Control Function Estimation with Tabular Foundation Models

    stat.ML 2026-05 unverdicted novelty 5.0

    TabCF is a tuning-light method using tabular foundation models for control function regression to estimate distributional causal effects such as interventional means and quantiles.

  18. Heterogeneous Scientific Foundation Model Collaboration

    cs.AI 2026-04 unverdicted novelty 5.0

    Eywa enables language-based agentic AI systems to collaborate with specialized scientific foundation models for improved performance on structured data tasks.

  19. Noise Immunity in In-Context Tabular Learning: An Empirical Robustness Analysis of TabPFN's Attention Mechanisms

    cs.LG 2026-04 unverdicted novelty 4.0

    TabPFN maintains high ROC-AUC and structured attention under controlled additions of irrelevant features, nonlinear correlations, and mislabeled targets in binary classification.

Reference graph

Works this paper leans on

141 extracted references · 141 canonical work pages · cited by 17 Pith papers · 2 internal anchors

  1. [1]

    Ensemble selection from libraries of models

    R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes. Ensemble selection from libraries of models. In R. Greiner, editor, Proceedings of the 21st International Conference on Machine Learning (ICML’04). Omnipress, 2004

  2. [2]

    Deep neural networks and tabular data: A survey

    V. Borisov, T. Leemann, K. Seßler, J. Haug, M. Pawelczyk, and G. Kasneci. Deep neural networks and tabular data: A survey. IEEE Transactions on Neural Networks and Learning Systems, 2022

  3. [3]

    Why tabular foundation models should be a research priority

    B. van Breugel and M. van der Schaar. Why tabular foundation models should be a research priority. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, Proceedings of the 41st International Conference on Machine Learning (ICML’24), volume 251 of Proceedings of Machine Learning Research. PMLR, 2024

  4. [4]

    Position: Why we must rethink empirical research in machine learning

    M. Herrmann, F. Lange, K. Eggensperger, G. Casalicchio, M. Wever, M. Feurer, D. Rügamer, E. Hüllermeier, A.-L. Boulesteix, and B. Bischl. Position: Why we must rethink empirical research in machine learning. In R. Salakhutdinov, Z. Kolter, K. Heller, A. Weller, N. Oliver, J. Scarlett, and F. Berkenkamp, editors, Proceedings of the 41st International Con...

  5. [5]

    Accurate predictions on small data with a tabular foundation model

    N. Hollmann, S. Müller, L. Purucker, A. Krishnakumar, M. Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637(8045):319–326, 2025

  6. [6]

    Representation learning for tabular data: A comprehensive survey

    Jun-Peng Jiang, Si-Yang Liu, Hao-Run Cai, Qile Zhou, and Han-Jia Ye. Representation learning for tabular data: A comprehensive survey. arXiv preprint arXiv:2504.16109, 2025

  7. [7]

    Towards quantifying the effect of datasets for benchmarking: A look at tabular machine learning

    R. Kohli, M. Feurer, B. Bischl, K. Eggensperger, and F. Hutter. Towards quantifying the effect of datasets for benchmarking: A look at tabular machine learning. In Data-centric Machine Learning Research Workshop at the International Conference on Learning Representations, 2024

  8. [8]

    A data-centric perspective on evaluating machine learning models for tabular data

    Andrej Tschalzev, Sascha Marton, Stefan Lüdtke, Christian Bartelt, and Heiner Stuckenschmidt. A data-centric perspective on evaluating machine learning models for tabular data. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024

  9. [9]

    Tabm: Advancing tabular deep learning with parameter-efficient ensembling

    Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. Tabm: Advancing tabular deep learning with parameter-efficient ensembling. arXiv preprint arXiv:2410.24210, 2024

  10. [10]

    Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks

    Ivan Rubachev, Nikolay Kartashev, Yury Gorishniy, and Artem Babenko. Tabred: Analyzing pitfalls and filling the gaps in tabular deep learning benchmarks. arXiv preprint arXiv:2406.19380, 2024

  11. [11]

    Unreflected use of tabular data repositories can undermine research quality

    Andrej Tschalzev, Lennart Purucker, Stefan Lüdtke, Frank Hutter, Christian Bartelt, and Heiner Stuckenschmidt. Unreflected use of tabular data repositories can undermine research quality. In The Future of Machine Learning Data Practices and Repositories at ICLR 2025, 2025

  12. [12]

    L. Breiman. Random forests. 45:5–32, 2001

  13. [13]

    Extremely randomized trees

    P. Geurts, D. Ernst, and L. Wehenkel. Extremely randomized trees. 63(1):3–42, 2006

  14. [14]

    XGBoost: A scalable tree boosting system

    T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In B. Krishnapuram, M. Shah, A. Smola, C. Aggarwal, D. Shen, and R. Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’16), pages 785–794, 2016

  15. [15]

    G. Ke, Q. Meng, T. Finley, T. Wang, W. Chen, W. Ma, Q. Ye, and T.-Y. Liu. Lightgbm: A highly efficient gradient boosting decision tree. In I. Guyon, U. von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS’17), 2017

  16. [16]

    Catboost: Unbiased boosting with categorical features

    L. Prokhorenkova, G. Gusev, A. Vorobev, A. Dorogush, and A. Gulin. Catboost: Unbiased boosting with categorical features. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Proceedings of the 31st International Conference on Advances in Neural Information Processing Systems (NeurIPS’18), page 6639–6649, 2018

  17. [17]

    Accurate intelligible models with pairwise interactions

    Yin Lou, Rich Caruana, Johannes Gehrke, and Giles Hooker. Accurate intelligible models with pairwise interactions. In Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 623–631, 2013

  18. [18]

    Interpretml: A unified framework for machine learning interpretability

    Harsha Nori, Samuel Jenkins, Paul Koch, and Rich Caruana. Interpretml: A unified framework for machine learning interpretability. arXiv preprint arXiv:1909.09223, 2019

  19. [19]

    AutoGluon-Tabular: Robust and Accurate AutoML for Structured Data

    N. Erickson, J. Mueller, A. Shirkov, H. Zhang, P. Larroy, M. Li, and A. Smola. Autogluon-tabular: Robust and accurate automl for structured data. arXiv:2003.06505 [stat.ML], 2020

  20. [20]

    Better by default: Strong pre-tuned mlps and boosted trees on tabular data

    David Holzmüller, Léo Grinsztajn, and Ingo Steinwart. Better by default: Strong pre-tuned mlps and boosted trees on tabular data. Advances in Neural Information Processing Systems, 37:26577–26658, 2024

  21. [21]

    Modern neighborhood components analysis: A deep tabular baseline two decades later

    Han-Jia Ye, Huai-Hong Yin, and De-Chuan Zhan. Modern neighborhood components analysis: A deep tabular baseline two decades later. arXiv preprint arXiv:2407.03257, 2024

  22. [22]

    Tabicl: A tabular foundation model for in-context learning on large data

    Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabicl: A tabular foundation model for in-context learning on large data. arXiv preprint arXiv:2502.05564, 2025

  23. [23]

    Tabdpt: Scaling tabular foundation models on real data

    Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims Volkovs, and Anthony L Caterini. Tabdpt: Scaling tabular foundation models. arXiv preprint arXiv:2410.18164, 2024

  24. [24]

    Scikit-learn: Machine learning in Python

    F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. 12:2825–2830, 2011

  25. [25]

    2023 kaggle ai report

    Bojan Tunguz, Dieter, Heads or Tails, Karnika Kapoor, Parul Pandey, Paul Mooney, Phil Culliton, Rob Mulla, Sanyam Bhutani, and Will Cukierski. 2023 kaggle ai report, 2023. URL https://kaggle.com/competitions/2023-kaggle-ai-report

  26. [26]

    Carte: Pretraining and transfer for tabular learning

    Myung Jun Kim, Leo Grinsztajn, and Gael Varoquaux. Carte: Pretraining and transfer for tabular learning. In International Conference on Machine Learning, pages 23843–23866. PMLR, 2024

  27. [27]

    Neural network ensembles, cross validation, and active learning

    Anders Krogh and Jesper Vedelsby. Neural network ensembles, cross validation, and active learning. In G. Tesauro, D. Touretzky, and T. Leen, editors, Advances in Neural Information Processing Systems, volume 7. MIT Press,

  28. [28]

    URL https://proceedings.neurips.cc/paper_files/paper/1994/file/b8c37e33defde51cf91e1e03e51657da-Paper.pdf

  29. [29]

    Openml benchmarking suites and the openml100

    B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. Mantovani, J. N. van Rijn, and J. Vanschoren. Openml benchmarking suites and the openml100. arXiv:1708.03731v1 [stat.ML], 2019

  30. [30]

    OpenML benchmarking suites

    B. Bischl, G. Casalicchio, M. Feurer, F. Hutter, M. Lang, R. Mantovani, J. van Rijn, and J. Vanschoren. OpenML benchmarking suites. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks, 2021

  31. [31]

    Revisiting deep learning models for tabular data

    Y. Gorishniy, I. Rubachev, V. Khrulkov, and A. Babenko. Revisiting deep learning models for tabular data. In M. Ranzato, A. Beygelzimer, K. Nguyen, P. Liang, J. Vaughan, and Y. Dauphin, editors, Proceedings of the 34th International Conference on Advances in Neural Information Processing Systems (NeurIPS’21), pages 18932–18943, 2021

  32. [32]

    Tabular data: Deep learning is not all you need

    Ravid Shwartz-Ziv and Amitai Armon. Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90, 2022

  33. [33]

    Why do tree-based models still outperform deep learning on typical tabular data?

    L. Grinsztajn, E. Oyallon, and G. Varoquaux. Why do tree-based models still outperform deep learning on typical tabular data? In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Proceedings of the 35th International Conference on Advances in Neural Information Processing Systems (NeurIPS’22), 2022

  34. [34]

    When do neural nets outperform boosted trees on tabular data?

    D. McElfresh, S. Khandagale, J. Valverde, V. Prasad C, G. Ramakrishnan, M. Goldblum, and C. White. When do neural nets outperform boosted trees on tabular data? In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors, Proceedings of the 36th International Conference on Advances in Neural Information Processing Systems (NeurIPS’23),...

  35. [35]

    Openml-ctr23 – a curated tabular regression benchmarking suite

    S. Fischer, L. Harutyunyan, M. Feurer, and B. Bischl. Openml-ctr23 – a curated tabular regression benchmarking suite. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors,Second International Conference on Automated Machine Learning - Workshop Track, 2023

  36. [36]

    Amlb: an automl benchmark

    P. Gijsbers, M. Bueno, S. Coors, E. LeDell, S. Poirier, J. Thomas, B. Bischl, and J. Vanschoren. Amlb: an automl benchmark. Journal of Machine Learning Research, 25(101):1–65, 2024

  37. [37]

    A closer look at deep learning on tabular data

    Han-Jia Ye, Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and De-Chuan Zhan. A closer look at deep learning on tabular data. arXiv preprint arXiv:2407.00956, 2024

  38. [38]

    Tabrepo: A large scale repository of tabular model evaluations and its automl applications

    David Salinas and Nick Erickson. Tabrepo: A large scale repository of tabular model evaluations and its automl applications. In AutoML Conference 2024 (ABCD Track), 2024

  39. [39]

    The proposed uscf rating system, its development, theory, and applications

    Arpad E Elo. The proposed uscf rating system, its development, theory, and applications. Chess life, 22(8):242–247, 1967

  40. [40]

    Chatbot arena: An open platform for evaluating llms by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024

  41. [41]

    Hardware aware ensemble selection for balancing predictive accuracy and cost

    J. Maier, F. Möller, and L. Purucker. Hardware aware ensemble selection for balancing predictive accuracy and cost. In M. Lindauer, K. Eggensperger, R. Garnett, J. Vanschoren, and J. Gardner, editors, Third International Conference on Automated Machine Learning - Workshop Track, 2024

  42. [42]

    Reshuffling resampling splits can improve generalization of hyperparameter optimization

    T. Nagler, L. Schneider, B. Bischl, and M. Feurer. Reshuffling resampling splits can improve generalization of hyperparameter optimization. In Proceedings of the 37th International Conference on Advances in Neural Information Processing Systems (NeurIPS’24), 2024

  43. [43]

    Overtuning in hyperparameter optimization

    Lennart Schneider, Bernd Bischl, and Matthias Feurer. Overtuning in hyperparameter optimization. In International Conference on Automated Machine Learning, 2025

  44. [44]

    Q(d)o-es: Population-based quality (diversity) optimisation for post hoc ensemble selection in automl

    L. Purucker, L. Schneider, M. Anastacio, J. Beel, B. Bischl, and H. Hoos. Q(d)o-es: Population-based quality (diversity) optimisation for post hoc ensemble selection in automl. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors, Proceedings of the Second International Conference on Automated Machine Learning. Proceedings of Machine Lear...

  45. [45]

    Cma-es for post hoc ensembling in automl: A great success and salvageable failure

    L. Purucker and J. Beel. Cma-es for post hoc ensembling in automl: A great success and salvageable failure. In A. Faust, C. White, F. Hutter, R. Garnett, and J. Gardner, editors, Proceedings of the Second International Conference on Automated Machine Learning. Proceedings of Machine Learning Research, 2023

  46. [46]

    Pytorch tabular: A framework for deep learning with tabular data

    Manu Joseph. Pytorch tabular: A framework for deep learning with tabular data. arXiv preprint arXiv:2104.13638, 2021

  47. [47]

    PMLB: a large benchmark suite for machine learning evaluation and comparison

    R. Olson, W. La Cava, P. Orzechowski, R. Urbanowicz, and J. Moore. PMLB: a large benchmark suite for machine learning evaluation and comparison. BioData mining, 10:1–13, 2017

  48. [48]

    Pmlb v1.0: an open-source dataset collection for benchmarking machine learning methods

    Joseph D Romano, Trang T Le, William La Cava, John T Gregg, Daniel J Goldberg, Praneel Chakraborty, Natasha L Ray, Daniel Himmelstein, Weixuan Fu, and Jason H Moore. Pmlb v1.0: an open-source dataset collection for benchmarking machine learning methods. Bioinformatics, 38(3):878–880, 2022

  49. [49]

    Pmlbmini: A tabular classification benchmark suite for data-scarce applications

    Ricardo Knauer, Marvin Grimm, and Erik Rodner. Pmlbmini: A tabular classification benchmark suite for data-scarce applications. In AutoML Conference 2024 (ABCD Track), 2024

  50. [50]

    Is deep learning finally better than decision trees on tabular data?

    Guri Zabërgja, Arlind Kadra, Christian Frey, and Josif Grabocka. Is deep learning finally better than decision trees on tabular data? arXiv preprint arXiv:2402.03970, 2024

  51. [51]

    Talent: A tabular analytics and learning toolbox

    Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, and Han-Jia Ye. Talent: A tabular analytics and learning toolbox. arXiv preprint arXiv:2407.04057, 2024

  52. [52]

    Tabr: Tabular deep learning meets nearest neighbors

    Yury Gorishniy, Ivan Rubachev, Nikolay Kartashev, Daniil Shlenskii, Akim Kotelnikov, and Artem Babenko. Tabr: Tabular deep learning meets nearest neighbors. In The Twelfth International Conference on Learning Representations, 2024

  53. [53]

    Rewardbench: Evaluating reward models for language modeling

    Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, et al. Rewardbench: Evaluating reward models for language modeling. arXiv preprint arXiv:2403.13787, 2024

  54. [54]

    LiveBench: A Challenging, Contamination-Limited LLM Benchmark

    Colin White, Samuel Dooley, Manley Roberts, Arka Pal, Ben Feuer, Siddhartha Jain, Ravid Shwartz-Ziv, Neel Jain, Khalid Saifullah, Siddartha Naidu, et al. Livebench: A challenging, contamination-free llm benchmark. arXiv preprint arXiv:2406.19314, 2024

  55. [55]

    A framework for few-shot language model evaluation, September 2021

    Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. A framework for few-shot language model evaluation, September 2021. URL https://doi.org/10.5281/zenodo.5371628

  56. [56]

    Open llm leaderboard v2

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Szafer, and Thomas Wolf. Open llm leaderboard v2. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard, 2024

  57. [57]

    Gift-eval: A benchmark for general time series forecasting model evaluation

    Taha Aksu, Gerald Woo, Juncheng Liu, Xu Liu, Chenghao Liu, Silvio Savarese, Caiming Xiong, and Doyen Sahoo. Gift-eval: A benchmark for general time series forecasting model evaluation. arXiv preprint arXiv:2410.10393, 2024

  58. [58]

    Accounting for variance in machine learning benchmarks

    X. Bouthillier, P. Delaunay, M. Bronzi, A. Trofimov, B. Nichyporuk, J. Szeto, N. Mohammadi Sepahvand, E. Raff, K. Madan, V. Voleti, S. Ebrahimi Kahou, V. Michalski, T. Arbel, C. Pal, G. Varoquaux, and P. Vincent. Accounting for variance in machine learning benchmarks. In A. Smola, A. Dimakis, and I. Stoica, editors, Proceedings of Machine Learning and S...

  59. [59]

    J. Demšar. Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7:1–30, 2006

  60. [60]

    S. Herbold. Autorank: A python package for automated ranking of classifiers. Journal of Open Source Software, 5(48):2173–2173, 2020

  61. [61]

    A closer look at tabpfn v2: Strength, limitation, and extension

    Han-Jia Ye, Si-Yang Liu, and Wei-Lun Chao. A closer look at tabpfn v2: Strength, limitation, and extension. arXiv preprint arXiv:2502.17361, 2025

  62. [62]

    OpenML-Python: an extensible Python API for OpenML

    M. Feurer, J. van Rijn, A. Kadra, P. Gijsbers, N. Mallik, S. Ravi, A. Müller, J. Vanschoren, and F. Hutter. OpenML-Python: an extensible Python API for OpenML. Journal of Machine Learning Research, 22(100):1–5, 2021

  63. [63]

    Openml: Insights from 10 years and more than a thousand papers

    Bernd Bischl, Giuseppe Casalicchio, Taniya Das, Matthias Feurer, Sebastian Fischer, Pieter Gijsbers, Subhaditya Mukherjee, Andreas C Müller, László Németh, Luis Oala, et al. Openml: Insights from 10 years and more than a thousand papers. Patterns, 2025

  64. [64]

    Turl: Table understanding through representation learning

    Xiang Deng, Huan Sun, Alyssa Lees, You Wu, and Cong Yu. Turl: Table understanding through representation learning. ACM SIGMOD Record, 51(1):33–40, 2022

  65. [65]

    Gittables: A large-scale corpus of relational tables

    Madelon Hulsebos, Çagatay Demiralp, and Paul Groth. Gittables: A large-scale corpus of relational tables. Proceedings of the ACM on Management of Data, 1(1):1–17, 2023

  66. [66]

    Knowledge discovery on rfm model using bernoulli sequence

    I-Cheng Yeh, King-Jang Yang, and Tao-Ming Ting. Knowledge discovery on rfm model using bernoulli sequence. Expert Systems with applications, 36(3):5866–5871, 2009

  67. [67]

    Using the adap learning algorithm to forecast the onset of diabetes mellitus

    Jack W Smith, James E Everhart, William C Dickson, William C Knowler, and Robert Scott Johannes. Using the adap learning algorithm to forecast the onset of diabetes mellitus. In Proceedings of the annual symposium on computer application in medical care, page 261, 1988

  68. [68]

    Annealing

    Unknown. Annealing. https://doi.org/10.24432/C5RW2F, 1990. UCI Machine Learning Repository

  69. [69]

    A similarity-based qsar model for predicting acute toxicity towards the fathead minnow (pimephales promelas)

    Matteo Cassotti, Davide Ballabio, Roberto Todeschini, and Viviana Consonni. A similarity-based qsar model for predicting acute toxicity towards the fathead minnow (pimephales promelas). SAR and QSAR in Environmental Research, 26(3):217–243, 2015

  70. [70]

    H. Hofmann. Statlog (german credit data) [dataset]. https://doi.org/10.24432/C5NC77,

  71. [72]

    Review and analysis of risk factor of maternal health in remote area using the internet of things (iot)

    Marzia Ahmed, Mohammod Abul Kashem, Mostafijur Rahman, and Sabira Khatun. Review and analysis of risk factor of maternal health in remote area using the internet of things (iot). In InECCE2019: Proceedings of the 5th International Conference on Electrical, Control & Computer Engineering, Kuantan, Pahang, Malaysia, 29th July 2019, pages 357–365. Springer, 2020

  72. [73]

    Modeling of strength of high-performance concrete using artificial neural networks

    I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete research, 28(12):1797–1808, 1998

  73. [74]

    Quantitative structure–activity relationship models for ready biodegradability of chemicals

    Kamel Mansouri, Tine Ringsted, Davide Ballabio, Roberto Todeschini, and Viviana Consonni. Quantitative structure–activity relationship models for ready biodegradability of chemicals. Journal of chemical information and modeling, 53(4):867–878, 2013

  74. [75]

    Healthcare insurance expenses

    Kaggle User Arunjangir245. Healthcare insurance expenses. https://www.kaggle.com/datasets/arunjangir245/healthcare-insurance-expenses/, 2023. Kaggle dataset

  75. [76]

    Phishing detection based associative classification data mining.Expert Systems with Applications, 41(13):5948–5959, 2014

    Neda Abdelhamid, Aladdin Ayesh, and Fadi Thabtah. Phishing detection based associative classification data mining.Expert Systems with Applications, 41(13):5948–5959, 2014

  76. [77]

    Fitness club dataset for ml classification

    Kaggle User Ddosad. Fitness club dataset for ml classification. https://www.kaggle.com/datasets/ddosad/datacamps-data-science-associate-certification, 2023. Kaggle dataset

  77. [78]

    Airfoil self-noise and prediction

    Thomas F Brooks, D Stuart Pope, and Michael A Marcolini. Airfoil self-noise and prediction. Technical report, 1989

  78. [79]

    Another dataset on used fiat 500 (1538 rows)

    Kaggle User Paolocons. Another dataset on used fiat 500 (1538 rows). https://www.kaggle.com/datasets/paolocons/another-fiat-500-dataset-1538-rows, 2020. Kaggle dataset

  79. [80]

    Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data

    Sergey E Golovenkin, Jonathan Bac, Alexander Chervov, Evgeny M Mirkes, Yuliya V Orlova, Emmanuel Barillot, Alexander N Gorban, and Andrei Zinovyev. Trajectories, bifurcations, and pseudo-time in large clinical datasets: applications to myocardial infarction and diabetes data. GigaScience, 9(11):giaa128, 2020

  80. [81]

    Is this a good customer?

    Kaggle User Podsyp. Is this a good customer? https://www.kaggle.com/datasets/podsyp/is-this-a-good-customer, 2020. Kaggle dataset

Showing first 80 references.