
arxiv: 2605.18696 · v1 · pith:YZBXPWEAnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: tabular foundation models · ensembling · Q-statistic · model diversity · calibration · stacking · OpenML tasks

The pith

Six tabular foundation models are nearly redundant, so ensembles add at most 0.18% accuracy at 253 times the cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that six modern tabular foundation models behave as a near-redundant set because their mean pairwise Q-statistic reaches 0.961. On 153 OpenML classification tasks, every tested ensemble therefore stays close to the performance of the single strongest model. The best method, two-level cascade stacking, improves accuracy by only 0.18 percent while multiplying compute by 253. Logistic-regression stacking raises accuracy yet destroys calibration and produces the worst log-loss rank. A statistical comparison groups the top three ensembles with the best base model, leaving the remaining ensembles clearly inferior. Greedy single-model selection is therefore offered as the practical default.

Core claim

Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961, close enough to 1 that any convex combination of their predictions can exceed the best single model only marginally. Benchmarking six ensemble strategies over six TFMs on 153 OpenML classification tasks shows that the best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253 times the compute. Stacking with a logistic-regression meta-learner improves accuracy and ROC-AUC but ranks worst on log-loss because it sharpens class boundaries and destroys calibration. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse.
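The calibration trap can be illustrated with a toy case. The sketch below is not the paper's logistic-regression stacker: it assumes binary labels and uses temperature sharpening of the logits as a stand-in for a meta-learner that pushes probabilities toward 0 or 1. All names and numbers are hypothetical. The predicted class never changes, so accuracy is identical, yet log-loss gets worse because the wrong predictions become confidently wrong.

```python
import math

def accuracy(probs, labels):
    """Fraction of samples whose thresholded prediction matches the label."""
    return sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels)) / len(labels)

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)  # guard against log(0)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)

def sharpen(probs, temperature):
    """Divide each logit by `temperature`; values below 1 push p toward 0 or 1."""
    out = []
    for p in probs:
        logit = math.log(p / (1 - p))
        out.append(1 / (1 + math.exp(-logit / temperature)))
    return out

# Hypothetical toy data: the model says 0.7 and is right 70% of the time,
# so the original probabilities are well calibrated.
labels = [1] * 7 + [0] * 3
probs = [0.7] * 10
sharp = sharpen(probs, temperature=0.25)

print("accuracy:", accuracy(probs, labels), "->", accuracy(sharp, labels))
print("log-loss:", log_loss(probs, labels), "->", log_loss(sharp, labels))
```

Any metric that only looks at the argmax is blind to this change, which is why accuracy and log-loss ranks can disagree.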

What carries the argument

The mean pairwise Q-statistic, which quantifies agreement between the models' predictions and thereby bounds the diversity available for ensembling.
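For reference, the textbook definition of Yule's Q for two classifiers i and j counts test points by correctness: N11 (both right), N00 (both wrong), N10 (only i right), N01 (only j right), and sets $Q_{ij} = (N^{11}N^{00} - N^{01}N^{10}) / (N^{11}N^{00} + N^{01}N^{10})$. The sketch below implements that definition and a plain mean over model pairs. How the paper aggregates across its 153 tasks is not stated on this page, so the averaging, the function names, and the handling of undefined pairs are assumptions.

```python
import math
from itertools import combinations

def q_statistic(correct_i, correct_j):
    """Yule's Q from per-sample correctness flags (truthy = correct)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_i, correct_j):
        if a and b:
            n11 += 1
        elif not a and not b:
            n00 += 1
        elif a:
            n10 += 1
        else:
            n01 += 1
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        return float("nan")  # undefined, e.g. both models are always right
    return (n11 * n00 - n01 * n10) / denom

def mean_pairwise_q(correct_by_model):
    """Average Q over all model pairs, skipping undefined pairs (an assumption)."""
    values = [
        q
        for q in (q_statistic(a, b) for a, b in combinations(correct_by_model, 2))
        if not math.isnan(q)
    ]
    return sum(values) / len(values) if values else float("nan")
```

Q is 1 when two models never err alone on different points, 0 when their errors are statistically independent, and negative when they tend to err on different points, which is the regime where ensembling helps most.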

Load-bearing premise

The six chosen TFMs and 153 OpenML tasks sufficiently represent the space of current tabular foundation models and tasks.

What would settle it

A new TFM whose predictions show a mean pairwise Q-statistic below 0.85 with the existing six on a comparable set of tasks would falsify the near-redundancy claim.

Original abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly \emph{worse} than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
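The abstract does not spell out what "greedy selection" means. One common reading is forward ensemble selection with replacement in the style of Caruana et al.: start empty, repeatedly add whichever model most improves a validation metric for the uniform blend, and stop when nothing helps. The sketch below assumes that reading, binary tasks, and log-loss as the metric. The function and model names are hypothetical. With a redundant pool the procedure naturally collapses to a single model.

```python
import math

def log_loss(probs, labels, eps=1e-12):
    """Mean negative log-likelihood for binary labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)

def greedy_select(val_probs, labels, max_rounds=10):
    """val_probs maps a model name to its P(y=1) on a validation split.

    Returns the chosen model names (repeats allowed) and the final loss.
    """
    chosen, blend, best = [], None, float("inf")
    for _ in range(max_rounds):
        candidate, candidate_blend = None, None
        k = len(chosen)
        for name, probs in val_probs.items():
            # Uniform average of the k picks so far plus this candidate.
            trial = probs if blend is None else [
                (b * k + p) / (k + 1) for b, p in zip(blend, probs)
            ]
            loss = log_loss(trial, labels)
            if loss < best - 1e-12:  # require a strict improvement
                best, candidate, candidate_blend = loss, name, trial
        if candidate is None:
            break  # no model improves the blend
        chosen.append(candidate)
        blend = candidate_blend
    return chosen, best

# Hypothetical redundant pool: tfm_b is a slightly less confident copy of tfm_a.
labels = [1, 1, 0, 0, 1, 0]
pool = {
    "tfm_a": [0.90, 0.80, 0.20, 0.10, 0.70, 0.30],
    "tfm_b": [0.85, 0.75, 0.25, 0.15, 0.65, 0.35],
}
print(greedy_select(pool, labels))
```

Selecting on a held-out split rather than the training data matters here; otherwise the selection step can overfit in the same way a stacked meta-learner can.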

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that six modern tabular foundation models form a near-redundant pool (mean pairwise Q-statistic of 0.961), such that ensembles yield negligible gains. Benchmarking six ensemble strategies on 153 OpenML classification tasks shows the best performer (two-level cascade stacking) improves accuracy by only +0.18% over the strongest single TFM at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best base model in one equivalence group, while three ensembles perform significantly worse; stacking with logistic regression is highlighted as creating a calibration trap by improving accuracy at the expense of log-loss.

Significance. If the empirical findings hold, the work has clear practical value for the tabular ML community by documenting a diversity ceiling among current TFMs and recommending greedy selection over complex ensembling. Credit is due for the direct benchmarking against published TFMs on 153 tasks and the use of Friedman-Nemenyi post-hoc tests to establish equivalence groups rather than relying on raw averages.
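For readers unfamiliar with how Friedman-Nemenyi equivalence groups arise: each method is ranked per task, ranks are averaged across tasks, and two methods are declared indistinguishable when their average ranks differ by less than a critical difference $CD = q_\alpha \sqrt{k(k+1)/(6N)}$. The sketch below computes both quantities. The $q_{0.05}$ values are the standard ones tabulated by Demšar (2006). Taking k = 7 (six ensembles plus the best base TFM) and a significance level of 0.05 is an assumption read off the summary, not something the paper confirms here.

```python
import math

# Two-tailed Nemenyi critical values q_0.05, keyed by number of methods k.
Q_ALPHA_05 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728, 6: 2.850,
              7: 2.949, 8: 3.031, 9: 3.102, 10: 3.164}

def average_ranks(scores):
    """scores[t][m] is method m's score on task t, higher is better.

    Rank 1 is best; tied methods share the mean of the ranks they span.
    """
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda m: -row[m])
        pos = 0
        while pos < k:
            end = pos
            while end + 1 < k and row[order[end + 1]] == row[order[pos]]:
                end += 1
            shared = (pos + end) / 2 + 1  # mean of 1-based ranks pos+1..end+1
            for m in order[pos:end + 1]:
                totals[m] += shared
            pos = end + 1
    return [t / len(scores) for t in totals]

def nemenyi_cd(k, n_tasks):
    """Critical difference in average rank at alpha = 0.05."""
    return Q_ALPHA_05[k] * math.sqrt(k * (k + 1) / (6 * n_tasks))

# Assumed setting: 7 compared methods on 153 tasks.
print(round(nemenyi_cd(7, 153), 3))
```

Under these assumptions, methods whose average ranks lie within roughly 0.73 of each other would fall in the same group.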

major comments (3)
  1. [Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.
  2. [Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.
  3. [Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.
minor comments (2)
  1. [Abstract] The abstract states that three ensembles are significantly worse but does not name them or report the exact Nemenyi critical differences or p-values.
  2. [Experimental Setup] Hyperparameter details for the base TFMs and the meta-learners in the stacking variants are not provided, limiting reproducibility of the calibration-trap observation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and describe the corresponding revisions planned for the manuscript.

  1. Referee: [Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.

    Authors: The six TFMs were selected as the leading publicly available models with published strong performance on tabular tasks at the time of writing. The 153 tasks are drawn from public OpenML classification benchmarks used across multiple prior tabular studies for comparability. We will add an explicit subsection on selection criteria for both models and tasks, including a short coverage argument addressing dataset size, dimensionality, and class balance to better support generalization of the diversity-ceiling observation. revision: yes

  2. Referee: [Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.

    Authors: We agree that variability estimates would improve robustness. The revised manuscript will include bootstrap 95% confidence intervals for the reported accuracy gains and mean Q-statistic, as well as standard deviations of the average ranks across the 153 tasks. revision: yes

  3. Referee: [Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.

    Authors: The study evaluated a range of strategies to determine whether any could meaningfully exceed single-model performance. The key result is that even the strongest observed ensemble yields only +0.18% and that three ensembles are significantly worse. To address the post-hoc concern we will designate two-level cascade stacking as the primary ensemble in the revised text, note the exploratory status of the remaining comparisons, and apply a Bonferroni correction to the Friedman-Nemenyi post-hoc tests. revision: partial
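Response 2 promises bootstrap 95% confidence intervals. A minimal sketch of what that could look like is a percentile bootstrap that resamples tasks with replacement and recomputes the mean per-task gain. Treating the task as the resampling unit, the function name, and the toy gains are all assumptions; none of the numbers below come from the paper.

```python
import random

def bootstrap_ci(values, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(values)
    means = sorted(
        sum(rng.choices(values, k=n)) / n  # resample tasks with replacement
        for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return means[lo_idx], means[hi_idx]

# Hypothetical per-task accuracy gains (percentage points) of an ensemble
# over the best single model.
toy_gains = [0.4, -0.1, 0.3, 0.0, 0.2, -0.2, 0.5, 0.1]
low, high = bootstrap_ci(toy_gains)
print(f"mean gain {sum(toy_gains) / len(toy_gains):.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

If an interval computed this way on the real per-task gains straddles zero, the +0.18% headline would be hard to distinguish from sampling noise, which is the referee's concern.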

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study


The paper reports direct empirical measurements: pairwise Q-statistics computed from the six TFMs' predictions on 153 external OpenML tasks, plus accuracy/ROC-AUC/log-loss ranks for six ensemble strategies versus single models. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction; the Q-statistic of 0.961 and the +0.18% ensemble gain are observed quantities, not self-defined or statistically forced outputs. The study relies on published TFMs and public datasets rather than any self-citation chain or ansatz, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the Q-statistic adequately bounds ensemble performance and that the chosen benchmark suite is representative; no free parameters or new entities are introduced.

axioms (1)
  • domain assumption: Mean pairwise Q-statistic near 1 implies convex combinations cannot exceed the best base model by more than a small margin
    Invoked to conclude that ensembles are bounded above

pith-pipeline@v0.9.0 · 5754 in / 1259 out tokens · 42661 ms · 2026-05-20T12:54:18.908483+00:00 · methodology


