Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Aditya Tanna; Mohamed Bouadi; Nassim Bouarour; Pratinav Seth; Vinay Kumar Sankarapu; Yash Desai

arxiv: 2605.18696 · v1 · pith:YZBXPWEAnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 12:54 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: tabular foundation models · ensembling · Q-statistic · model diversity · calibration · stacking · OpenML tasks

The pith

Six tabular foundation models are nearly redundant, so ensembles add at most 0.18% accuracy at 253 times the cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that six modern tabular foundation models behave as a near-redundant set because their mean pairwise Q-statistic reaches 0.961. On 153 OpenML classification tasks, every tested ensemble therefore stays close to the performance of the single strongest model. The best method, two-level cascade stacking, improves accuracy by only 0.18 percent while multiplying compute by 253. Logistic-regression stacking raises accuracy yet destroys calibration and produces the worst log-loss rank. A statistical comparison groups the top three ensembles with the best base model, leaving the remaining ensembles clearly inferior. Greedy single-model selection is therefore offered as the practical default.

Core claim

Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961, close enough to 1 that any convex combination is bounded above. Benchmarking six ensemble strategies over six TFMs on 153 OpenML classification tasks shows the best ensemble, two-level cascade stacking, buys +0.18% accuracy over the strongest single TFM at 253 times the compute. Stacking with a logistic-regression meta-learner improves accuracy and ROC-AUC but ranks worst on log-loss because it sharpens class boundaries and destroys calibration. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse.

What carries the argument

Mean pairwise Q-statistic that quantifies agreement between model predictions and thereby bounds the diversity available for ensembling.
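As a rough illustration of how such a diversity score can be computed, the sketch below implements Yule's Q on per-sample correctness flags, in the form popularized by Kuncheva and Whitaker, and averages it over all model pairs. The page does not spell out the exact formula the paper uses, so this form is an assumption; the function names and the toy correctness flags are hypothetical and are not data from the paper.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, from per-sample correctness flags (1 = correct)."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both models correct
        elif not a and not b:
            n00 += 1          # both models wrong
        elif a:
            n10 += 1          # only the first model correct
        else:
            n01 += 1          # only the second model correct
    num = n11 * n00 - n01 * n10
    den = n11 * n00 + n01 * n10
    # Q is undefined when the denominator is zero; return NaN in that case.
    return num / den if den else float("nan")


def mean_pairwise_q(correct_by_model):
    """Average Q over all unordered pairs of models."""
    names = sorted(correct_by_model)
    qs = [q_statistic(correct_by_model[i], correct_by_model[j])
          for i, j in combinations(names, 2)]
    return sum(qs) / len(qs)


# Synthetic correctness flags for three hypothetical models on ten samples.
toy = {
    "model_a": [1, 1, 1, 1, 0, 0, 1, 1, 0, 1],
    "model_b": [1, 1, 1, 1, 0, 0, 1, 0, 0, 1],
    "model_c": [1, 1, 0, 1, 0, 1, 1, 1, 0, 1],
}
print(round(mean_pairwise_q(toy), 3))
```

A value near 1 means the models tend to be right and wrong on the same samples, a value near 0 means their errors are roughly independent, and negative values mean they err on different samples. Under this definition the reported 0.961 would indicate that the six models fail on largely the same cases.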

Load-bearing premise

The six chosen TFMs and 153 OpenML tasks sufficiently represent the space of current tabular foundation models and tasks.

What would settle it

A new TFM whose predictions show a mean pairwise Q-statistic below 0.85 with the existing six on a comparable set of tasks would falsify the near-redundancy claim.

Original abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly \emph{worse} than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
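The calibration trap can be seen in a toy setting. The sketch below is not the paper's stacking procedure: it simply pushes binary probabilities toward 0 or 1 with a hypothetical `sharpen` helper, which leaves every hard decision (and therefore accuracy) unchanged while making the wrong predictions more confident. The labels and probabilities are synthetic.

```python
import math


def accuracy(probs, labels):
    """Fraction of samples where thresholding at 0.5 matches the label."""
    hits = sum((p >= 0.5) == bool(y) for p, y in zip(probs, labels))
    return hits / len(labels)


def log_loss(probs, labels, eps=1e-15):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1 - eps)
        total += -math.log(p) if y else -math.log(1 - p)
    return total / len(labels)


def sharpen(p, temperature):
    """Push a probability toward 0 or 1; temperature < 1 sharpens, 0.5 stays fixed."""
    a = p ** (1 / temperature)
    b = (1 - p) ** (1 / temperature)
    return a / (a + b)


# Synthetic example: four correct predictions and two confident-ish mistakes.
labels = [1, 1, 1, 0, 0, 1]
probs = [0.8, 0.7, 0.6, 0.3, 0.6, 0.4]
sharp = [sharpen(p, 0.25) for p in probs]

print(accuracy(probs, labels), accuracy(sharp, labels))   # identical
print(log_loss(probs, labels), log_loss(sharp, labels))   # second is larger
```

Because log-loss penalizes confident errors heavily, the sharpened probabilities score worse on log-loss even though the decisions are identical. This is one plausible mechanism behind a method ranking well on accuracy and poorly on log-loss.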

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Six TFMs show high pairwise agreement so ensembles add almost nothing while some ruin calibration.

The main thing to know is that these six tabular foundation models have a mean pairwise Q-statistic of 0.961, which means their predictions overlap so much that convex combinations cannot improve much. On 153 OpenML classification tasks the best ensemble (two-level cascade stacking) gains only 0.18% accuracy over the strongest single TFM and costs 253 times the compute; three of the six ensemble methods are statistically worse than the best base model under Friedman-Nemenyi testing. Logistic-regression stacking keeps accuracy but tanks log-loss because it sharpens boundaries at the expense of calibration. The paper recommends greedy selection as the default in practice.

What is new is the concrete quantification of diversity limits and the calibration side-effect using established metrics on a sizable set of published TFMs. The Friedman-Nemenyi equivalence groups and the explicit compute-accuracy trade-off are useful for anyone deciding whether to bother with ensembles. The work is straightforward and stays on the data rather than overclaiming.

The soft spot is representativeness. The high Q-statistic and tiny gains could be specific to these six models and the 153 tasks chosen; if the models share pretraining data or similar inductive biases, the redundancy may not generalize to the broader space of tabular foundation models. The abstract also omits error bars and exact hyperparameter choices, though the overall pattern looks consistent within the sample.

This is for practitioners who already use or are evaluating TFMs and want to know when ensembling is likely to be a waste of resources. It is worth sending to peer review because the empirical findings are grounded and directly relevant to deployment decisions even if the scope is narrow.

Referee Report

3 major / 2 minor

Summary. The paper claims that six modern tabular foundation models form a near-redundant pool (mean pairwise Q-statistic of 0.961), such that ensembles yield negligible gains. Benchmarking six ensemble strategies on 153 OpenML classification tasks shows the best performer (two-level cascade stacking) improves accuracy by only +0.18% over the strongest single TFM at 253× compute cost. Friedman-Nemenyi analysis places three ensembles and the best base model in one equivalence group, while three ensembles perform significantly worse; stacking with logistic regression is highlighted as creating a calibration trap by improving accuracy at the expense of log-loss.

Significance. If the empirical findings hold, the work has clear practical value for the tabular ML community by documenting a diversity ceiling among current TFMs and recommending greedy selection over complex ensembling. Credit is due for the direct benchmarking against published TFMs on 153 tasks and the use of Friedman-Nemenyi post-hoc tests to establish equivalence groups rather than relying on raw averages.
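For readers unfamiliar with the Friedman-Nemenyi procedure, the following sketch shows the usual recipe: rank the methods within each task, average the ranks, compute the Friedman chi-square statistic (without tie correction), and compare rank differences against the Nemenyi critical difference. The score matrix is synthetic, and the choice of k = 7 methods (six ensembles plus the best base TFM) with the tabulated critical value q = 2.949 at alpha = 0.05 is an assumption made for illustration; the page does not state how many methods the paper ranks jointly.

```python
import math


def average_ranks(scores):
    """scores[t][j] = score of method j on task t (higher is better).
    Returns the mean rank per method; rank 1 is best, ties share the mean rank."""
    k = len(scores[0])
    totals = [0.0] * k
    for row in scores:
        order = sorted(range(k), key=lambda j: -row[j])
        i = 0
        while i < k:
            j = i
            # Extend j over a run of tied scores.
            while j + 1 < k and row[order[j + 1]] == row[order[i]]:
                j += 1
            mean_rank = (i + j) / 2 + 1
            for m in range(i, j + 1):
                totals[order[m]] += mean_rank
            i = j + 1
    return [t / len(scores) for t in totals]


def friedman_chi2(avg_ranks, n_tasks):
    """Friedman chi-square statistic computed from average ranks (no tie correction)."""
    k = len(avg_ranks)
    sum_sq = sum(r * r for r in avg_ranks)
    return 12 * n_tasks / (k * (k + 1)) * (sum_sq - k * (k + 1) ** 2 / 4)


def nemenyi_cd(k, n_tasks, q_alpha):
    """Critical difference: two methods differ if their mean ranks differ by more than this."""
    return q_alpha * math.sqrt(k * (k + 1) / (6 * n_tasks))


# Synthetic scores for three hypothetical methods on three tasks.
toy_scores = [[0.90, 0.80, 0.70],
              [0.85, 0.80, 0.60],
              [0.70, 0.90, 0.70]]
ranks = average_ranks(toy_scores)
print(ranks, friedman_chi2(ranks, len(toy_scores)))

# Assumed setting: 7 methods, 153 tasks, q_0.05 = 2.949 from the standard table.
print(nemenyi_cd(7, 153, 2.949))
```

Methods whose mean ranks lie within one critical difference of each other fall into the same equivalence group, which is how a single base model can be statistically indistinguishable from several ensembles.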

major comments (3)

[Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.
[Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.
[Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.

minor comments (2)

[Abstract] The abstract states that three ensembles are significantly worse but does not name them or report the exact Nemenyi critical differences or p-values.
[Experimental Setup] Hyperparameter details for the base TFMs and the meta-learners in the stacking variants are not provided, limiting reproducibility of the calibration-trap observation.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and describe the corresponding revisions planned for the manuscript.

Referee: [Experimental Setup] The central generalization that the observed Q-statistic of 0.961 demonstrates an inherent diversity ceiling (rather than a property of this particular sample) rests on the representativeness of the six chosen TFMs and 153 OpenML tasks; no explicit selection criteria or coverage argument is provided to support extension beyond this pool.

Authors: The six TFMs were selected as the leading publicly available models with published strong performance on tabular tasks at the time of writing. The 153 tasks are the standard OpenML-CC18 classification benchmark used across multiple prior tabular studies for comparability. We will add an explicit subsection on selection criteria for both models and tasks, including a short coverage argument addressing dataset size, dimensionality, and class balance to better support generalization of the diversity-ceiling observation.
Revision: yes

Referee: [Results] No error bars, standard deviations, or confidence intervals accompany the reported accuracy gains, Q-statistic, or Friedman-Nemenyi ranks; this weakens the claim that the +0.18% improvement and equivalence-group findings are robust rather than sensitive to sampling variability.

Authors: We agree that variability estimates would improve robustness. The revised manuscript will include bootstrap 95% confidence intervals for the reported accuracy gains and mean Q-statistic, as well as standard deviations of the average ranks across the 153 tasks.
Revision: yes

Referee: [Results] The manuscript does not verify that post-hoc selection of the best ensemble among the six tested strategies did not inflate the reported gains; without a pre-specified primary ensemble or correction for multiple comparisons, the conclusion that ensembles are bounded above is harder to interpret.

Authors: The study evaluated a range of strategies to determine whether any could meaningfully exceed single-model performance. The key result is that even the strongest observed ensemble yields only +0.18% and that three ensembles are significantly worse. To address the post-hoc concern we will designate two-level cascade stacking as the primary ensemble in the revised text, note the exploratory status of the remaining comparisons, and apply a Bonferroni correction to the Friedman-Nemenyi post-hoc tests.
Revision: partial
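The bootstrap confidence intervals promised in the second response could be computed along the following lines. This is a minimal percentile-bootstrap sketch that resamples tasks with replacement; the function name, the number of resamples, and the per-task gains are all hypothetical, and the numbers are synthetic rather than taken from the paper.

```python
import random


def bootstrap_ci(values, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)          # fixed seed so the sketch is reproducible
    n = len(values)
    means = sorted(sum(rng.choices(values, k=n)) / n for _ in range(n_boot))
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi


# Synthetic per-task accuracy gains (percentage points) of an ensemble over the best base model.
gains = [0.4, -0.1, 0.0, 0.3, -0.2, 0.1, 0.5, 0.0, -0.3, 0.2]
low, high = bootstrap_ci(gains)
print(sum(gains) / len(gains), (low, high))
```

If the resulting interval for the mean gain includes zero, a small average improvement cannot be distinguished from sampling variability across tasks, which is the concern the referee raises.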

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

The paper reports direct empirical measurements: pairwise Q-statistics computed from the six TFMs' predictions on 153 external OpenML tasks, plus accuracy/ROC-AUC/log-loss ranks for six ensemble strategies versus single models. No equations, fitted parameters, or derivations are presented that reduce to the inputs by construction; the Q-statistic of 0.961 and the +0.18% ensemble gain are observed quantities, not self-defined or statistically forced outputs. The study relies on published TFMs and public datasets rather than any self-citation chain or ansatz, satisfying the criteria for a self-contained empirical result.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that Q-statistic adequately bounds ensemble performance and that the chosen benchmark suite is representative; no free parameters or new entities are introduced.

axioms (1)

Domain assumption: a mean pairwise Q-statistic near 1 implies that convex combinations cannot exceed the best base model by more than a small margin. Invoked to conclude that ensembles are bounded above.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is 0.961... any convex combination is bounded above."

IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tag: unclear
Paper passage: "The best ensemble, two-level cascade stacking, buys +0.18% accuracy... at 253× the compute."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

TabPFN: A transformer that solves small tabular classification problems in a second

Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. InInternational Conference on Learning Representations (ICLR), 2023

work page 2023

[2] [2]

Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025

work page 2025

[3] [3]

TabICL: A tabular foundation model for in-context learning on large data

Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICL: A tabular foundation model for in-context learning on large data. InInternational Conference on Machine Learning (ICML), 2025

work page 2025

[4] [4]

Mitra: Mixed synthetic priors for enhancing tabular foundation models.arXiv preprint arXiv:2510.21204, 2025b

Xiyuan Zhang, Danielle C. Maddix, Junming Yin, Nick Erickson, Abdul Fatir Ansari, Boran Han, Shuai Zhang, Leman Akoglu, Christos Faloutsos, Michael W. Mahoney, Cuixiong Hu, Huzefa Rangwala, George Karypis, and Bernie Wang. Mitra: Mixed synthetic priors for enhancing tabular foundation models.arXiv preprint arXiv:2510.21204, 2025

work page arXiv 2025

[5] [5]

Orion-Bix: Bi-Axial attention for tabular in-context learning

Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-Axial attention for tabular in-context learning. InProceedings of the ACM Web Conference 2026, WWW ’26, New York, NY , USA,

work page 2026

[6] [6]

Association for Computing Machinery

work page

[7] [7]

TabDPT: Scaling tabular foundation models on real data.arXiv preprint arXiv:2410.18164,

Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Hamidreza Kamkari, Alex Labach, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Maksims V olkovs, and Anthony L Caterini. TabDPT: Scaling tabular foundation models. arXiv preprint arXiv:2410.18164, 2024

work page arXiv 2024

[8] [8]

Carte: Pretraining and transfer for tabular learning, 2024

Myung Jun Kim, Léo Grinsztajn, and Gaël Varoquaux. Carte: Pretraining and transfer for tabular learning, 2024

work page 2024

[9] [9]

Real-tabpfn: Im- proving tabular foundation models via continued pre-training with real-world data.arXiv preprint arXiv:2507.03971,

Anurag Garg, Muhammad Ali, Noah Hollmann, Lennart Purucker, Samuel Müller, and Frank Hutter. Real- TabPFN: Improving tabular foundation models via continued pre-training with real-world data.arXiv preprint arXiv:2507.03971, 2025

work page arXiv 2025

[10] [10]

TabArena: A living benchmark for machine learning on tabular data

Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. TabArena: A living benchmark for machine learning on tabular data. InAdvances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025

work page 2025

[11] [11]

Bagging predictors.Machine Learning, 24:123–140, 1996

Leo Breiman. Bagging predictors.Machine Learning, 24:123–140, 1996

work page 1996

[12] [12]

Dietterich

Thomas G. Dietterich. Ensemble methods in machine learning. InMultiple Classifier Systems, volume 1857 of Lecture Notes in Computer Science, pages 1–15. Springer, 2000

work page 2000

[13] [13]

Ensemble selection from libraries of models

Rich Caruana, Alexandru Niculescu-Mizil, Geoff Crew, and Alex Ksikes. Ensemble selection from libraries of models. InInternational Conference on Machine Learning (ICML), 2004

work page 2004

[14] [14]

Transformers can do bayesian inference

Samuel Müller, Noah Hollmann, Sebastian Pineda Arango, Josif Grabocka, and Frank Hutter. Transformers can do bayesian inference. InInternational Conference on Learning Representations, 2022

work page 2022

[15] [15]

What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization

Yufeng Zhang, Fengzhuo Zhang, Zhuoran Yang, and Zhaoran Wang. What and how does in-context learning learn? bayesian model averaging, parameterization, and generalization. In Yingzhen Li, Stephan Mandt, Shipra 6 Ensembling Tabular Foundation Models Agrawal, and Emtiyaz Khan, editors,Proceedings of The 28th International Conference on Artificial Intelligenc...

work page 2025

[16] [16]

HAPEns: Hardware-aware post-hoc ensembling for tabular data.arXiv preprint arXiv:2603.10582, 2026

Jannis Maier and Lennart Purucker. HAPEns: Hardware-aware post-hoc ensembling for tabular data.arXiv preprint arXiv:2603.10582, 2026

work page arXiv 2026

[17] [17]

TabM: Advancing tabular deep learning with parameter- efficient ensembling

Yury Gorishniy, Akim Kotelnikov, and Artem Babenko. TabM: Advancing tabular deep learning with parameter- efficient ensembling. InInternational Conference on Learning Representations (ICLR), 2025

work page 2025

[18] [18]

Exploring fine-tuning for tabular foundation models

Aditya Tanna, Pratinav Seth, Mohamed Bouadi, and Vinay Kumar Sankarapu. Exploring fine-tuning for tabular foundation models. InProceedings of the ACM Web Conference 2026, WWW ’26, New York, NY , USA, 2026. Association for Computing Machinery

work page 2026

[19] [19]

David H. Wolpert. Stacked generalization.Neural Networks, 5(2):241–259, 1992

work page 1992

[20] [20]

Kai Ming Ting and Ian H. Witten. Issues in stacked generalization.Journal of Artificial Intelligence Research, 10:271–289, 1999

work page 1999

[21] [21]

Kuncheva and Christopher J

Ludmila I. Kuncheva and Christopher J. Whitaker. Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy.Machine Learning, 51(2):181–207, 2003

work page 2003

[22] [22]

Weinberger

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In International Conference on Machine Learning (ICML), 2017

work page 2017

[23] [23]

Predicting good probabilities with supervised learning

Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, volume 119 ofACM International Conference Proceeding Series, pages 625–632. ACM, 2005

work page 2005

[24] [24]

Selective classification for deep neural networks

Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2017

work page 2017

[25] [25]

Hashimoto, and Percy Liang

Shiori Sagawa, Pang Wei Koh, Tatsunori B. Hashimoto, and Percy Liang. Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization. InInternational Conference on Learning Representations (ICLR), 2020

[26]

Nick Erickson, Jonas Mueller, Alexander Shirkov, Hang Zhang, Pedro Larroy, Mu Li, and Alexander Smola. AutoGluon-Tabular: Robust and accurate AutoML for structured data. arXiv preprint arXiv:2003.06505, 2020

[27]

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017

[28]

Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models. arXiv preprint arXiv:2511.02802, 2025

[29]

Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites. In Advances in Neural Information Processing Systems (NeurIPS) Track on Datasets and Benchmarks, 2021

[30]

Si-Yang Liu, Hao-Run Cai, Qi-Le Zhou, Huai-Hong Yin, Tao Zhou, Jun-Peng Jiang, and Han-Jia Ye. Talent: A tabular analytics and learning toolbox. Journal of Machine Learning Research, 26(226):1–16, 2025

[31]

Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Benjamin Feuer, Chinmay Hegde, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data?, 2023

[32]

TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models

Léo Grinsztajn, Klemens Floge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jager, Dominik Safaric, Simone Alessi, Adrian Hayler, Mihir Manium, Rose Yu, Felix Jablonski, Shi Bin Hoo, Anurag Garg, Jake Robertson, Magnus Buhler, Vladyslav Moroshan, Lennart Purucker, Clara Cornu, Lilly Charlotte Wehrhahn, Alessandro Bonetto, Bernhard Scholk...

[33]

Xingxuan Zhang et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025

[34]

Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning. arXiv preprint arXiv:2511.02818, 2025

[35]

Glenn W. Brier. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3, 1950

[36]

Milton Friedman. A comparison of alternative tests of significance for the problem of m rankings. The Annals of Mathematical Statistics, 11(1):86–92, 1940

[37]

Peter Nemenyi. Distribution-Free Multiple Comparisons. PhD thesis, Princeton University, 1963

[38]

Frank Wilcoxon. Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83, 1945

A Method-name glossary

The body and Tables 1 and 3 use compact short-form labels; Figures 3 and 4 render the same methods in long form. Table 2 reconciles the two.

Table 2 Short-form labels used in prose and Tables 1 and 3...
