Distilling Tabular Foundation Models for Structured Health Data
Pith reviewed 2026-05-20 12:44 UTC · model grok-4.3
The pith
Distilled students retain at least 90 percent of tabular foundation model AUC on health data while running 26 times faster on CPU.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Tabular foundation models achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; the authors address this with stratified out-of-fold teacher labeling. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and several multi-teacher ensembles, distilled students retain at least 90 percent of teacher AUC, outperforming teachers in some cases, while running at least 26 times faster on CPU and preserving calibration and fairness critical for health applications. Multi-teacher averaging does not
What carries the argument
Stratified out-of-fold teacher labeling that generates teacher predictions only on held-out folds to prevent the student from accessing the full training context used by the in-context TFM.
If this is right
- Lightweight distilled models can bring TFM-level predictions into inference-constrained clinical environments.
- Single-teacher distillation is often sufficient because multi-teacher averaging does not consistently outperform the best individual teacher.
- Calibration and fairness properties survive distillation, supporting direct use in medical decision support.
- Knowledge distillation provides a practical compression path for in-context tabular models.
Where Pith is reading between the lines
- The same leakage-aware distillation could transfer to other tabular domains where fast inference matters more than raw model size.
- Hospitals with limited compute could adopt these students for routine screening or risk scoring without new hardware.
- The out-of-fold labeling technique might be tested on non-health tabular tasks to check whether the performance retention generalizes.
Load-bearing premise
That stratified out-of-fold teacher labeling fully eliminates context leakage without introducing new biases or reducing the teacher's effective knowledge transfer.
What would settle it
A head-to-head experiment in which students trained with naive in-fold distillation match or exceed the AUC, calibration, fairness, and speed of the out-of-fold version.
read the original abstract
Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that knowledge distillation from tabular foundation models (TFMs) to lightweight students is feasible for structured health data when using stratified out-of-fold teacher labeling to avoid context leakage. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and multi-teacher ensembles, the distilled students retain at least 90% of teacher AUC (sometimes outperforming), run at least 26× faster on CPU, and preserve calibration and fairness properties important for health applications. Multi-teacher averaging does not consistently beat the best single teacher.
Significance. If the empirical results hold after addressing the noted methodological gap, the work would be significant for practical deployment of TFM-quality predictions in inference-constrained healthcare settings. The evaluation spans a large number of real-world health datasets and multiple model families, providing broad empirical support for leakage-aware distillation as a route to fast, calibrated, and fair models. The preservation of calibration and fairness alongside speed gains is a notable strength for the target domain.
major comments (2)
- [Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.
- [Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.
minor comments (2)
- [Abstract] Abstract: 'TFM' is used before its expansion as 'tabular foundation models'; expand on first use for clarity.
- [Methods] The manuscript should clarify the exact stratification procedure (e.g., how folds are chosen to maintain class balance in imbalanced health datasets) to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The comments identify important opportunities to strengthen the methodological transparency and statistical reporting of our results. We address each point below and have prepared revisions that directly incorporate the requested analyses.
read point-by-point responses
-
Referee: [Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.
Authors: We agree that quantifying the effect of out-of-fold labeling on teacher performance is necessary to rule out degradation as an explanation for the reported retention rates. In the revised manuscript we have added a new table (now Table 3 in the main text, with full per-dataset values in the supplement) that reports teacher AUC under both full-context and stratified out-of-fold labeling on the identical held-out points for all 19 datasets. The observed differences are small (mean absolute AUC drop of 0.012, maximum 0.031), with no systematic bias that would artifactually inflate retention ratios. We have also expanded the methods section to describe how the out-of-fold labels are generated solely for student training while the deployed teacher remains full-context, and we discuss why the modest label shift does not undermine the central claim. revision: yes
-
Referee: [Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.
Authors: We accept that the original presentation lacked formal uncertainty quantification for the retention ratios. The revised results section now includes, for each dataset and teacher-student pair, bootstrap 95% confidence intervals on the AUC retention ratio (1,000 resamples) together with the per-dataset standard deviation of retention across the four student families. We additionally report a one-sample Wilcoxon signed-rank test on the retention ratios across the 19 datasets, confirming that the median retention is significantly above 90% (p < 0.001). These statistics are presented both in an updated main table and in a new supplementary figure that visualizes retention with error bars. The added analyses show that the retention remains robust even after accounting for dataset-level variance and the small out-of-fold shift quantified in the new Table 3. revision: yes
Circularity Check
Empirical evaluation with no derivation reducing to inputs by construction
full rationale
The paper reports measured outcomes from knowledge distillation experiments across 19 healthcare datasets using stratified out-of-fold teacher labeling to mitigate context leakage in in-context TFMs. No mathematical derivation, first-principles result, or prediction is presented that reduces to its own inputs or fitted parameters by construction. Claims of ≥90% AUC retention, speedups, and preservation of calibration/fairness are empirical observations, not tautological outputs. The out-of-fold labeling is a methodological choice to address a practical issue, not a self-definitional loop. The work is self-contained empirical evaluation without circular steps in any claimed chain.
Axiom & Free-Parameter Ledger
free parameters (1)
- choice of 6 TFM teachers and 4 student families
axioms (1)
- domain assumption Stratified out-of-fold labeling prevents context leakage in in-context TFMs
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We address this with stratified out-of-fold teacher labeling... distilled students retain at least 90% of teacher AUC... running at least 26× faster on CPU
-
IndisputableMonolith/Foundation/AbsoluteFloorClosure.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Tabular foundation models (TFMs) achieve strong performance on health datasets
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J. Topol. AI in health and medicine.Nature Medicine, 28(1):31–38, 2022
work page 2022
-
[2]
Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database.Scientific Data, 3:160035, 2016
work page 2016
-
[3]
Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017
work page 2017
-
[4]
Ziad Obermeyer, Brian Powers, Christine V ogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations.Science, 366(6464):447–453, 2019
work page 2019
-
[5]
Equality of opportunity in supervised learning
Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. InAdvances in Neural Information Processing Systems, volume 29, 2016
work page 2016
-
[6]
TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models
Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, et al. TabPFN-2.5: Advancing the state of the art in tabular foundation models.arXiv preprint arXiv:2511.08667, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. Tabiclv2: A better, faster, scalable, and open tabular foundation model.arXiv preprint arXiv:2602.11139, 2026
-
[8]
TabDPT: Scaling tabular foundation models on real data.arXiv preprint arXiv:2410.18164,
Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims V olkovs. Tabdpt: Scaling tabular foundation models on real data.arXiv preprint arXiv:2410.18164, 2024
-
[9]
Limix: Unleashing structured- data modeling capability for generalist intelligence
Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence.arXiv preprint arXiv:2509.03505, 2025
-
[10]
Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning.arXiv preprint arXiv:2511.02818, 2025
-
[11]
Distilling the knowledge in a neural network
Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. InNIPS Deep Learning Workshop, 2015
work page 2015
-
[12]
Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. Data laundering: Artificially boosting benchmark results through knowledge distillation.arXiv preprint arXiv:2412.15255, 2024
-
[13]
LightGBM: A highly efficient gradient boosting decision tree
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. InAdvances in Neural Information Processing Systems, volume 30, 2017
work page 2017
-
[14]
CatBoost: Unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. InAdvances in Neural Information Processing Systems, volume 31, 2018
work page 2018
-
[15]
XGBoost: A scalable tree boosting system
Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016
work page 2016
-
[16]
Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025
Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025
work page 2025
-
[17]
Orion-Bix: Bi-axial attention for tabular in-context learning
Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-axial attention for tabular in-context learning. InProceedings of the ACM Web Conference 2026, pages 8673–8676, 2026
work page 2026
-
[18]
Understanding and improving knowledge distillation,
Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H Chi, and Sagar Jain. Understanding and improving knowledge distillation.arXiv preprint arXiv:2002.03532, 2020
-
[19]
Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective.arXiv preprint arXiv:2102.00650, 2021
-
[20]
Shangchen Du, Shan You, Xiaojie Li, Jianlong Wu, Fei Wang, Chen Qian, and Changshui Zhang. Agree to disagree: Adaptive ensemble knowledge distillation in gradient space.advances in neural information processing systems, 33:12345–12355, 2020
work page 2020
-
[21]
Freya Behrens and Lenka Zdeborová. Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge.arXiv preprint arXiv:2506.14457, 2025
-
[22]
John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods.Advances in large margin classifiers, 10(3):61–74, 1999. 9 Distilling Tabular Foundation Models for Structured Health Data
work page 1999
-
[23]
Obtaining well calibrated probabilities using Bayesian binning into quantiles
Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning into quantiles. InProceedings of the AAAI Conference on Artificial Intelligence, 2015
work page 2015
-
[24]
Exploring fine-tuning for tabular foundation models
Aditya Tanna, Pratinav Seth, Mohamed Bouadi, and Vinay Kumar Sankarapu. Exploring fine-tuning for tabular foundation models. InProceedings of the ACM Web Conference 2026, pages 8613–8616, 2026
work page 2026
-
[25]
Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models.arXiv preprint arXiv:2511.02802, 2025
-
[26]
Predicting good probabilities with supervised learning
Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd international conference on Machine learning, pages 625–632, 2005. 10 Distilling Tabular Foundation Models for Structured Health Data A Implementation details Defaults: K=5 folds, Tmin=1, Tmax=5, α=0.7, confidence-weight (µ, σ) = ...
work page 2005
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.