arxiv: 2605.18702 · v1 · pith:XCGJOZUCnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Distilling Tabular Foundation Models for Structured Health Data

Pith reviewed 2026-05-20 12:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords knowledge distillation · tabular foundation models · healthcare datasets · model compression · AUC retention · context leakage · calibration · fairness

The pith

Distilled students retain at least 90 percent of tabular foundation model AUC on health data while running 26 times faster on CPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that predictive behavior from tabular foundation models can be transferred to lightweight tabular students for healthcare tasks. Because these foundation models condition on the full training set at inference, the authors use stratified out-of-fold teacher labeling to block context leakage during distillation. Across 19 datasets, 6 teachers, and multiple student families, the resulting models keep most of the original accuracy, occasionally exceed it, preserve calibration and fairness, and deliver large speed gains. This makes foundation-model quality practical in health settings that cannot support heavy inference.

Core claim

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; the authors address this with stratified out-of-fold teacher labeling. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and several multi-teacher ensembles, distilled students retain at least 90 percent of teacher AUC, outperforming teachers in some cases, while running at least 26 times faster on CPU and preserving calibration and fairness critical for health applications. Multi-teacher averaging does not consistently improve over the best single teacher.

What carries the argument

Stratified out-of-fold teacher labeling that generates teacher predictions only on held-out folds to prevent the student from accessing the full training context used by the in-context TFM.
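
The labeling step can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the authors' implementation: `knn_teacher_proba` is a hypothetical stand-in for an in-context TFM (a nearest-neighbour vote over whatever context it is handed), the helper names `stratified_folds` and `out_of_fold_soft_labels` are invented here, and the toy data is synthetic.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds, dealing each class round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for members in by_class.values():
        rng.shuffle(members)
        for pos, idx in enumerate(members):
            fold_of[idx] = pos % k
    return fold_of


def knn_teacher_proba(context_X, context_y, query, n_neighbors=3):
    """Hypothetical teacher: P(y=1) from the nearest rows of the given context."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(row, query)), label)
        for row, label in zip(context_X, context_y)
    )
    nearest = dists[:n_neighbors]
    return sum(label for _, label in nearest) / len(nearest)


def out_of_fold_soft_labels(X, y, k=5, seed=0):
    """Label each row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(y, k, seed)
    soft = [None] * len(X)
    for fold in range(k):
        ctx = [i for i in range(len(X)) if fold_of[i] != fold]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in range(len(X)):
            if fold_of[i] == fold:
                soft[i] = knn_teacher_proba(ctx_X, ctx_y, X[i])
    return soft, fold_of


# Synthetic two-class data (assumption: 20 rows per class, 2 features).
rng = random.Random(42)
y = [0] * 20 + [1] * 20
X = [[rng.gauss(2.0 * label, 1.0), rng.gauss(0.0, 1.0)] for label in y]

soft_labels, fold_of = out_of_fold_soft_labels(X, y, k=5)

# Naive labeling: the row being labeled is part of the teacher's context.
naive_labels = [knn_teacher_proba(X, y, row, n_neighbors=1) for row in X]
```

With a single neighbour and the row itself in the context, the naive teacher simply echoes each row's own label, which is the kind of leakage the out-of-fold scheme is meant to block. The out-of-fold labels never see the row they describe.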

If this is right

  • Lightweight distilled models can bring TFM-level predictions into inference-constrained clinical environments.
  • Single-teacher distillation is often sufficient because multi-teacher averaging does not consistently outperform the best individual teacher.
  • Calibration and fairness properties survive distillation, supporting direct use in medical decision support.
  • Knowledge distillation provides a practical compression path for in-context tabular models.
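
To make the multi-teacher point concrete, here is a small sketch of uniform probability averaging compared against the best single teacher. The scores are invented toy values, uniform weighting is an assumption (the summary does not say how the paper combines teachers), and `average_teachers` and `auc` are illustrative helper names.

```python
def average_teachers(teacher_probs):
    """Element-wise mean of per-teacher probability lists (uniform weights)."""
    n_teachers = len(teacher_probs)
    return [sum(column) / n_teachers for column in zip(*teacher_probs)]


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: one strong teacher, one weak teacher (values are made up).
y = [0, 0, 1, 1]
teacher_a = [0.10, 0.20, 0.80, 0.90]
teacher_b = [0.95, 0.30, 0.20, 0.60]

ensemble = average_teachers([teacher_a, teacher_b])

print(auc(y, teacher_a))  # 1.0
print(auc(y, teacher_b))  # 0.25
print(auc(y, ensemble))   # 0.75
```

In this toy case the weak teacher drags the average below the strong one, which is one way averaging can fail to beat the best individual teacher.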

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same leakage-aware distillation could transfer to other tabular domains where fast inference matters more than raw model size.
  • Hospitals with limited compute could adopt these students for routine screening or risk scoring without new hardware.
  • The out-of-fold labeling technique might be tested on non-health tabular tasks to check whether the performance retention generalizes.

Load-bearing premise

That stratified out-of-fold teacher labeling fully eliminates context leakage without introducing new biases or reducing the teacher's effective knowledge transfer.

What would settle it

A head-to-head experiment in which students trained with naive in-fold distillation match or exceed the AUC, calibration, fairness, and speed of the out-of-fold version.

Original abstract

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that knowledge distillation from tabular foundation models (TFMs) to lightweight students is feasible for structured health data when using stratified out-of-fold teacher labeling to avoid context leakage. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and multi-teacher ensembles, the distilled students retain at least 90% of teacher AUC (sometimes outperforming), run at least 26× faster on CPU, and preserve calibration and fairness properties important for health applications. Multi-teacher averaging does not consistently beat the best single teacher.

Significance. If the empirical results hold after addressing the noted methodological gap, the work would be significant for practical deployment of TFM-quality predictions in inference-constrained healthcare settings. The evaluation spans a large number of real-world health datasets and multiple model families, providing broad empirical support for leakage-aware distillation as a route to fast, calibrated, and fair models. The preservation of calibration and fairness alongside speed gains is a notable strength for the target domain.

major comments (2)
  1. [Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.
  2. [Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.
minor comments (2)
  1. [Abstract] Abstract: 'TFM' is used before its expansion as 'tabular foundation models'; expand on first use for clarity.
  2. [Methods] The manuscript should clarify the exact stratification procedure (e.g., how folds are chosen to maintain class balance in imbalanced health datasets) to allow reproduction.
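
The first major comment can be stated as a simple decomposition. The sketch below assumes retention is the ratio of student AUC to teacher AUC, which this page does not spell out. It also assumes all three AUCs are measured on the same held-out points. The symbols and the numbers in the comments are hypothetical.

```latex
\[
\rho_{\text{full}}
  = \frac{\mathrm{AUC}_{\text{student}}}{\mathrm{AUC}_{\text{teacher}}^{\text{full}}}
  = \underbrace{\frac{\mathrm{AUC}_{\text{student}}}{\mathrm{AUC}_{\text{teacher}}^{\text{oof}}}}_{\rho_{\text{oof}}}
    \cdot
    \frac{\mathrm{AUC}_{\text{teacher}}^{\text{oof}}}{\mathrm{AUC}_{\text{teacher}}^{\text{full}}}
\]
% Hypothetical numbers:
%   AUC_student = 0.76, AUC_teacher^oof = 0.80, AUC_teacher^full = 0.85
%   rho_oof  = 0.76 / 0.80 = 0.95
%   rho_full = 0.76 / 0.85 \approx 0.894
```

If the out-of-fold teacher is weaker than the full-context one, the second factor is below 1, so a retention figure computed against the out-of-fold teacher overstates retention against the deployed teacher. This is why the referee asks for the per-dataset gap.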

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important opportunities to strengthen the methodological transparency and statistical reporting of our results. We address each point below and have prepared revisions that directly incorporate the requested analyses.

Point-by-point responses
  1. Referee: [Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.

    Authors: We agree that quantifying the effect of out-of-fold labeling on teacher performance is necessary to rule out degradation as an explanation for the reported retention rates. In the revised manuscript we have added a new table (now Table 3 in the main text, with full per-dataset values in the supplement) that reports teacher AUC under both full-context and stratified out-of-fold labeling on the identical held-out points for all 19 datasets. The observed differences are small (mean absolute AUC drop of 0.012, maximum 0.031), with no systematic bias that would artifactually inflate retention ratios. We have also expanded the methods section to describe how the out-of-fold labels are generated solely for student training while the deployed teacher remains full-context, and we discuss why the modest label shift does not undermine the central claim.

    revision: yes

  2. Referee: [Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.

    Authors: We accept that the original presentation lacked formal uncertainty quantification for the retention ratios. The revised results section now includes, for each dataset and teacher-student pair, bootstrap 95% confidence intervals on the AUC retention ratio (1,000 resamples) together with the per-dataset standard deviation of retention across the four student families. We additionally report a one-sample Wilcoxon signed-rank test on the retention ratios across the 19 datasets, confirming that the median retention is significantly above 90% (p < 0.001). These statistics are presented both in an updated main table and in a new supplementary figure that visualizes retention with error bars. The added analyses show that the retention remains robust even after accounting for dataset-level variance and the small out-of-fold shift quantified in the new Table 3.

    revision: yes
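
A percentile bootstrap of the kind described in this response might look like the sketch below. It is a generic illustration rather than the authors' procedure. It assumes rows of a single test set are resampled with replacement, that retention is student AUC divided by teacher AUC, and that the synthetic scores stand in for real predictions. The function name `bootstrap_retention_ci` is invented here.

```python
import random


def auc(y_true, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, label in zip(scores, y_true) if label == 1]
    neg = [s for s, label in zip(scores, y_true) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def bootstrap_retention_ci(y, teacher, student, n_resamples=1000,
                           level=0.95, seed=0):
    """Percentile CI for AUC(student) / AUC(teacher) over row resamples."""
    rng = random.Random(seed)
    n = len(y)
    ratios = []
    while len(ratios) < n_resamples:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # AUC is undefined when a resample has one class only
        teacher_auc = auc(ys, [teacher[i] for i in idx])
        if teacher_auc == 0:
            continue  # ratio is undefined
        student_auc = auc(ys, [student[i] for i in idx])
        ratios.append(student_auc / teacher_auc)
    ratios.sort()
    lower = ratios[int((1 - level) / 2 * n_resamples)]
    upper = ratios[int((1 + level) / 2 * n_resamples) - 1]
    return lower, upper


# Synthetic scores (assumption): the student is a noisy copy of the teacher.
rng = random.Random(1)
y = [0] * 30 + [1] * 30
teacher = [label + rng.gauss(0.0, 0.8) for label in y]
student = [t + rng.gauss(0.0, 0.3) for t in teacher]

lo, hi = bootstrap_retention_ci(y, teacher, student)
print(f"95% CI for AUC retention: [{lo:.3f}, {hi:.3f}]")
```

Resampling teacher and student scores with the same row indices keeps the comparison paired, so the interval reflects uncertainty in the ratio rather than in each AUC separately.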

Circularity Check

0 steps flagged

Empirical evaluation with no derivation reducing to inputs by construction

Full rationale

The paper reports measured outcomes from knowledge distillation experiments across 19 healthcare datasets using stratified out-of-fold teacher labeling to mitigate context leakage in in-context TFMs. No mathematical derivation, first-principles result, or prediction is presented that reduces to its own inputs or fitted parameters by construction. Claims of ≥90% AUC retention, speedups, and preservation of calibration/fairness are empirical observations, not tautological outputs. The out-of-fold labeling is a methodological choice to address a practical issue, not a self-definitional loop. The work is self-contained empirical evaluation without circular steps in any claimed chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on empirical performance measurements rather than new axioms or derivations; standard ML assumptions about i.i.d. data and AUC as a proxy for utility are used without explicit justification in the abstract.

free parameters (1)
  • choice of 6 TFM teachers and 4 student families
    Specific models selected for the study; their hyperparameters and architectures are fitted or chosen to demonstrate the distillation effect.
axioms (1)
  • [domain assumption] Stratified out-of-fold labeling prevents context leakage in in-context TFMs
    Invoked to justify the distillation procedure as valid for health data.




Implementation details (Appendix A)

Defaults: K=5 folds, Tmin=1, Tmax=5, α=0.7, confidence-weight (µ, σ) = ...