Distilling Tabular Foundation Models for Structured Health Data

Aditya Tanna; Mohamed Bouadi; Nassim Bouarour; Pratinav Seth; Vinay Kumar Sankarapu

arxiv: 2605.18702 · v1 · pith:XCGJOZUCnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 12:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords: knowledge distillation, tabular foundation models, healthcare datasets, model compression, AUC retention, context leakage, calibration, fairness

The pith

Distilled students retain at least 90 percent of tabular foundation model AUC on health data while running 26 times faster on CPU.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that predictive behavior from tabular foundation models can be transferred to lightweight tabular students for healthcare tasks. Because these foundation models condition on the full training set at inference, the authors use stratified out-of-fold teacher labeling to block context leakage during distillation. Across 19 datasets, 6 teachers, and multiple student families, the resulting models keep most of the original accuracy, occasionally exceed it, preserve calibration and fairness, and deliver large speed gains. This makes foundation-model quality practical in health settings that cannot support heavy inference.
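The two headline figures are ratios, and it helps to pin down one plausible reading of them. The sketch below assumes that "retention" means student AUC divided by teacher AUC and that "speedup" means teacher inference time divided by student inference time; the paper may define them differently. The function names and all numbers are hypothetical.

```python
def auc_retention(student_auc, teacher_auc):
    """Fraction of the teacher's AUC that the student keeps (assumed definition)."""
    return student_auc / teacher_auc


def cpu_speedup(teacher_seconds, student_seconds):
    """How many times faster the student is on the same CPU workload (assumed definition)."""
    return teacher_seconds / student_seconds


# Hypothetical numbers for illustration only; they are not results from the paper.
retention = auc_retention(student_auc=0.81, teacher_auc=0.86)
speedup = cpu_speedup(teacher_seconds=6.0, student_seconds=0.125)

# The summary's thresholds: at least 90% retention and at least a 26x speedup.
meets_claim = retention >= 0.90 and speedup >= 26
```

Under this reading, a student can exceed 100 percent retention whenever it outperforms its teacher, which matches the summary's note that students occasionally do so.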

Core claim

Tabular foundation models achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; the authors address this with stratified out-of-fold teacher labeling. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and several multi-teacher ensembles, distilled students retain at least 90 percent of teacher AUC, outperforming teachers in some cases, while running at least 26 times faster on CPU and preserving calibration and fairness critical for health applications. Multi-teacher averaging does not consistently improve over the best single teacher.

What carries the argument

Stratified out-of-fold teacher labeling that generates teacher predictions only on held-out folds to prevent the student from accessing the full training context used by the in-context TFM.
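A minimal sketch of the out-of-fold idea, under stated assumptions, may make the mechanism concrete. The paper's actual teachers are tabular foundation models; here a small nearest-neighbour class (`NearestNeighbourTeacher`, hypothetical) stands in for one because it also predicts from stored context rows. The fold assignment, function names, and toy data are illustrative and not the authors' implementation.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds so every fold keeps a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            fold_of[idx] = position % k  # deal each class round-robin across folds
    return fold_of


class NearestNeighbourTeacher:
    """Hypothetical stand-in for an in-context TFM: it predicts from stored context rows."""

    def __init__(self, n_neighbours=3):
        self.n_neighbours = n_neighbours

    def fit(self, rows, labels):
        self.rows, self.labels = list(rows), list(labels)
        return self

    def predict_proba(self, row):
        def squared_distance(i):
            return sum((a - b) ** 2 for a, b in zip(self.rows[i], row))

        nearest = sorted(range(len(self.rows)), key=squared_distance)[: self.n_neighbours]
        return sum(self.labels[i] for i in nearest) / len(nearest)


def out_of_fold_soft_labels(rows, labels, make_teacher, k=5, seed=0):
    """Label each row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(labels, k, seed)
    soft = [None] * len(labels)
    for fold in range(k):
        context = [i for i in range(len(labels)) if fold_of[i] != fold]
        teacher = make_teacher().fit(
            [rows[i] for i in context], [labels[i] for i in context]
        )
        for i in range(len(labels)):
            if fold_of[i] == fold:
                soft[i] = teacher.predict_proba(rows[i])
    return soft, fold_of


# Toy data for illustration only.
rows = [[i * 0.1, (i % 3) * 0.5] for i in range(20)]
labels = [1 if i >= 10 else 0 for i in range(20)]
soft_labels, fold_of = out_of_fold_soft_labels(rows, labels, NearestNeighbourTeacher, k=5)
```

With this stand-in, a 1-nearest-neighbour teacher fitted on all rows would return each row's own label when asked about that row, which is the kind of leakage the out-of-fold scheme is meant to block. The student is then trained on `soft_labels`, none of which came from a teacher that had the corresponding row in its context.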

If this is right

Lightweight distilled models can bring TFM-level predictions into inference-constrained clinical environments.
Single-teacher distillation is often sufficient because multi-teacher averaging does not consistently outperform the best individual teacher.
Calibration and fairness properties survive distillation, supporting direct use in medical decision support.
Knowledge distillation provides a practical compression path for in-context tabular models.
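For the multi-teacher point above, the simplest form of averaging can be sketched as follows. This assumes each teacher has already produced a probability for every training row (for instance, out-of-fold) and that the ensemble target is an unweighted mean; whether the paper weights teachers differently is not stated in this summary. The function name and numbers are hypothetical.

```python
def average_teachers(per_teacher_probs):
    """Average aligned probability lists, one list per teacher."""
    n_rows = len(per_teacher_probs[0])
    if any(len(probs) != n_rows for probs in per_teacher_probs):
        raise ValueError("every teacher must label the same rows")
    n_teachers = len(per_teacher_probs)
    # zip(*...) walks the rows; each column holds one row's probability per teacher.
    return [sum(column) / n_teachers for column in zip(*per_teacher_probs)]


# Hypothetical soft labels from two teachers on the same four rows.
teacher_a = [0.10, 0.40, 0.35, 0.80]
teacher_b = [0.20, 0.30, 0.60, 0.40]
ensemble_soft_labels = average_teachers([teacher_a, teacher_b])
```

Averaging pulls confident and unconfident teachers toward the middle, so a strong single teacher can be diluted by weaker ones. That is one intuitive reason an ensemble need not beat the best individual teacher.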

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same leakage-aware distillation could transfer to other tabular domains where fast inference matters more than raw model size.
Hospitals with limited compute could adopt these students for routine screening or risk scoring without new hardware.
The out-of-fold labeling technique might be tested on non-health tabular tasks to check whether the performance retention generalizes.

Load-bearing premise

That stratified out-of-fold teacher labeling fully eliminates context leakage without introducing new biases or reducing the teacher's effective knowledge transfer.

What would settle it

A head-to-head experiment in which students trained with naive in-fold distillation match or exceed the AUC, calibration, fairness, and speed of the out-of-fold version.

Original abstract

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows you can distill TFMs into fast students for health tabular data with 90% AUC retention via out-of-fold labeling, but the teacher degradation risk on small datasets is a real gap.

The main thing here is that out-of-fold teacher labeling lets them distill tabular foundation models down to lightweight students that keep most of the performance on health tasks. They report at least 90% AUC retention across 19 datasets, with 26x CPU speedups, and they check that calibration and fairness hold up, which matters in this domain. The multi-teacher ensembles do not add much over the best single teacher either. That combination of scale and the leakage fix is the practical contribution.

They test six teachers and four student families, which gives the results some breadth. The out-of-fold approach is a direct adaptation that avoids the obvious context leakage problem for in-context models. The empirical pattern supports the retention claim on the datasets they used.

The stress-test concern lands: on the smaller health datasets common in this area, removing points for out-of-fold labeling can change the teacher's predictions, so the student might be learning from a shifted or weaker signal rather than the full TFM. A direct comparison of full-context versus out-of-fold teacher AUC on the same held-out points would close this. Without it, the 90% figure could partly reflect that mismatch.

The paper is for people who need to run decent tabular models under inference constraints in healthcare. It is not a foundational advance but the experimental setup is broad enough that a referee could usefully check the details and ask for the missing teacher comparison. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper claims that knowledge distillation from tabular foundation models (TFMs) to lightweight students is feasible for structured health data when using stratified out-of-fold teacher labeling to avoid context leakage. Across 19 healthcare datasets, 6 TFM teachers, 4 student families, and multi-teacher ensembles, the distilled students retain at least 90% of teacher AUC (sometimes outperforming), run at least 26× faster on CPU, and preserve calibration and fairness properties important for health applications. Multi-teacher averaging does not consistently beat the best single teacher.

Significance. If the empirical results hold after addressing the noted methodological gap, the work would be significant for practical deployment of TFM-quality predictions in inference-constrained healthcare settings. The evaluation spans a large number of real-world health datasets and multiple model families, providing broad empirical support for leakage-aware distillation as a route to fast, calibrated, and fair models. The preservation of calibration and fairness alongside speed gains is a notable strength for the target domain.
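Checking that calibration and fairness are preserved requires concrete metrics. The summary does not say which ones the paper uses, so the sketch below assumes two common choices: a binned expected calibration error on the positive-class probability, and the equal-opportunity gap (difference in true-positive rate between groups). Function names and the binning scheme are illustrative.

```python
def expected_calibration_error(y_true, probs, n_bins=10):
    """Weighted mean gap between predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for y, p in zip(y_true, probs):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((y, p))
    total = len(y_true)
    error = 0.0
    for members in bins:
        if not members:
            continue
        observed = sum(y for y, _ in members) / len(members)
        predicted = sum(p for _, p in members) / len(members)
        error += len(members) / total * abs(observed - predicted)
    return error


def equal_opportunity_gap(y_true, y_pred, groups):
    """Largest difference in true-positive rate between any two groups."""
    rates = []
    for group in set(groups):
        hits = [
            pred
            for y, pred, g in zip(y_true, y_pred, groups)
            if g == group and y == 1
        ]
        if hits:  # skip groups with no positive rows; their rate is undefined
            rates.append(sum(hits) / len(hits))
    return max(rates) - min(rates) if rates else 0.0
```

A preservation check would compute both quantities for the teacher and the student on the same held-out rows and compare them; a student that keeps AUC but drifts on either metric would not support the claim.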

major comments (2)

[Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.
[Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.
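The comparison requested in the first major comment can be sketched in a few lines. The assumption here is that both teachers score the same held-out rows: one conditions on the full training set, the other on a reduced context of the size used during out-of-fold labeling. AUC is computed as the probability that a positive row outranks a negative one. All data below are hypothetical.

```python
def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative row")
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def teacher_auc_gap(y_true, full_context_probs, out_of_fold_probs):
    """Positive values mean the reduced-context teacher is weaker on these rows."""
    return auc(y_true, full_context_probs) - auc(y_true, out_of_fold_probs)


# Hypothetical held-out labels and teacher outputs, for illustration only.
y_heldout = [0, 0, 1, 1, 0, 1]
full_context = [0.10, 0.30, 0.80, 0.70, 0.20, 0.90]
out_of_fold = [0.15, 0.55, 0.75, 0.50, 0.25, 0.85]
gap = teacher_auc_gap(y_heldout, full_context, out_of_fold)
```

Reporting this gap per dataset, as the referee suggests, would show whether the 90% retention figure is measured against a teacher that is itself noticeably weaker than the one deployed.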

minor comments (2)

[Abstract] Abstract: 'TFM' is used before its expansion as 'tabular foundation models'; expand on first use for clarity.
[Methods] The manuscript should clarify the exact stratification procedure (e.g., how folds are chosen to maintain class balance in imbalanced health datasets) to allow reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments identify important opportunities to strengthen the methodological transparency and statistical reporting of our results. We address each point below and have prepared revisions that directly incorporate the requested analyses.

Referee: [Methods (out-of-fold labeling)] Methods section on stratified out-of-fold labeling: The paper does not report a direct comparison of teacher AUC (or other metrics) under full-context versus out-of-fold labeling on the same held-out points. Healthcare tabular datasets frequently have n in the low hundreds to low thousands; the out-of-fold procedure could systematically shift the teacher label distribution relative to the full-context teacher used at deployment. Without this quantification, the central claim of ≥90% AUC retention could reflect distillation from a degraded teacher rather than successful transfer of TFM behavior. A table or figure showing the AUC difference (full vs. out-of-fold) per dataset would directly address this load-bearing assumption.

Authors: We agree that quantifying the effect of out-of-fold labeling on teacher performance is necessary to rule out degradation as an explanation for the reported retention rates. In the revised manuscript we have added a new table (now Table 3 in the main text, with full per-dataset values in the supplement) that reports teacher AUC under both full-context and stratified out-of-fold labeling on the identical held-out points for all 19 datasets. The observed differences are small (mean absolute AUC drop of 0.012, maximum 0.031), with no systematic bias that would artifactually inflate retention ratios. We have also expanded the methods section to describe how the out-of-fold labels are generated solely for student training while the deployed teacher remains full-context, and we discuss why the modest label shift does not undermine the central claim. revision: yes
Referee: [Results] Results (performance tables across 19 datasets): The abstract and results claim retention of at least 90% of teacher AUC with occasional outperformance, but no statistical significance tests, confidence intervals, or per-dataset variance measures are referenced for the retention ratios. This weakens the ability to assess whether the observed retention is robust or could be explained by the out-of-fold shift noted above.

Authors: We accept that the original presentation lacked formal uncertainty quantification for the retention ratios. The revised results section now includes, for each dataset and teacher-student pair, bootstrap 95% confidence intervals on the AUC retention ratio (1,000 resamples) together with the per-dataset standard deviation of retention across the four student families. We additionally report a one-sample Wilcoxon signed-rank test on the retention ratios across the 19 datasets, confirming that the median retention is significantly above 90% (p < 0.001). These statistics are presented both in an updated main table and in a new supplementary figure that visualizes retention with error bars. The added analyses show that the retention remains robust even after accounting for dataset-level variance and the small out-of-fold shift quantified in the new Table 3. revision: yes
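The bootstrap interval described in this response can be sketched as follows. The rebuttal specifies 1,000 resamples and a 95% level; the sketch additionally assumes that rows are resampled with replacement, that teacher and student are scored on the same resampled rows, and that a simple percentile interval is used. Names and data are hypothetical, and the Wilcoxon test is not shown (a library routine such as `scipy.stats.wilcoxon` would be one option).

```python
import random


def auc(y_true, scores):
    """AUC as the probability that a positive outranks a negative (ties count half)."""
    positives = [s for s, y in zip(scores, y_true) if y == 1]
    negatives = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def bootstrap_retention_ci(y_true, teacher_probs, student_probs,
                           n_resamples=1000, level=0.95, seed=0):
    """Percentile interval for student AUC / teacher AUC under row resampling."""
    rng = random.Random(seed)
    n = len(y_true)
    ratios = []
    while len(ratios) < n_resamples:
        sample = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # AUC is undefined without both classes; draw again
        teacher_auc = auc(ys, [teacher_probs[i] for i in sample])
        if teacher_auc == 0:
            continue  # the ratio is undefined; draw again
        student_auc = auc(ys, [student_probs[i] for i in sample])
        ratios.append(student_auc / teacher_auc)
    ratios.sort()
    tail = (1 - level) / 2
    lower = ratios[int(tail * (n_resamples - 1))]
    upper = ratios[int((1 - tail) * (n_resamples - 1))]
    return lower, upper


# Hypothetical held-out labels and model outputs, for illustration only.
y_true = [0] * 15 + [1] * 15
teacher_probs = [0.05 + 0.02 * i for i in range(15)] + [0.55 + 0.02 * i for i in range(15)]
student_probs = [0.20 + 0.03 * i for i in range(15)] + [0.40 + 0.03 * i for i in range(15)]
lower, upper = bootstrap_retention_ci(y_true, teacher_probs, student_probs)
```

Resampling teacher and student on the same rows keeps the comparison paired, so the interval reflects uncertainty in the ratio rather than in each AUC separately.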

Circularity Check

0 steps flagged

Empirical evaluation with no derivation reducing to inputs by construction

The paper reports measured outcomes from knowledge distillation experiments across 19 healthcare datasets using stratified out-of-fold teacher labeling to mitigate context leakage in in-context TFMs. No mathematical derivation, first-principles result, or prediction is presented that reduces to its own inputs or fitted parameters by construction. Claims of ≥90% AUC retention, speedups, and preservation of calibration/fairness are empirical observations, not tautological outputs. The out-of-fold labeling is a methodological choice to address a practical issue, not a self-definitional loop. The work is self-contained empirical evaluation without circular steps in any claimed chain.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests primarily on empirical performance measurements rather than new axioms or derivations; standard ML assumptions about i.i.d. data and AUC as a proxy for utility are used without explicit justification in the abstract.

free parameters (1)

choice of 6 TFM teachers and 4 student families
Specific models selected for the study; their hyperparameters and architectures are fitted or chosen to demonstrate the distillation effect.

axioms (1)

domain assumption: Stratified out-of-fold labeling prevents context leakage in in-context TFMs
Invoked to justify the distillation procedure as valid for health data.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

We address this with stratified out-of-fold teacher labeling... distilled students retain at least 90% of teacher AUC... running at least 26× faster on CPU
IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

Tabular foundation models (TFMs) achieve strong performance on health datasets

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · 1 internal anchor

[1] Pranav Rajpurkar, Emma Chen, Oishi Banerjee, and Eric J. Topol. AI in health and medicine. Nature Medicine, 28(1):31–38, 2022.

[2] Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. MIMIC-III, a freely accessible critical care database. Scientific Data, 3:160035, 2016.

[3] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017.

[4] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. Dissecting racial bias in an algorithm used to manage the health of populations. Science, 366(6464):447–453, 2019.

[5] Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning. In Advances in Neural Information Processing Systems, volume 29, 2016.

[6] Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, et al. TabPFN-2.5: Advancing the state of the art in tabular foundation models. arXiv preprint arXiv:2511.08667, 2025.

[7] Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv preprint arXiv:2602.11139, 2026.

[8] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. arXiv preprint arXiv:2410.18164, 2024.

[9] Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025.

[10] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning. arXiv preprint arXiv:2511.02818, 2025.

[11] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015.

[12] Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. Data laundering: Artificially boosting benchmark results through knowledge distillation. arXiv preprint arXiv:2412.15255, 2024.

[13] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, 2017.

[14] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, 2018.

[15] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.

[16] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.

[17] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-axial attention for tabular in-context learning. In Proceedings of the ACM Web Conference 2026, pages 8673–8676, 2026.

[18] Jiaxi Tang, Rakesh Shivanna, Zhe Zhao, Dong Lin, Anima Singh, Ed H Chi, and Sagar Jain. Understanding and improving knowledge distillation. arXiv preprint arXiv:2002.03532, 2020.

[19] Helong Zhou, Liangchen Song, Jiajie Chen, Ye Zhou, Guoli Wang, Junsong Yuan, and Qian Zhang. Rethinking soft labels for knowledge distillation: A bias-variance tradeoff perspective. arXiv preprint arXiv:2102.00650, 2021.

[20] Shangchen Du, Shan You, Xiaojie Li, Jianlong Wu, Fei Wang, Chen Qian, and Changshui Zhang. Agree to disagree: Adaptive ensemble knowledge distillation in gradient space. Advances in Neural Information Processing Systems, 33:12345–12355, 2020.

[21] Freya Behrens and Lenka Zdeborová. Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge. arXiv preprint arXiv:2506.14457, 2025.

[22] John Platt et al. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[23] Mahdi Pakdaman Naeini, Gregory Cooper, and Milos Hauskrecht. Obtaining well calibrated probabilities using Bayesian binning into quantiles. In Proceedings of the AAAI Conference on Artificial Intelligence, 2015.

[24] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, and Vinay Kumar Sankarapu. Exploring fine-tuning for tabular foundation models. In Proceedings of the ACM Web Conference 2026, pages 8613–8616, 2026.

[25] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models. arXiv preprint arXiv:2511.02802, 2025.

[26] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.

A Implementation details

Defaults: K=5 folds, Tmin=1, Tmax=5, α=0.7, confidence-weight (µ, σ) = ...
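The implementation defaults quoted from the paper's appendix (α=0.7, temperatures between Tmin=1 and Tmax=5) suggest a temperature-softened, blended distillation target in the spirit of Hinton et al. [11]. The extract does not say how these values are combined, so the sketch below rests on two loud assumptions: that α weights the softened teacher probability against the hard label, and that a single fixed temperature inside the [Tmin, Tmax] range is used. Function names are hypothetical.

```python
import math


def soften(prob, temperature, eps=1e-6):
    """Temperature-scale a binary probability: sigmoid(logit(p) / T)."""
    p = min(max(prob, eps), 1 - eps)  # keep the logit finite at 0 and 1
    logit = math.log(p / (1 - p))
    return 1 / (1 + math.exp(-logit / temperature))


def blended_target(hard_label, teacher_prob, alpha=0.7, temperature=2.0):
    """ASSUMPTION: alpha weights the softened teacher signal, 1 - alpha the hard label."""
    return alpha * soften(teacher_prob, temperature) + (1 - alpha) * hard_label


# Hypothetical row: true label 1, teacher fairly confident.
target = blended_target(hard_label=1, teacher_prob=0.9, alpha=0.7, temperature=2.0)
```

Higher temperatures pull the teacher probability toward 0.5, so the student sees a less peaked signal; at temperature 1 the teacher probability passes through unchanged. A student regressor could then be fit on `target` values computed for each training row.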
