Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees
Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3
The pith
Distilling tabular foundation models into gradient-boosted trees via out-of-fold soft labels delivers near-teacher performance at CPU latencies under 2 ms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By using stratified out-of-fold labeling to create non-leaking soft targets from in-context learning tabular foundation models, distillation into XGBoost produces CPU-native students that retain 96.5 percent of teacher macro-mean AUC at 1.9 ms latency across 153 datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, while showing a Wilcoxon-significant edge over CatBoost.
What carries the argument
Stratified out-of-fold (OOF) teacher labeling to generate soft targets that retain inter-class structure without label leakage from in-context learning models.
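A minimal sketch of the stratified OOF idea, assuming a hypothetical stand-in teacher (`toy_icl_teacher`, a kernel-weighted vote over its context) rather than a real tabular foundation model. Each row is scored by a teacher whose context excludes that row's fold, while the leaky variant scores every row with itself in the context. All function names, the bandwidth, and the synthetic data are illustrative assumptions, not the paper's pipeline.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row a fold id so every fold gets a similar class mix."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for rows in by_class.values():
        rng.shuffle(rows)
        for pos, i in enumerate(rows):
            fold_of[i] = pos % k
    return fold_of


def toy_icl_teacher(ctx_X, ctx_y, query_x, n_classes, bandwidth=0.02):
    """Hypothetical ICL stand-in: kernel-weighted vote over the context rows."""
    scores = [1e-9] * n_classes  # tiny prior so the output is always normalisable
    for x, y in zip(ctx_X, ctx_y):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, query_x))
        scores[y] += math.exp(-dist2 / (2 * bandwidth ** 2))
    total = sum(scores)
    return [s / total for s in scores]


def oof_soft_labels(X, y, n_classes, k=5, seed=0):
    """Label each row with a teacher whose context excludes that row's fold."""
    fold_of = stratified_folds(y, k, seed)
    soft = [None] * len(X)
    for fold in range(k):
        ctx = [i for i in range(len(X)) if fold_of[i] != fold]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in range(len(X)):
            if fold_of[i] == fold:
                soft[i] = toy_icl_teacher(ctx_X, ctx_y, X[i], n_classes)
    return soft


def in_fold_soft_labels(X, y, n_classes):
    """Leaky variant: every row is scored with itself in the context."""
    return [toy_icl_teacher(X, y, x, n_classes) for x in X]


# Two overlapping 1-D classes, so an honest posterior is rarely one-hot.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0)] for _ in range(100)] + [[rng.gauss(1.0, 1.0)] for _ in range(100)]
y = [0] * 100 + [1] * 100

oof = oof_soft_labels(X, y, n_classes=2)
leaky = in_fold_soft_labels(X, y, n_classes=2)
agree_oof = sum(p[t] for p, t in zip(oof, y)) / len(y)
agree_leaky = sum(p[t] for p, t in zip(leaky, y)) / len(y)
print(f"mean prob on true label: OOF={agree_oof:.3f}, in-fold={agree_leaky:.3f}")
```

On this toy data the in-fold teacher puts more mass on the true label because each query finds its own label in the context, which is the drift toward one-hot targets that the OOF step is meant to avoid.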
If this is right
- Teacher rank transfers exactly to student rank across datasets.
- Gains over CatBoost concentrate on low-dimensional data with fewer than 21 features.
- Multi-teacher averaging improves MLP students but adds negligible value for tree students.
- Distillation degrades performance on high-dimensional tasks where the teacher already trails CatBoost.
Where Pith is reading between the lines
- This method could extend tabular foundation model use to GPU-free environments like mobile or embedded systems.
- The OOF soft label approach might improve distillation in other domains where label leakage occurs in self-supervised or in-context setups.
- Testing on regression tasks or with different student architectures could reveal broader applicability of the leakage-prevention technique.
Load-bearing premise
Stratified out-of-fold teacher labeling prevents label leakage in ICL models and preserves sufficient inter-class structure in the soft targets for effective distillation to tree students.
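A short derivation of why leaked, one-hot targets leave nothing to distill. The notation and the exact form of the mixed loss are assumptions (temperature scaling is omitted for brevity); the page only quotes the paper as saying that under a Hinton-style mixed loss the KL term collapses to one-hot cross-entropy.

```latex
% Assumed notation: p = teacher soft target, q = student distribution,
% y = true class, e_y = one-hot vector for y, \alpha \in [0,1] a mixing weight.
\mathcal{L}(q) = \alpha\, \mathrm{KL}(p \,\|\, q) + (1-\alpha)\, \mathrm{CE}(e_y, q),
\qquad
\mathrm{KL}(p \,\|\, q) = \sum_c p_c \log \frac{p_c}{q_c}.

% If leakage drives the teacher to p = e_y (with the convention 0 \log 0 = 0):
\mathrm{KL}(e_y \,\|\, q) = \log \frac{1}{q_y} = -\log q_y = \mathrm{CE}(e_y, q)
\quad\Longrightarrow\quad
\mathcal{L}(q) = \mathrm{CE}(e_y, q).
```

In that limit the student sees only the hard label, so the distilled model is just a supervised model under another name.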
What would settle it
Running the distillation pipeline with standard in-fold instead of out-of-fold teacher labeling and observing near-one-hot soft targets that lead to students with AUC far below 96 percent of the teacher.
Original abstract
A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
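To make the distillation step concrete, here is a minimal sketch of fitting a boosted-tree student to soft teacher probabilities for a binary task. It is deliberately not XGBoost: it boosts depth-1 regression stumps on the first-order gradient of the soft-label cross-entropy, `teacher_prob - sigmoid(score)`, in plain Python. The helper names, the learning rate, the number of rounds, and the synthetic teacher probabilities are all assumptions for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_cross_entropy(probs, targets):
    """Mean binary cross-entropy of predicted probabilities against soft targets."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for p, t in zip(probs, targets)) / len(targets)


def fit_stump(X, residuals):
    """Least-squares threshold split on one feature; assumes some feature varies."""
    n, d = len(X), len(X[0])
    total = sum(residuals)
    best = None
    for j in range(d):
        order = sorted(range(n), key=lambda i: X[i][j])
        left_sum = 0.0
        for rank in range(1, n):
            left_sum += residuals[order[rank - 1]]
            lo, hi = X[order[rank - 1]][j], X[order[rank]][j]
            if lo == hi:
                continue  # cannot split between identical values
            right_sum = total - left_sum
            # Maximising this gain is equivalent to minimising squared error.
            gain = left_sum ** 2 / rank + right_sum ** 2 / (n - rank)
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / rank, right_sum / (n - rank))
    return best[1:]


def distill(X, soft_targets, rounds=100, lr=1.0):
    """Boost stumps on the negative gradient of the soft-label cross-entropy."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(rounds):
        residuals = [t - sigmoid(s) for t, s in zip(soft_targets, scores)]
        j, thr, left, right = fit_stump(X, residuals)
        stumps.append((j, thr, lr * left, lr * right))
        for i, x in enumerate(X):
            scores[i] += lr * (left if x[j] <= thr else right)
    return stumps


def predict_proba(stumps, x):
    """Student probability: sigmoid of the summed leaf values."""
    return sigmoid(sum(l if x[j] <= thr else r for j, thr, l, r in stumps))


# Hypothetical teacher probabilities on a 1-D grid (not from any real teacher).
X = [[i / 50.0 - 1.0] for i in range(101)]
soft = [sigmoid(2.0 * x[0]) for x in X]

student = distill(X, soft)
before = soft_cross_entropy([0.5] * len(X), soft)
after = soft_cross_entropy([predict_proba(student, x) for x in X], soft)
print(f"soft cross-entropy: before={before:.4f}, after={after:.4f}")
```

The point of the sketch is only that a tree ensemble can regress onto probabilities instead of hard labels; a production student would presumably use a library objective and second-order updates.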
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that distilling tabular foundation models (e.g., TabICLv2) into gradient-boosted trees (XGBoost or CatBoost) via stratified out-of-fold soft targets enables near-teacher performance (0.882 macro-mean AUC, 96.5% recovery) at a CPU latency of ~1.9 ms, yielding 38x–860x speedups over the teachers and a statistically significant edge over tuned CatBoost (Wilcoxon p=0.0008, 51% win rate) across 153 datasets from TALENT, OpenML-CC18, TabZilla, and TabArena. Additional claims include exact transfer of teacher rank to student rank, larger gains on low-dimensional data, limited benefit from multi-teacher averaging for trees, and degradation when the teacher itself underperforms CatBoost.
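A win rate near 50% and a small Wilcoxon p-value can coexist when wins are larger in magnitude than losses. The sketch below illustrates this with a plain-Python signed-rank test (normal approximation, zero differences dropped, no tie correction to the variance). The per-dataset AUC values are invented for illustration and are not the paper's results.

```python
import math


def wilcoxon_signed_rank(a, b):
    """Two-sided Wilcoxon signed-rank test via the normal approximation."""
    diffs = [x - y for x, y in zip(a, b) if x != y]
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a run of tied magnitudes
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided standard-normal tail
    return w_plus, z, p


# Hypothetical AUCs on 40 datasets: 20 clear wins, 20 marginal losses.
baseline = [0.80] * 40
gaps = [0.010 + 0.001 * i for i in range(20)] + [-(0.0005 + 0.0001 * i) for i in range(20)]
student = [b + g for b, g in zip(baseline, gaps)]

win_rate = sum(s > b for s, b in zip(student, baseline)) / len(baseline)
w_plus, z, p = wilcoxon_signed_rank(student, baseline)
print(f"win rate={win_rate:.2f}, W+={w_plus:.0f}, z={z:.2f}, p={p:.4f}")
```

The signed-rank statistic weights each dataset by the rank of its absolute difference, which is why it can detect an edge that the win rate alone hides.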
Significance. If the central empirical claims hold, the work offers a practical bridge for latency-sensitive tabular applications by converting slow but accurate TFMs into natively fast CPU models without sacrificing most performance. Strengths include the large-scale evaluation on 153 datasets with Wilcoxon tests and win-rate reporting, the open-sourced TabTune pipeline for reproducibility, and the identification of dimension-dependent effects and rank transfer. These elements make the contribution verifiable and potentially impactful for deployment scenarios where GPU inference is unavailable.
major comments (1)
- [Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.
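The diagnostics requested here are cheap to compute. A minimal sketch, assuming soft labels are available as a list of per-row probability vectors; the function name, the 0.99 threshold for "near one-hot", and the two example label sets are illustrative assumptions.

```python
import math
from statistics import mean, pstdev


def soft_label_diagnostics(soft_labels, one_hot_threshold=0.99):
    """Summarise how much inter-class structure a set of soft labels retains."""
    entropies, max_probs = [], []
    for p in soft_labels:
        # Shannon entropy in nats, skipping zero entries (0 * log 0 = 0).
        entropies.append(-sum(q * math.log(q) for q in p if q > 0))
        max_probs.append(max(p))
    n_classes = len(soft_labels[0])
    return {
        "mean_entropy": mean(entropies),
        "std_entropy": pstdev(entropies),
        "mean_normalized_entropy": mean(entropies) / math.log(n_classes),
        "mean_max_prob": mean(max_probs),
        "share_near_one_hot": mean(
            1.0 if m > one_hot_threshold else 0.0 for m in max_probs
        ),
    }


collapsed = [[1.0, 0.0], [0.0, 1.0]]    # what leaked labels would look like
structured = [[0.7, 0.3], [0.5, 0.5]]   # labels with inter-class mass left

print(soft_label_diagnostics(collapsed))
print(soft_label_diagnostics(structured))
```

Reporting these per dataset for both in-fold and OOF labels would show directly whether the OOF step preserves the structure the distillation argument relies on.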
minor comments (2)
- [Results tables] The win-rate and p-value reporting is helpful, but adding per-comparison effect sizes or bootstrap confidence intervals on the AUC differences would make the statistical claims more interpretable.
- [Reproducibility] While the pipeline is open-sourced, the manuscript should explicitly list the exact hyperparameter search spaces and random seeds used for all baselines and students to facilitate exact replication of the reported numbers.
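For the bootstrap intervals suggested in the first minor comment, a percentile bootstrap over per-dataset AUC differences is one simple option. The sketch below assumes paired differences (student minus baseline) are already computed; the function name, resample count, and example values are hypothetical.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-dataset differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample datasets with replacement and record the mean of each resample.
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lo, hi


# Hypothetical AUC differences on ten datasets (not the paper's data).
diffs = [0.004, 0.012, 0.001, 0.009, 0.015, 0.003, 0.007, 0.011, 0.002, 0.006]
point, lo, hi = bootstrap_ci(diffs)
print(f"mean diff={point:.4f}, 95% CI=[{lo:.4f}, {hi:.4f}]")
```

Resampling at the dataset level treats each dataset as the unit of analysis, which matches how the macro-mean AUC and the win rate are defined.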
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the distillation procedure. We address the major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.
  Authors: We agree that explicit diagnostics on the soft targets would strengthen the attribution of gains to distillation. The stratified OOF procedure is specifically introduced to prevent the label leakage that occurs with in-context learning teachers on their own training data, which would otherwise collapse the soft targets to near-one-hot vectors. In the revised manuscript we will add aggregate statistics (mean, standard deviation) and distributional plots of soft-label entropy and maximum class probability across the 153 datasets. These will be placed in the distillation procedure section or an appendix. The statistically significant improvement over the tuned CatBoost baseline (Wilcoxon p=0.0008), which is trained on hard labels from the same data, already provides supporting evidence that the soft targets contribute beyond standard supervised tree training.
  Revision: yes
Circularity Check
No circularity: results are direct empirical measurements on held-out data
Full rationale
The manuscript reports performance metrics (AUC, latency, win rates) obtained by running distillation experiments on 153 fixed external datasets and comparing against tuned baselines and the teacher model. These quantities are measured outcomes, not quantities that the paper's own equations or self-citations force to equal the inputs by construction. The stratified OOF labeling step is a data-preparation choice whose effect is then evaluated on held-out folds; it does not create a self-definitional loop. No uniqueness theorem, ansatz, or fitted parameter is renamed as a prediction. The central claims therefore remain independent of the paper's internal definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Stratified out-of-fold predictions from an ICL teacher supply soft targets free of label leakage while retaining inter-class structure.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Stratified out-of-fold (OOF) teacher labeling prevents this. ... Hinton mixed loss ... KL term collapses to one-hot cross-entropy
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: Distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC ... 38x to 860x speedup
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv preprint arXiv:2602.11139, 2026.
- [2] Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, et al. TabPFN-2.5: Advancing the state of the art in tabular foundation models. arXiv preprint arXiv:2511.08667, 2025.
- [3] Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025.
- [4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015.
- [5] Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. Data laundering: Artificially boosting benchmark results through knowledge distillation. arXiv preprint arXiv:2412.15255, 2024.
- [6] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations, 2023.
- [7] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
- [8] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. arXiv preprint arXiv:2410.18164, 2024.
- [9] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-axial attention for tabular in-context learning. In Proceedings of the ACM Web Conference 2026, pages 8673–8676, 2026.
- [10] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning. arXiv preprint arXiv:2511.02818, 2025.
- [11] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites. In Advances in Neural Information Processing Systems, volume 34, 2021.
- [12] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, 36:76336–76369, 2023.
- [13] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why tree-based models still outperform deep learning on tabular data. In Advances in Neural Information Processing Systems, volume 35, 2022.
- [14] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Desai, David Salinas, and Frank Hutter. TabArena: A living benchmark for machine learning on tabular data. Advances in Neural Information Processing Systems, 38, 2026.
- [15] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616, 2018.
- [16] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, volume 27, 2014.
- [17] Eric C. Polley and Mark J. van der Laan. Super learner in prediction. Technical Report 266, U.C. Berkeley Division of Biostatistics Working Paper Series, 2010.
- [18] Sercan Ö. Arik and Tomas Pfister. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679–6687, 2021.
- [19] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the 8th International Workshop on Data Mining for Online Advertising, pages 1–9, 2014.
- [20] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
- [21] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017.
- [22] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
- [23] Freya Behrens and Lenka Zdeborová. Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge. arXiv preprint arXiv:2506.14457, 2025.
- [24] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
- [25] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, 2018.
- [26] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, 2017.
- [27] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models. arXiv preprint arXiv:2511.02802, 2025.
A Implementation Details. K=5 stratified folds. Temperature range [1...