arxiv: 2605.18654 · v1 · pith:6WEC2Z2Xnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

keywords distillation · tabular foundation models · XGBoost · gradient boosted trees · in-context learning · out-of-fold labeling · CPU inference · tabular classification

The pith

Distilling tabular foundation models into gradient-boosted trees via out-of-fold soft labels delivers near-teacher AUC at under 2 ms of CPU inference latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular foundation models (TFMs) classify well but run too slowly for real-time CPU applications such as fraud scoring. The paper distills them offline into XGBoost or CatBoost students, using stratified out-of-fold teacher labeling to avoid the label leakage that otherwise collapses the soft targets of in-context learning teachers. Distilling TabICLv2 into XGBoost yields 0.882 macro-mean AUC (96.5 percent of the teacher's) at 1.9 ms inference time on CPU. Across 153 datasets the student holds a statistically significant edge over tuned CatBoost (Wilcoxon p = 0.0008, 51 percent win rate), with speedups of 38x to 860x across teacher-student pairs. Teacher rank transfers to student rank, and gains are larger on low-dimensional data.

Core claim

By using stratified out-of-fold labeling to create non-leaking soft targets from in-context learning tabular foundation models, distillation into XGBoost produces CPU-native students that retain 96.5 percent of teacher macro-mean AUC at 1.9 ms latency across 153 datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, while showing a Wilcoxon-significant edge over CatBoost.
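As a rough worked check of the headline numbers, the teacher's macro-mean AUC can be backed out from the two reported figures. This assumes the 96.5 percent retention is the ratio of macro-mean AUCs; the abstract does not say whether it is instead a mean of per-dataset ratios, so the values below are an approximation, not a reported result.

```latex
\mathrm{AUC}_{\text{teacher}} \approx \frac{\mathrm{AUC}_{\text{student}}}{0.965}
  = \frac{0.882}{0.965} \approx 0.914,
\qquad
\mathrm{AUC}_{\text{teacher}} - \mathrm{AUC}_{\text{student}} \approx 0.914 - 0.882 = 0.032
```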

What carries the argument

Stratified out-of-fold (OOF) teacher labeling to generate soft targets that retain inter-class structure without label leakage from in-context learning models.
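A minimal sketch of stratified OOF teacher labeling, assuming scikit-learn is available. The function name `oof_soft_labels` is hypothetical, and `LogisticRegression` is only a stand-in for an ICL teacher such as TabICLv2; the paper's own pipeline lives in TabTune and may differ. The point is that every row is scored by a teacher whose context never contained that row.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold


def oof_soft_labels(make_teacher, X, y, n_splits=5, seed=0):
    """Return an (n_samples, n_classes) matrix of out-of-fold teacher probabilities."""
    classes = np.unique(y)
    soft = np.zeros((len(y), len(classes)))
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    for context_idx, query_idx in skf.split(X, y):
        # The teacher only sees the context rows; the query rows are held out.
        teacher = make_teacher()
        teacher.fit(X[context_idx], y[context_idx])
        proba = teacher.predict_proba(X[query_idx])
        # Map teacher columns onto the global class order in case a fold lacks a class.
        cols = np.searchsorted(classes, teacher.classes_)
        soft[np.ix_(query_idx, cols)] = proba
    return soft


# Synthetic data, purely illustrative.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)
soft_targets = oof_soft_labels(lambda: LogisticRegression(max_iter=1000), X, y)
```

Each row of `soft_targets` is filled exactly once, by the fold in which that row was a query. Stratification keeps class proportions similar across folds, so rare classes still appear in every context set.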

If this is right

  • Teacher rank transfers exactly to student rank across datasets.
  • Gains over CatBoost concentrate on low-dimensional data with fewer than 21 features.
  • Multi-teacher averaging improves MLP students but adds negligible value for tree students.
  • Distillation degrades performance on high-dimensional tasks where the teacher already trails CatBoost.
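The multi-teacher finding above can be pictured with a small sketch. It assumes "averaging" means an arithmetic mean of the teachers' class-probability matrices, which the abstract does not spell out; `average_teachers` is a hypothetical helper and the probabilities are made up.

```python
import numpy as np


def average_teachers(prob_matrices, weights=None):
    """Average several (n_samples, n_classes) soft-target matrices."""
    stacked = np.stack(prob_matrices)  # (n_teachers, n_samples, n_classes)
    avg = np.average(stacked, axis=0, weights=weights)
    # Renormalise so each row still sums to 1 after weighting.
    return avg / avg.sum(axis=1, keepdims=True)


# Illustrative soft targets from two hypothetical teachers.
teacher_a = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])
teacher_b = np.array([[0.5, 0.4, 0.1], [0.3, 0.6, 0.1]])
ensemble = average_teachers([teacher_a, teacher_b])
```

The averaged matrix can then be used as the distillation target in place of a single teacher's output.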

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method could extend tabular foundation model use to GPU-free environments like mobile or embedded systems.
  • The OOF soft label approach might improve distillation in other domains where label leakage occurs in self-supervised or in-context setups.
  • Testing on regression tasks or with different student architectures could reveal broader applicability of the leakage-prevention technique.

Load-bearing premise

Stratified out-of-fold teacher labeling prevents label leakage in ICL models and preserves sufficient inter-class structure in the soft targets for effective distillation to tree students.
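To make the premise concrete, here is a hedged end-to-end sketch on a binary task. Everything in it is a stand-in: `LogisticRegression` plays the teacher, scikit-learn's `GradientBoostingRegressor` plays the tree student (the paper uses XGBoost or CatBoost), and regressing the positive-class probability with squared error is an assumed loss, not one confirmed by the source.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict, train_test_split

# Synthetic binary data, purely illustrative.
X, y = make_classification(
    n_samples=600, n_features=12, n_informative=6,
    n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Step 1: out-of-fold soft targets from the stand-in teacher.
teacher = LogisticRegression(max_iter=1000)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
soft_pos = cross_val_predict(
    teacher, X_train, y_train, cv=cv, method="predict_proba"
)[:, 1]

# Step 2: fit the tree student on the soft targets instead of the hard labels.
student = GradientBoostingRegressor(
    n_estimators=200, max_depth=3, learning_rate=0.05, random_state=0
)
student.fit(X_train, soft_pos)

# Step 3: compare student and teacher on held-out rows.
# Regressor outputs may fall slightly outside [0, 1]; AUC only uses their ranking.
student_auc = roc_auc_score(y_test, student.predict(X_test))
teacher_auc = roc_auc_score(
    y_test, teacher.fit(X_train, y_train).predict_proba(X_test)[:, 1]
)
retention = student_auc / teacher_auc
```

`retention` is the analogue of the 96.5 percent figure, computed here on a single synthetic dataset rather than as a macro-mean over a benchmark.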

What would settle it

Running the distillation pipeline with standard in-fold instead of out-of-fold teacher labeling and observing near-one-hot soft targets that lead to students with AUC far below 96 percent of the teacher.
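The check described above can be illustrated with a toy teacher that memorizes its context. The sketch uses a distance-weighted k-nearest-neighbours classifier as a hypothetical analogue of an ICL model: when it scores a row that is already in its context, that row's zero-distance match dominates and the output is one-hot. Real TFMs leak through a different mechanism; only the symptom is being imitated here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

# Synthetic, deliberately overlapping classes so honest predictions are uncertain.
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_classes=3,
    n_clusters_per_class=1, class_sep=0.5, random_state=0,
)

# Stand-in for a memorizing teacher (an assumption, not the paper's model).
teacher = KNeighborsClassifier(n_neighbors=15, weights="distance", algorithm="kd_tree")

# In-fold: the teacher scores rows that are part of its own context.
in_fold = teacher.fit(X, y).predict_proba(X)

# Out-of-fold: every row is scored by a teacher that never saw it.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
oof = cross_val_predict(teacher, X, y, cv=cv, method="predict_proba")

in_fold_conf = in_fold.max(axis=1).mean()  # mean top-class probability, in-fold
oof_conf = oof.max(axis=1).mean()          # mean top-class probability, out-of-fold
```

If the in-fold confidence sits at or near 1 while the OOF confidence is clearly lower, the in-fold targets carry no more information than the hard labels.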

Original abstract

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
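The pairing of a small p-value with a 51% win rate can look odd at first. A sketch with SciPy shows how the two statistics are computed on paired per-dataset AUCs; the numbers are synthetic and generated only for illustration, not taken from the paper.

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
n_datasets = 153  # matches the benchmark size; the AUC values below are synthetic

# Hypothetical per-dataset AUCs for a baseline and a distilled student.
baseline_auc = rng.uniform(0.70, 0.98, size=n_datasets)
student_auc = np.clip(
    baseline_auc + rng.normal(0.004, 0.015, size=n_datasets), 0.0, 1.0
)

diff = student_auc - baseline_auc

# Paired, two-sided Wilcoxon signed-rank test on the per-dataset differences.
stat, p_value = wilcoxon(student_auc, baseline_auc)

# Win rate: share of datasets where the student is strictly better.
win_rate = float(np.mean(diff > 0))
```

The signed-rank test weighs the magnitude of each difference, not just its sign, so it can be significant at a near-even win rate when wins tend to be larger than losses.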

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that distilling tabular foundation models (e.g., TabICLv2) into gradient-boosted trees (XGBoost or CatBoost) via stratified out-of-fold soft targets enables near-teacher performance (0.882 macro-mean AUC, 96.5% recovery) at CPU speeds of ~1.9 ms, yielding 38x–860x speedups over the teachers and a statistically significant edge over tuned CatBoost (Wilcoxon p=0.0008, 51% win rate) across 153 datasets from TALENT, OpenML-CC18, TabZilla, and TabArena. Additional claims include exact transfer of teacher rank to student rank, larger gains on low-dimensional data, limited benefit from multi-teacher averaging for trees, and degradation when the teacher itself underperforms CatBoost.

Significance. If the central empirical claims hold, the work offers a practical bridge for latency-sensitive tabular applications by converting slow but accurate TFMs into natively fast CPU models without sacrificing most performance. Strengths include the large-scale evaluation on 153 datasets with Wilcoxon tests and win-rate reporting, the open-sourced TabTune pipeline for reproducibility, and the identification of dimension-dependent effects and rank transfer. These elements make the contribution verifiable and potentially impactful for deployment scenarios where GPU inference is unavailable.

major comments (1)
  1. [Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.
minor comments (2)
  1. [Results tables] The win-rate and p-value reporting is helpful, but adding per-comparison effect sizes or bootstrap confidence intervals on the AUC differences would make the statistical claims more interpretable.
  2. [Reproducibility] While the pipeline is open-sourced, the manuscript should explicitly list the exact hyperparameter search spaces and random seeds used for all baselines and students to facilitate exact replication of the reported numbers.
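The bootstrap interval suggested in the first minor comment could look roughly like the following. This is a percentile bootstrap over datasets, written with NumPy only; `bootstrap_mean_ci` is a hypothetical helper and the AUC differences are synthetic placeholders.

```python
import numpy as np


def bootstrap_mean_ci(diffs, n_boot=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of paired per-dataset differences."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(diffs, dtype=float)
    # Resample datasets with replacement, n_boot times.
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    boot_means = diffs[idx].mean(axis=1)
    low, high = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), low, high


# Synthetic student-minus-baseline AUC differences, for illustration only.
rng = np.random.default_rng(1)
synthetic_diffs = rng.normal(0.004, 0.015, size=153)
mean_diff, ci_low, ci_high = bootstrap_mean_ci(synthetic_diffs)
```

An interval that excludes zero would complement the Wilcoxon p-value by also conveying how large the average gain is.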

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the distillation procedure. We address the major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.

    Authors: We agree that explicit diagnostics on the soft targets would strengthen the attribution of gains to distillation. The stratified OOF procedure is specifically introduced to prevent the label leakage that occurs with in-context learning teachers on their own training data, which would otherwise collapse the soft targets to near-one-hot vectors. In the revised manuscript we will add aggregate statistics (mean, standard deviation) and distributional plots of soft-label entropy and maximum class probability across the 153 datasets. These will be placed in the distillation procedure section or an appendix. The statistically significant improvement over the tuned CatBoost baseline (Wilcoxon p=0.0008), which is trained on hard labels from the same data, already provides supporting evidence that the soft targets contribute beyond standard supervised tree training.

    Revision: yes
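The diagnostics discussed in this exchange are cheap to compute. A minimal NumPy sketch, with a hypothetical helper name and made-up probability matrices, summarises soft-label entropy and maximum class probability by mean and standard deviation:

```python
import numpy as np


def soft_label_diagnostics(soft, eps=1e-12):
    """Summarise how much inter-class structure a soft-target matrix retains."""
    soft = np.asarray(soft, dtype=float)
    # Shannon entropy per row in nats; clipping avoids log(0) for exact zeros.
    entropy = -(soft * np.log(np.clip(soft, eps, 1.0))).sum(axis=1)
    max_prob = soft.max(axis=1)
    return {
        "entropy_mean": float(entropy.mean()),
        "entropy_std": float(entropy.std()),
        "max_prob_mean": float(max_prob.mean()),
        "max_prob_std": float(max_prob.std()),
        "entropy_upper_bound": float(np.log(soft.shape[1])),  # uniform distribution
    }


# Illustrative inputs: collapsed (one-hot) targets versus targets with structure.
collapsed = np.eye(3)[[0, 1, 2, 0]]
structured = np.array(
    [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.5, 0.4, 0.1]]
)
collapsed_stats = soft_label_diagnostics(collapsed)
structured_stats = soft_label_diagnostics(structured)
```

Collapsed targets show zero entropy and a maximum probability of 1 on every row, which is the failure signature the referee asks the authors to rule out.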

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on held-out data

Full rationale

The manuscript reports performance metrics (AUC, latency, win rates) obtained by running distillation experiments on 153 fixed external datasets and comparing against tuned baselines and the teacher model. These quantities are measured outcomes, not quantities that the paper's own equations or self-citations force to equal the inputs by construction. The stratified OOF labeling step is a data-preparation choice whose effect is then evaluated on held-out folds; it does not create a self-definitional loop. No uniqueness theorem, ansatz, or fitted parameter is renamed as a prediction. The central claims therefore remain independent of the paper's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that OOF predictions supply unbiased soft targets and that knowledge transfer via distillation works for tree students; no free parameters or invented entities are introduced beyond standard ML practice.

axioms (1)
  • Domain assumption: stratified out-of-fold predictions from an ICL teacher supply soft targets free of label leakage while retaining inter-class structure.
    Invoked to justify the teacher labeling step that enables distillation.

pith-pipeline@v0.9.0 · 5845 in / 1233 out tokens · 61147 ms · 2026-05-20T11:46:17.239332+00:00 · methodology



Appendix A: Implementation Details

K=5 stratified folds. Temperature range [1...