Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Aditya Tanna; Mohamed Bouadi; Nassim Bouarour; Pratinav Seth; Vinay Kumar Sankarapu

arxiv: 2605.18654 · v1 · pith:6WEC2Z2Xnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 11:46 UTC · model grok-4.3

keywords: distillation · tabular foundation models · XGBoost · gradient boosted trees · in-context learning · out-of-fold labeling · CPU inference · tabular classification

The pith

Distilling tabular foundation models into gradient-boosted trees via out-of-fold soft labels delivers near-teacher performance at CPU speeds under 2 ms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tabular foundation models deliver strong classification performance but run too slowly for real-time CPU applications such as fraud scoring. The work demonstrates that distilling these models offline into XGBoost or CatBoost using stratified out-of-fold teacher labeling avoids the label leakage that collapses soft targets in in-context learning teachers. This produces student models with 0.882 macro-mean AUC, 96.5 percent of the teacher, at 1.9 milliseconds inference time on CPU. Across 153 datasets the approach yields statistically significant gains over tuned CatBoost while providing speedups from 38x to 860x. Teacher ranks transfer to students and gains are larger on low-dimensional data.

Core claim

By using stratified out-of-fold labeling to create non-leaking soft targets from in-context learning tabular foundation models, distillation into XGBoost produces CPU-native students that retain 96.5 percent of teacher macro-mean AUC at 1.9 ms latency across 153 datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, while showing a Wilcoxon-significant edge over CatBoost.

What carries the argument

Stratified out-of-fold (OOF) teacher labeling to generate soft targets that retain inter-class structure without label leakage from in-context learning models.
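The labeling step can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's or TabTune's implementation: `toy_icl_teacher` is a hypothetical kernel-vote stand-in for an ICL teacher, and the data, bandwidth, and function names are invented for the example. It contrasts in-fold labeling (each row scored with itself in the context) against stratified OOF labeling (each row scored by a context that excludes its fold).

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds, dealing each class round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    fold_of = [0] * len(labels)
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            fold_of[idx] = pos % k
    return fold_of


def toy_icl_teacher(ctx_X, ctx_y, query, n_classes, bandwidth=0.25):
    """Hypothetical ICL stand-in: kernel-weighted vote over context rows.
    A query that sits in its own context receives weight 1 from its own label."""
    scores = [1e-9] * n_classes
    for x, label in zip(ctx_X, ctx_y):
        dist2 = sum((a - b) ** 2 for a, b in zip(x, query))
        scores[label] += math.exp(-dist2 / (2 * bandwidth ** 2))
    total = sum(scores)
    return [s / total for s in scores]


def oof_soft_labels(X, y, n_classes, k=5, seed=0):
    """Score every row with a context that excludes that row's own fold."""
    fold_of = stratified_folds(y, k, seed)
    soft = [None] * len(X)
    for fold in range(k):
        ctx = [i for i in range(len(X)) if fold_of[i] != fold]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in range(len(X)):
            if fold_of[i] == fold:
                soft[i] = toy_icl_teacher(ctx_X, ctx_y, X[i], n_classes)
    return soft


def in_fold_soft_labels(X, y, n_classes):
    """Leaky variant: every row is scored with itself in the context."""
    return [toy_icl_teacher(X, y, x, n_classes) for x in X]


def mean_true_class_prob(soft, labels):
    return sum(p[t] for p, t in zip(soft, labels)) / len(labels)


# Synthetic data: two overlapping 2-D Gaussian classes, 60 rows each.
rng = random.Random(42)
X, y = [], []
for label, center in enumerate([(0.0, 0.0), (1.0, 1.0)]):
    for _ in range(60):
        X.append((rng.gauss(center[0], 1.0), rng.gauss(center[1], 1.0)))
        y.append(label)

folds = stratified_folds(y, k=5, seed=0)
oof = oof_soft_labels(X, y, n_classes=2, k=5, seed=0)
leaky = in_fold_soft_labels(X, y, n_classes=2)
print("mean p(true class), in-fold:", mean_true_class_prob(leaky, y))
print("mean p(true class), OOF:    ", mean_true_class_prob(oof, y))
```

In this toy setting the in-fold targets put more mass on the given label because each row votes for itself. That is the leakage OOF labeling is meant to remove; the size of the effect for a real TFM is not something this sketch can estimate.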

If this is right

Teacher rank transfers exactly to student rank across datasets.
Gains over CatBoost concentrate on low-dimensional data with fewer than 21 features.
Multi-teacher averaging improves MLP students but adds negligible value for tree students.
Distillation degrades performance on high-dimensional tasks where the teacher already trails CatBoost.
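The first implication, that teacher rank transfers to student rank, can be checked mechanically once per-teacher scores are available. The snippet below is a sketch only: the teacher names and AUC values are made up for illustration and are not taken from the paper.

```python
def ranking(scores):
    """Return names ordered from best to worst score."""
    return [name for name, _ in sorted(scores.items(), key=lambda kv: -kv[1])]


# Hypothetical numbers for illustration only (not results from the paper).
teacher_auc = {"teacher_A": 0.90, "teacher_B": 0.88, "teacher_C": 0.86}
# Each student is keyed by the teacher it was distilled from.
student_auc = {"teacher_A": 0.87, "teacher_B": 0.85, "teacher_C": 0.84}

rank_transfers = ranking(teacher_auc) == ranking(student_auc)
print("teacher order:", ranking(teacher_auc))
print("student order:", ranking(student_auc))
print("rank transfers exactly:", rank_transfers)
```

Exact list equality is the strictest reading of "transfers exactly"; with many teachers and near-ties, a rank correlation would be a softer alternative.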

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method could extend tabular foundation model use to GPU-free environments like mobile or embedded systems.
The OOF soft label approach might improve distillation in other domains where label leakage occurs in self-supervised or in-context setups.
Testing on regression tasks or with different student architectures could reveal broader applicability of the leakage-prevention technique.

Load-bearing premise

Stratified out-of-fold teacher labeling prevents label leakage in ICL models and preserves sufficient inter-class structure in the soft targets for effective distillation to tree students.
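A short derivation shows why near-one-hot targets leave nothing to distill. It assumes the textbook Hinton-style mixed loss at temperature 1; the symbols alpha, p (teacher probabilities), q (student probabilities), and y (true class) are notation chosen here, and the paper's exact objective may differ.

```latex
\mathcal{L} \;=\; \alpha\,\mathrm{CE}(y, q) \;+\; (1-\alpha)\,\mathrm{KL}(p \,\|\, q),
\qquad
\mathrm{KL}(p \,\|\, q) \;=\; \sum_{k} p_k \log\frac{p_k}{q_k}.

% If leakage drives the teacher output to the one-hot vector p = e_y
% (using the convention 0 \log 0 = 0):
\mathrm{KL}(e_y \,\|\, q)
  \;=\; 1\cdot\log\frac{1}{q_y} \;+\; \sum_{k \neq y} 0\cdot\log\frac{0}{q_k}
  \;=\; -\log q_y
  \;=\; \mathrm{CE}(y, q).
```

Under that assumption the mixed loss reduces to plain cross-entropy on the hard label, so the student learns nothing it would not learn from ordinary supervised training.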

What would settle it

Running the distillation pipeline with standard in-fold instead of out-of-fold teacher labeling and observing near-one-hot soft targets that lead to students with AUC far below 96 percent of the teacher.

Original abstract

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
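The headline figures can be cross-checked with a little arithmetic. The sketch below uses only numbers quoted in the abstract; treating "96.5% of teacher AUC" as a ratio of macro-means, and pairing the 1.9 ms student with both ends of the teacher latency range, are assumptions made for the example.

```python
student_auc = 0.882          # macro-mean AUC of the XGBoost student
recovery = 0.965             # student AUC as a fraction of teacher AUC
implied_teacher_auc = student_auc / recovery

student_ms = 1.9             # CPU latency of the student
teacher_ms_range = (151.0, 1275.0)   # GPU latency range of the teachers
speedups = [t / student_ms for t in teacher_ms_range]

print(f"implied teacher macro-mean AUC: {implied_teacher_auc:.3f}")
print(f"speedup at 1.9 ms: {speedups[0]:.1f}x to {speedups[1]:.1f}x")
print("meets the 2 ms budget:", student_ms < 2.0)
```

The implied teacher AUC comes out near 0.914, and both speedups fall inside the reported 38x to 860x range. That range spans all teacher-student pairs, so the other pairs presumably have different student latencies.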

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper shows how stratified OOF labeling lets you distill TabICLv2 into fast XGBoost students that recover 96.5% of teacher AUC at 1.9 ms CPU latency across 153 datasets.

The headline result is that distilling TabICLv2 into an XGBoost student via stratified out-of-fold soft labels recovers 96.5% of the teacher's macro AUC while running at 1.9 ms on CPU, with speedups of 38x to 860x and a statistically significant edge over tuned CatBoost. They also note that teacher rank transfers to the student and that the gains show up mostly on low-dimensional data. Multi-teacher averaging helps MLP students but adds almost nothing for trees, and distillation can hurt when the teacher itself is already behind CatBoost on high-dimensional tasks. The full pipeline is open-sourced in TabTune.

The paper does a good job on the evaluation side. They pull 153 datasets from TALENT, OpenML-CC18, TabZilla, and TabArena, run Wilcoxon tests, report win rates, and open-source the TabTune pipeline. That reduces the usual worries about cherry-picking or irreproducibility. The observation that teacher rank carries over to the student and that gains are larger on low-dimensional data adds some insight beyond the main numbers. The multi-teacher averaging result for MLP students but not trees is also a useful detail.

The soft spot is around the quality of the soft targets themselves. The stress-test concern is fair: if the OOF procedure still produces overconfident, near-hard labels, then the distillation benefit might be overstated and any strong tree model could achieve similar results. The manuscript does not report diagnostics like average entropy or max probability of the soft labels across the datasets. That leaves a small gap in confirming that the inter-class structure is actually being transferred. Still, the consistent outperformance of the distilled model over the baseline makes it unlikely that nothing is happening.

This work is aimed at practitioners who need foundation-model level accuracy in sub-2 ms CPU environments, such as real-time fraud scoring or other tabular tasks with strict latency constraints. Readers focused on deployment tradeoffs and distillation techniques will get the most out of it. The scale of the experiments and the open code make it worth a serious referee's time, even if some additional checks on the label distributions would help. I would recommend putting it through peer review rather than desk rejecting it.

Referee Report

1 major / 2 minor

Summary. The manuscript claims that distilling tabular foundation models (e.g., TabICLv2) into gradient-boosted trees (XGBoost or CatBoost) via stratified out-of-fold soft targets enables near-teacher performance (0.882 macro-mean AUC, 96.5% recovery) at CPU speeds of ~1.9 ms, yielding 38x–860x speedups over the teachers and a statistically significant edge over tuned CatBoost (Wilcoxon p=0.0008, 51% win rate) across 153 datasets from TALENT, OpenML-CC18, TabZilla, and TabArena. Additional claims include exact transfer of teacher rank to student rank, larger gains on low-dimensional data, limited benefit from multi-teacher averaging for trees, and degradation when the teacher itself underperforms CatBoost.
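A 51% win rate next to p = 0.0008 can look contradictory, but the signed-rank test weighs the size of each paired difference, not just its sign. The sketch below implements a two-sided Wilcoxon signed-rank test with the normal approximation (no tie or continuity correction) and runs it on synthetic differences constructed for this example. The data are not the paper's, and the paper's own test settings are not stated here.

```python
import math


def wilcoxon_signed_rank(diffs):
    """Two-sided Wilcoxon signed-rank test via the normal approximation.
    Zero differences are dropped; tied |d| values share their average rank."""
    d = [x for x in diffs if x != 0]
    n = len(d)
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1   # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, x in zip(ranks, d) if x > 0)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return w_plus, z, p_value


# Synthetic AUC differences (student minus baseline) over 153 datasets:
# 78 wins of about +0.01 and 75 losses of about -0.001. Illustration only.
wins = [0.010 + 0.0001 * i for i in range(78)]
losses = [-(0.001 + 0.00001 * i) for i in range(75)]
diffs = wins + losses

win_rate = sum(1 for x in diffs if x > 0) / len(diffs)
w_plus, z, p_value = wilcoxon_signed_rank(diffs)
print(f"win rate = {win_rate:.2f}, W+ = {w_plus}, z = {z:.2f}, p = {p_value:.2e}")
```

Here barely more than half the comparisons are wins, yet the test is highly significant because the wins are much larger than the losses. Whether the paper's differences follow this pattern is exactly what per-dataset effect sizes would show.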

Significance. If the central empirical claims hold, the work offers a practical bridge for latency-sensitive tabular applications by converting slow but accurate TFMs into natively fast CPU models without sacrificing most performance. Strengths include the large-scale evaluation on 153 datasets with Wilcoxon tests and win-rate reporting, the open-sourced TabTune pipeline for reproducibility, and the identification of dimension-dependent effects and rank transfer. These elements make the contribution verifiable and potentially impactful for deployment scenarios where GPU inference is unavailable.

major comments (1)

[Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.
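The diagnostics requested here are cheap to compute. A minimal sketch follows, assuming soft labels are stored as lists of per-class probabilities; the function names and the 0.99 "near-one-hot" threshold are choices made for this example, not values from the manuscript.

```python
import math


def entropy(probs):
    """Shannon entropy in nats, treating 0 * log(0) as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def soft_label_diagnostics(soft_labels, one_hot_threshold=0.99):
    """Aggregate statistics describing how much structure soft labels retain."""
    n = len(soft_labels)
    n_classes = len(soft_labels[0])
    mean_entropy = sum(entropy(p) for p in soft_labels) / n
    return {
        "mean_entropy": mean_entropy,
        # 0 means fully one-hot, 1 means uniform over the classes.
        "normalized_entropy": mean_entropy / math.log(n_classes),
        "mean_max_prob": sum(max(p) for p in soft_labels) / n,
        "frac_near_one_hot": sum(max(p) > one_hot_threshold for p in soft_labels) / n,
    }


# Two tiny hand-written examples: collapsed targets versus structured ones.
collapsed = [[0.999, 0.001], [0.001, 0.999]]
structured = [[0.7, 0.3], [0.2, 0.8]]
collapsed_stats = soft_label_diagnostics(collapsed)
structured_stats = soft_label_diagnostics(structured)
print("collapsed: ", collapsed_stats)
print("structured:", structured_stats)
```

Reporting these per dataset, for in-fold and OOF labels side by side, would directly address the comment.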

minor comments (2)

[Results tables] The win-rate and p-value reporting is helpful, but adding per-comparison effect sizes or bootstrap confidence intervals on the AUC differences would make the statistical claims more interpretable.
[Reproducibility] While the pipeline is open-sourced, the manuscript should explicitly list the exact hyperparameter search spaces and random seeds used for all baselines and students to facilitate exact replication of the reported numbers.
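The bootstrap interval suggested in the first minor comment could look like the sketch below. It is a percentile bootstrap over datasets for the mean paired AUC difference; the differences are synthetic draws generated for this example, and the resample count and seeds are arbitrary choices.

```python
import random


def bootstrap_ci(diffs, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean of paired differences."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample datasets with replacement and record each resample's mean.
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_boot))
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Synthetic per-dataset AUC differences (student minus baseline), 153 datasets.
data_rng = random.Random(1)
diffs = [data_rng.gauss(0.004, 0.02) for _ in range(153)]

mean_diff = sum(diffs) / len(diffs)
lower, upper = bootstrap_ci(diffs)
print(f"mean difference = {mean_diff:.4f}, 95% CI = [{lower:.4f}, {upper:.4f}]")
```

Resampling whole datasets keeps each student-baseline pair together, which matches the paired design of the Wilcoxon comparison.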

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the distillation procedure. We address the major comment below and will revise the manuscript accordingly.

Referee: [Abstract and distillation procedure section] The headline result (96.5% recovery of teacher AUC) depends on the stratified OOF procedure producing soft targets with retained inter-class probability mass rather than near-one-hot vectors. No diagnostics are reported (e.g., per-dataset or aggregate distributions of soft-label entropy or maximum class probability) to confirm that sufficient structure is preserved for the observed gains to be attributable to distillation rather than simply training a strong tree model on the underlying data.

Authors: We agree that explicit diagnostics on the soft targets would strengthen the attribution of gains to distillation. The stratified OOF procedure is specifically introduced to prevent the label leakage that occurs with in-context learning teachers on their own training data, which would otherwise collapse the soft targets to near-one-hot vectors. In the revised manuscript we will add aggregate statistics (mean, standard deviation) and distributional plots of soft-label entropy and maximum class probability across the 153 datasets. These will be placed in the distillation procedure section or an appendix. The statistically significant improvement over the tuned CatBoost baseline (Wilcoxon p=0.0008), which is trained on hard labels from the same data, already provides supporting evidence that the soft targets contribute beyond standard supervised tree training.
Revision: yes

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on held-out data

The manuscript reports performance metrics (AUC, latency, win rates) obtained by running distillation experiments on 153 fixed external datasets and comparing against tuned baselines and the teacher model. These quantities are measured outcomes, not quantities that the paper's own equations or self-citations force to equal the inputs by construction. The stratified OOF labeling step is a data-preparation choice whose effect is then evaluated on held-out folds; it does not create a self-definitional loop. No uniqueness theorem, ansatz, or fitted parameter is renamed as a prediction. The central claims therefore remain independent of the paper's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that OOF predictions supply unbiased soft targets and that knowledge transfer via distillation works for tree students; no free parameters or invented entities are introduced beyond standard ML practice.

axioms (1)

domain assumption: Stratified out-of-fold predictions from an ICL teacher supply soft targets free of label leakage while retaining inter-class structure.
Invoked to justify the teacher labeling step that enables distillation.

pith-pipeline@v0.9.0 · 5845 in / 1233 out tokens · 61147 ms · 2026-05-20T11:46:17.239332+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Stratified out-of-fold (OOF) teacher labeling prevents this. ... Hinton mixed loss ... KL term collapses to one-hot cross-entropy

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC ... 38x to 860x speedup

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

[1] Jingang Qu, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. TabICLv2: A better, faster, scalable, and open tabular foundation model. arXiv preprint arXiv:2602.11139, 2026.
[2] Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Benjamin Jäger, Dominik Safaric, Simone Alessi, Adrian Hayler, et al. TabPFN-2.5: Advancing the state of the art in tabular foundation models. arXiv preprint arXiv:2511.08667, 2025.
[3] Xingxuan Zhang, Gang Ren, Han Yu, Hao Yuan, Hui Wang, Jiansheng Li, Jiayun Wu, Lang Mo, et al. LimiX: Unleashing structured-data modeling capability for generalist intelligence. arXiv preprint arXiv:2509.03505, 2025.
[4] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. In NIPS Deep Learning Workshop, 2015.
[5] Jonibek Mansurov, Akhmed Sakip, and Alham Fikri Aji. Data laundering: Artificially boosting benchmark results through knowledge distillation. arXiv preprint arXiv:2412.15255, 2024.
[6] Noah Hollmann, Samuel Müller, Katharina Eggensperger, and Frank Hutter. TabPFN: A transformer that solves small tabular classification problems in a second. In International Conference on Learning Representations, 2023.
[7] Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model. Nature, 637:319–326, 2025.
[8] Junwei Ma, Valentin Thomas, Rasa Hosseinzadeh, Alex Labach, Hamidreza Kamkari, Jesse C Cresswell, Keyvan Golestan, Guangwei Yu, Anthony L Caterini, and Maksims Volkovs. TabDPT: Scaling tabular foundation models on real data. arXiv preprint arXiv:2410.18164, 2024.
[9] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-Bix: Bi-axial attention for tabular in-context learning. In Proceedings of the ACM Web Conference 2026, pages 8673–8676, 2026.
[10] Mohamed Bouadi, Pratinav Seth, Aditya Tanna, and Vinay Kumar Sankarapu. Orion-MSP: Multi-scale sparse attention for tabular in-context learning. arXiv preprint arXiv:2511.02818, 2025.
[11] Bernd Bischl, Giuseppe Casalicchio, Matthias Feurer, Pieter Gijsbers, Frank Hutter, Michel Lang, Rafael G. Mantovani, Jan N. van Rijn, and Joaquin Vanschoren. OpenML benchmarking suites. In Advances in Neural Information Processing Systems, volume 34, 2021.
[12] Duncan McElfresh, Sujay Khandagale, Jonathan Valverde, Vishak Prasad C, Ganesh Ramakrishnan, Micah Goldblum, and Colin White. When do neural nets outperform boosted trees on tabular data? Advances in Neural Information Processing Systems, 36:76336–76369, 2023.
[13] Léo Grinsztajn, Edouard Oyallon, and Gaël Varoquaux. Why tree-based models still outperform deep learning on tabular data. In Advances in Neural Information Processing Systems, volume 35, 2022.
[14] Nick Erickson, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Desai, David Salinas, and Frank Hutter. TabArena: A living benchmark for machine learning on tabular data. Advances in Neural Information Processing Systems, 38, 2026.
[15] Tommaso Furlanello, Zachary C. Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616, 2018.
[16] Jimmy Ba and Rich Caruana. Do deep nets really need to be deep? In Advances in Neural Information Processing Systems, volume 27, 2014.
[17] Eric C. Polley and Mark J. van der Laan. Super learner in prediction. Technical Report 266, U.C. Berkeley Division of Biostatistics Working Paper Series, 2010.
[18] Sercan Ö. Arik and Tomas Pfister. TabNet: Attentive interpretable tabular learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6679–6687, 2021.
[19] Xinran He, Junfeng Pan, Ou Jin, Tianbing Xu, Bo Liu, Tao Xu, Yanxin Shi, Antoine Atallah, Ralf Herbrich, Stuart Bowers, et al. Practical lessons from predicting clicks on ads at Facebook. In Proceedings of the 8th International Workshop on Data Mining for Online Advertising, pages 1–9, 2014.
[20] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.
[21] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330, 2017.
[22] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632, 2005.
[23] Freya Behrens and Lenka Zdeborová. Dataset distillation for memorized data: Soft labels can leak held-out teacher knowledge. arXiv preprint arXiv:2506.14457, 2025.
[24] Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794, 2016.
[25] Liudmila Prokhorenkova, Gleb Gusev, Aleksandr Vorobev, Anna Veronika Dorogush, and Andrey Gulin. CatBoost: Unbiased boosting with categorical features. In Advances in Neural Information Processing Systems, volume 31, 2018.
[26] Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. LightGBM: A highly efficient gradient boosting decision tree. In Advances in Neural Information Processing Systems, volume 30, 2017.
[27] Aditya Tanna, Pratinav Seth, Mohamed Bouadi, Utsav Avaiya, and Vinay Kumar Sankarapu. TabTune: A unified library for inference and fine-tuning tabular foundation models. arXiv preprint arXiv:2511.02802, 2025.

A Implementation Details: K=5 stratified folds. Temperature range [1...
