Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models
Pith reviewed 2026-05-20 12:01 UTC · model grok-4.3
The pith
For tabular foundation models in credit risk, the way context is resampled matters more than which model family is chosen.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add three to four AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of five thousand to ten thousand examples, the strongest TFMs reach the AUC of classical baselines trained on the full data while also recovering meaningful default-class recall that default-threshold GBDTs do not. The authors frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
What carries the argument
The seven context-construction strategies (uniform, balanced, and hybrid sampling variants) that determine which examples enter the in-context window for tabular foundation models.
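To make the idea of context construction concrete, the following is a minimal sketch of three sampling families (uniform, balanced, hybrid) for choosing which rows enter the in-context window. The paper's seven strategies are not defined on this page, so the function name `build_context`, the `minority_share` parameter, and the reading of "hybrid" as a fixed intermediate share of defaults are all assumptions made for illustration.

```python
import random
from collections import Counter

def build_context(labels, size, strategy="balanced", minority_share=0.5, seed=0):
    """Return indices of rows to place in the in-context window.

    strategy:
      "uniform"  - sample rows at random, keeping the pool's default rate
      "balanced" - equal numbers of default (1) and non-default (0) rows
      "hybrid"   - hypothetical blend: a fixed minority_share of defaults
    """
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    if strategy == "uniform":
        return rng.sample(range(len(labels)), size)
    share = 0.5 if strategy == "balanced" else minority_share
    # If the pool has too few defaults, take all of them and fill with negatives.
    n_pos = min(len(pos), round(size * share))
    n_neg = min(len(neg), size - n_pos)
    idx = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(idx)
    return idx

# Toy pool with an 8% default rate (illustrative, not taken from the paper).
labels = [1] * 800 + [0] * 9200
for name in ("uniform", "balanced", "hybrid"):
    ctx = build_context(labels, 1000, name, minority_share=0.25)
    counts = Counter(labels[i] for i in ctx)
    print(name, counts[1] / len(ctx))
```

Under these assumptions the uniform context carries roughly the pool's 8% default rate, while the balanced and hybrid contexts carry 50% and 25% defaults, which is the kind of difference in presentation the paper's argument turns on.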
If this is right
- TFMs supplied with balanced contexts of 5K to 10K examples match the AUC of classical models trained on the entire dataset.
- These TFMs recover default-class recall that gradient-boosted trees with default thresholds miss.
- Context construction becomes the main controllable lever for deploying TFMs in imbalanced tabular credit tasks.
- Performance differences from sampling choices exceed differences across TFM families on both Home Credit and Lending Club data.
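The distinction between AUC and default-class recall in the points above can be illustrated with a small sketch. AUC-ROC depends only on how scores rank defaults against non-defaults, whereas recall depends on a decision threshold, so a model can rank well and still flag no defaults at the conventional 0.5 cutoff. The scores below are invented for illustration and do not come from the paper; the helper names are hypothetical.

```python
def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold):
    """Share of true defaults flagged when score >= threshold."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    return sum(s >= threshold for s in pos) / len(pos)

# Made-up scores: defaults rank above most non-defaults but all stay below 0.5.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.40, 0.35, 0.30, 0.12, 0.20, 0.10, 0.08, 0.07, 0.05, 0.04, 0.03, 0.02]

print(auc_roc(y_true, scores))          # 0.96875
print(recall_at(y_true, scores, 0.5))   # 0.0
print(recall_at(y_true, scores, 0.25))  # 0.75
```

The ranking is nearly perfect, yet recall at 0.5 is zero; lowering the threshold recovers three of the four defaults. This is why AUC parity with a baseline does not by itself say anything about recall at a fixed threshold.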
Where Pith is reading between the lines
- Teams deploying TFMs on imbalanced tabular problems may obtain larger gains by iterating on context resampling before testing additional model families.
- The same emphasis on presentation could apply to other high-stakes imbalanced domains such as fraud detection where in-context learning is feasible.
- Experiments that vary imbalance ratios or test context sizes beyond 50K would clarify whether the reported sufficiency of 5K to 10K balanced examples generalizes.
Load-bearing premise
The seven tested context-construction strategies and sizes from 1K to 50K capture the dominant sensitivities of TFMs to input presentation without other untested factors such as feature encoding or latency constraints driving the observed differences.
What would settle it
A new TFM family whose AUC-ROC differs from the existing families by more than the three-to-four-point spread attributed to resampling strategy, on the same two datasets and with identical context setups, would falsify the claim that context strategy dominates architecture choice.
Original abstract
Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript benchmarks four classical models against five tabular foundation models (TFMs) for credit default prediction on the Home Credit and Lending Club datasets. It systematically varies seven context-construction strategies (uniform, balanced, and hybrid sampling variants) and context sizes (1K–50K examples), reporting that context strategy accounts for more AUC-ROC variance than TFM family choice, with balanced/hybrid sampling improving performance by 3–4 AUC points over uniform sampling; at 5K–10K balanced context the best TFMs match full-data classical baselines while improving default-class recall.
Significance. If the central empirical claim is substantiated, the work would usefully redirect attention from TFM architecture search to context-construction practices for imbalanced tabular in-context learning. The large experimental grid on public datasets, direct comparison to classical baselines, and recovery of meaningful recall are concrete strengths that would support practical guidance for credit-risk deployment.
major comments (1)
- [Abstract and §4 (Results)] The claim that context strategy explains more variance in AUC-ROC than TFM family is presented without an explicit variance-partitioning analysis (two-way ANOVA, mixed-effects model, or eta-squared decomposition that includes strategy × family interaction terms). The reported 3–4 AUC gap is therefore an informal range comparison whose robustness across the full design grid (datasets × strategies × sizes × models) cannot be verified from the given description.
minor comments (2)
- [§3] The exact definitions and implementation details of the seven context-construction strategies (including how balanced and hybrid sampling are operationalized and how prompt formatting is controlled) should be expanded in §3 to allow full reproduction.
- [Figures and Tables] Error bars, number of random seeds, and any statistical tests for the reported AUC differences are not described in the abstract or visible results summary; these should be added to all figures and tables.
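As a rough illustration of the reporting the last comment asks for, the sketch below summarizes a paired AUC difference across random seeds as a mean with its standard error. The per-seed values are invented placeholders, not results from the paper, and `paired_summary` is a hypothetical helper.

```python
from math import sqrt
from statistics import mean, stdev

def paired_summary(auc_a, auc_b):
    """Mean paired AUC difference (a - b) and its standard error across seeds."""
    diffs = [a - b for a, b in zip(auc_a, auc_b)]
    return mean(diffs), stdev(diffs) / sqrt(len(diffs))

# Invented per-seed AUCs for two context strategies (same seeds, same model).
balanced = [0.762, 0.758, 0.765, 0.760, 0.761]
uniform = [0.728, 0.731, 0.725, 0.730, 0.727]

diff, se = paired_summary(balanced, uniform)
print(f"mean difference {diff:.4f} +/- {se:.4f} (standard error, n={len(balanced)})")
```

Pairing by seed removes seed-to-seed variation that is shared by both strategies, which usually gives a tighter interval than comparing two independent means.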
Simulated Author's Rebuttal
We thank the referee for the positive assessment of our experimental design, the practical relevance for credit-risk applications, and the clear identification of a point that can be strengthened. We agree that an explicit variance-partitioning analysis will make the central claim more rigorous and will add it in the revision.
Point-by-point responses
- Referee: [Abstract and §4 (Results)] The claim that context strategy explains more variance in AUC-ROC than TFM family is presented without an explicit variance-partitioning analysis (two-way ANOVA, mixed-effects model, or eta-squared decomposition that includes strategy × family interaction terms). The reported 3–4 AUC gap is therefore an informal range comparison whose robustness across the full design grid (datasets × strategies × sizes × models) cannot be verified from the given description.
Authors: We acknowledge that the manuscript currently supports the claim through consistent numerical gaps observed across the full grid rather than through a formal decomposition. In the revised version we will add a two-way ANOVA (with eta-squared effect sizes) on AUC-ROC that includes context strategy, TFM family, their interaction, and dataset as factors. The analysis will be performed on the complete experimental results (two datasets × seven strategies × five context sizes × five TFMs). We will report the relative variance explained by strategy versus family and will retain the 3–4 AUC-point improvement as a descriptive summary alongside the statistical results. This addition directly addresses the concern while preserving the original empirical findings.
Revision: yes
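For readers unfamiliar with eta-squared, here is a minimal sketch of the decomposition on a balanced strategy × family grid. It is a simplified stand-in rather than the analysis the authors describe: with one AUC per cell, the interaction cannot be separated from residual noise, so both are lumped into a "remainder" share. The AUC values and the names `tfm_a` and `tfm_b` are invented placeholders.

```python
from statistics import mean

def eta_squared(auc):
    """Eta-squared shares for a balanced grid with one AUC per cell.

    auc[s][f] is the AUC for strategy s and family f. Each share is a
    factor's sum of squares divided by the total sum of squares.
    """
    strategies = list(auc)
    families = list(next(iter(auc.values())))
    grand = mean(auc[s][f] for s in strategies for f in families)
    ss_total = sum((auc[s][f] - grand) ** 2 for s in strategies for f in families)
    ss_strategy = len(families) * sum(
        (mean(auc[s][f] for f in families) - grand) ** 2 for s in strategies)
    ss_family = len(strategies) * sum(
        (mean(auc[s][f] for s in strategies) - grand) ** 2 for f in families)
    return {
        "strategy": ss_strategy / ss_total,
        "family": ss_family / ss_total,
        "remainder": (ss_total - ss_strategy - ss_family) / ss_total,
    }

# Placeholder AUCs, invented for illustration only.
toy = {
    "uniform":  {"tfm_a": 0.72, "tfm_b": 0.73},
    "balanced": {"tfm_a": 0.76, "tfm_b": 0.77},
}
print(eta_squared(toy))
```

In this toy grid the strategy factor accounts for about 94% of the AUC variance and the family factor for about 6%. A full analysis with multiple seeds per cell would estimate the interaction and residual terms separately.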
Circularity Check
No significant circularity in empirical benchmarking study
Full rationale
The paper reports direct empirical comparisons of seven context-construction strategies and five TFM families on fixed public datasets (Home Credit, Lending Club), measuring AUC-ROC as an external performance metric. The central claim that context strategy explains more variance than TFM family rests on observed deltas (3–4 AUC points) from the experimental grid rather than on a self-definitional reduction, a fitted parameter renamed as a prediction, or a load-bearing theorem supported only by self-citation. No equations, ansatzes, or uniqueness claims appear that reduce by construction to the paper's own inputs. The study is self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Home Credit and Lending Club datasets are representative of real-world credit risk prediction problems with class imbalance and heterogeneous features.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
unclear: Relation between the paper passage and the cited Recognition theorem.
Paper passage: On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.