arxiv: 2605.18635 · v1 · pith:IPQWRNGSnew · submitted 2026-05-18 · 💻 cs.LG · cs.AI

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Pith reviewed 2026-05-20 12:01 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords credit risk prediction · tabular foundation models · resampling strategies · class imbalance · in-context learning · AUC-ROC · context construction · credit default prediction

The pith

For tabular foundation models in credit risk, the way context is resampled matters more than which model family is chosen.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper benchmarks four classical models against five tabular foundation models on credit default prediction, a setting marked by severe class imbalance and heterogeneous features. It systematically varies seven context-construction strategies and context sizes from one thousand to fifty thousand examples on the Home Credit and Lending Club datasets. The central finding is that differences in how the in-context examples are selected explain more variation in AUC-ROC than the spread across foundation model families. This matters because it shifts attention from hunting for newer architectures to a controllable deployment choice that can match full-data baselines while improving recall on the rare default class.

Core claim

On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add three to four AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of five thousand to ten thousand examples, the strongest TFMs reach the AUC of classical baselines trained on the full data while also recovering meaningful default-class recall that default-threshold GBDTs do not. The authors frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.

What carries the argument

The seven context-construction strategies (uniform, balanced, and hybrid sampling variants) that determine which examples enter the in-context window for tabular foundation models.
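
As an illustration of what such strategies might look like in code, here is a minimal Python sketch. This page does not give the paper's definitions of its seven strategies, so the three modes below, the function name `build_context`, and the `minority_share` parameter are all assumptions made for illustration, not the authors' implementation. The sketch samples row indices without replacement and caps the minority count at the number of defaults available.

```python
import random

def build_context(labels, size, strategy="uniform", minority_share=0.3, seed=0):
    """Pick row indices for the in-context window (hypothetical helper).

    labels: sequence of 0/1 targets, where 1 marks the rare default class.
    strategy: "uniform", "balanced" (50/50), or "hybrid" (assumed here to mean
    a fixed minority share between the base rate and 50%).
    """
    rng = random.Random(seed)
    idx = list(range(len(labels)))
    if size > len(idx):
        raise ValueError("context size exceeds available rows")
    if strategy == "uniform":
        return rng.sample(idx, size)
    if strategy == "balanced":
        share = 0.5
    elif strategy == "hybrid":
        share = minority_share
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    pos = [i for i in idx if labels[i] == 1]
    neg = [i for i in idx if labels[i] == 0]
    # Cap at the available defaults; the remainder is filled with non-defaults.
    n_pos = min(len(pos), round(size * share))
    n_neg = size - n_pos
    if n_neg > len(neg):
        raise ValueError("not enough majority rows to fill the context")
    chosen = rng.sample(pos, n_pos) + rng.sample(neg, n_neg)
    rng.shuffle(chosen)  # avoid presenting the context sorted by class
    return chosen

# Toy labels with an 8% default rate (made up, not from either dataset).
labels = [1] * 80 + [0] * 920
for name in ("uniform", "balanced", "hybrid"):
    ctx = build_context(labels, 100, name)
    print(name, sum(labels[i] for i in ctx), "defaults out of", len(ctx))
```

Sampling without replacement is a design choice of this sketch; oversampling or synthetic minority examples would be a different strategy with different trade-offs.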

If this is right

  • TFMs supplied with balanced contexts of 5K to 10K examples match the AUC of classical models trained on the entire dataset.
  • These TFMs recover default-class recall that gradient-boosted trees with default thresholds miss.
  • Context construction becomes the main controllable lever for deploying TFMs in imbalanced tabular credit tasks.
  • Performance differences from sampling choices exceed differences across TFM families on both Home Credit and Lending Club data.
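
The gap between AUC and default-threshold recall in the second bullet can be made concrete with a small Python sketch. The scores below are made up for illustration and do not come from the paper; `auc_roc` and `recall_at` are hypothetical helper names. The point is only that a model can rank every default above every non-default and still flag none of them at a 0.5 cutoff when its scores stay near a low base rate.

```python
def auc_roc(y_true, scores):
    """AUC as the fraction of (default, non-default) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # Ties count as half a correct pair.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def recall_at(y_true, scores, threshold=0.5):
    """Share of true defaults whose score reaches the threshold."""
    hits = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    return hits / sum(y_true)

# Toy scores: the two defaults rank highest, but every score is below 0.5.
y = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.02, 0.03, 0.05, 0.04, 0.06, 0.08, 0.10, 0.12, 0.30, 0.35]

print("AUC:", auc_roc(y, scores))                 # 1.0
print("recall @ 0.5:", recall_at(y, scores))      # 0.0
print("recall @ 0.2:", recall_at(y, scores, 0.2)) # 1.0
```

Because AUC depends only on ranking, comparing recall across models is meaningful only once the decision threshold is stated.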

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams deploying TFMs on imbalanced tabular problems may obtain larger gains by iterating on context resampling before testing additional model families.
  • The same emphasis on presentation could apply to other high-stakes imbalanced domains such as fraud detection where in-context learning is feasible.
  • Experiments that vary imbalance ratios or test context sizes beyond 50K would clarify whether the reported 5K-to-10K optimum generalizes.

Load-bearing premise

The seven tested context-construction strategies and sizes from 1K to 50K capture the dominant sensitivities of TFMs to input presentation without other untested factors such as feature encoding or latency constraints driving the observed differences.

What would settle it

A new TFM family that produces AUC-ROC gaps larger than the three-to-four-point spread from resampling strategies on the same two datasets and identical context setups would falsify the claim that context strategy dominates architecture choice.
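
One way to picture this test is a simple comparison of marginal spreads over a grid of AUC values. The Python sketch below is a hedged illustration: the grid holds placeholder numbers that are not results from the paper, the family names `tfm_a` and `tfm_b` are invented, and `marginal_spread` and `strategy_dominates` are hypothetical helpers. It also uses only two strategies, whereas the paper tests seven.

```python
from statistics import mean

def marginal_spread(grid, axis):
    """Max minus min of the per-level mean AUC.

    grid maps (strategy, family) -> AUC; axis 0 groups by strategy,
    axis 1 groups by family.
    """
    groups = {}
    for key, auc in grid.items():
        groups.setdefault(key[axis], []).append(auc)
    level_means = [mean(values) for values in groups.values()]
    return max(level_means) - min(level_means)

def strategy_dominates(grid):
    """True when the strategy spread exceeds the family spread."""
    return marginal_spread(grid, 0) > marginal_spread(grid, 1)

# Placeholder AUCs, invented purely to exercise the functions.
toy_grid = {
    ("uniform", "tfm_a"): 0.70,
    ("uniform", "tfm_b"): 0.71,
    ("balanced", "tfm_a"): 0.74,
    ("balanced", "tfm_b"): 0.75,
}

print("strategy spread:", round(marginal_spread(toy_grid, 0), 3))
print("family spread:", round(marginal_spread(toy_grid, 1), 3))
print("strategy dominates:", strategy_dominates(toy_grid))
```

A new model family would overturn the claim under this reading if adding its column to the grid made `strategy_dominates` return False on both datasets.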

Original abstract

Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript benchmarks four classical models against five tabular foundation models (TFMs) for credit default prediction on the Home Credit and Lending Club datasets. It systematically varies seven context-construction strategies (uniform, balanced, and hybrid sampling variants) and context sizes (1K–50K examples), reporting that context strategy accounts for more AUC-ROC variance than TFM family choice, with balanced/hybrid sampling improving performance by 3–4 AUC points over uniform sampling; at 5K–10K balanced context the best TFMs match full-data classical baselines while improving default-class recall.

Significance. If the central empirical claim is substantiated, the work would usefully redirect attention from TFM architecture search to context-construction practices for imbalanced tabular in-context learning. The large experimental grid on public datasets, direct comparison to classical baselines, and recovery of meaningful recall are concrete strengths that would support practical guidance for credit-risk deployment.

major comments (1)
  1. [Abstract and §4 (Results)] The claim that context strategy explains more variance in AUC-ROC than TFM family is presented without an explicit variance-partitioning analysis (two-way ANOVA, mixed-effects model, or eta-squared decomposition that includes strategy × family interaction terms). The reported 3–4 AUC gap is therefore an informal range comparison whose robustness across the full design grid (datasets × strategies × sizes × models) cannot be verified from the given description.
minor comments (2)
  1. [§3] The exact definitions and implementation details of the seven context-construction strategies (including how balanced and hybrid sampling are operationalized and how prompt formatting is controlled) should be expanded in §3 to allow full reproduction.
  2. [Figures and Tables] Error bars, number of random seeds, and any statistical tests for the reported AUC differences are not described in the abstract or visible results summary; these should be added to all figures and tables.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of our experimental design, the practical relevance for credit-risk applications, and the clear identification of a point that can be strengthened. We agree that an explicit variance-partitioning analysis will make the central claim more rigorous and will add it in the revision.

point-by-point responses
  1. Referee: [Abstract and §4 (Results)] The claim that context strategy explains more variance in AUC-ROC than TFM family is presented without an explicit variance-partitioning analysis (two-way ANOVA, mixed-effects model, or eta-squared decomposition that includes strategy × family interaction terms). The reported 3–4 AUC gap is therefore an informal range comparison whose robustness across the full design grid (datasets × strategies × sizes × models) cannot be verified from the given description.

    Authors: We acknowledge that the manuscript currently supports the claim through consistent numerical gaps observed across the full grid rather than through a formal decomposition. In the revised version we will add a two-way ANOVA (with eta-squared effect sizes) on AUC-ROC that includes context strategy, TFM family, their interaction, and dataset as factors. The analysis will be performed on the complete experimental results (two datasets × seven strategies × five context sizes × five TFMs). We will report the relative variance explained by strategy versus family and will retain the 3–4 AUC-point improvement as a descriptive summary alongside the statistical results. This addition directly addresses the concern while preserving the original empirical findings.

    Revision: yes
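
For readers unfamiliar with the decomposition the referee asks for, a minimal Python sketch of eta-squared on a balanced two-factor grid follows. It is a simplification under stated assumptions: it takes exactly one AUC per (strategy, family) cell, so interaction and residual variance cannot be separated, and it omits the dataset factor the authors say they will include. The function name `eta_squared` and the placeholder grid are illustrative only, and the numbers are not results from the paper.

```python
from statistics import mean

def eta_squared(grid):
    """Share of total AUC variance attributable to each factor.

    grid maps (strategy, family) -> AUC with exactly one value per cell.
    """
    strategies = sorted({s for s, _ in grid})
    families = sorted({f for _, f in grid})
    grand = mean(grid.values())
    ss_total = sum((v - grand) ** 2 for v in grid.values())
    if ss_total == 0:
        raise ValueError("no variance to partition")
    # Main-effect sums of squares for a balanced design.
    ss_strategy = len(families) * sum(
        (mean(grid[s, f] for f in families) - grand) ** 2 for s in strategies
    )
    ss_family = len(strategies) * sum(
        (mean(grid[s, f] for s in strategies) - grand) ** 2 for f in families
    )
    # Without replicate runs, interaction and noise are lumped together.
    ss_rest = ss_total - ss_strategy - ss_family
    return {
        "strategy": ss_strategy / ss_total,
        "family": ss_family / ss_total,
        "interaction_plus_residual": ss_rest / ss_total,
    }

# Placeholder AUCs, invented purely to exercise the function.
toy_grid = {
    ("uniform", "tfm_a"): 0.70,
    ("uniform", "tfm_b"): 0.71,
    ("balanced", "tfm_a"): 0.74,
    ("balanced", "tfm_b"): 0.75,
}

for factor, share in eta_squared(toy_grid).items():
    print(f"{factor}: {share:.3f}")
```

With several random seeds per cell, the interaction term could be estimated separately from noise, which is what a full two-way ANOVA with replication would report.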

Circularity Check

0 steps flagged

No significant circularity in empirical benchmarking study

full rationale

The paper reports direct empirical comparisons of seven context-construction strategies and five TFM families on fixed public datasets (Home Credit, Lending Club), measuring AUC-ROC as an external performance metric. The central claim that context strategy explains more variance than TFM family rests on observed deltas (3–4 AUC points) from the experimental grid rather than on a self-definitional reduction, a fitted parameter renamed as a prediction, or a load-bearing self-citation. No equations, ansatzes, or uniqueness claims appear that reduce by construction to the paper's own inputs. The study is self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper is an empirical benchmarking study that applies existing TFMs and classical resampling techniques to standard credit datasets; it therefore rests primarily on domain assumptions about dataset representativeness and standard machine-learning evaluation practices rather than new axioms or invented entities.

axioms (1)
  • domain assumption: Home Credit and Lending Club datasets are representative of real-world credit risk prediction problems with class imbalance and heterogeneous features.
    The paper selects these datasets as benchmarks without additional justification or sensitivity analysis in the abstract.

pith-pipeline@v0.9.0 · 5742 in / 1339 out tokens · 51161 ms · 2026-05-20T12:01:29.500574+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
