pith. sign in

arxiv: 2605.10896 · v2 · submitted 2026-05-11 · 💻 cs.LG

V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords corporate bankruptcy predictiontabular foundation modelsTabPFNfinancial distressbenchmark datasetimbalanced classificationmulti-horizon forecasting
0
0 comments X

The pith

Finetuned TabPFN matches or exceeds gradient boosting on bankruptcy prediction at longer horizons on a new large benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates V4FinBench, a public dataset of more than one million company-year records from Central European economies that includes 131 features, six forecast horizons, and a distress label combining solvency, profitability, and liquidity signals. It evaluates standard methods, a finetuned tabular foundation model, and a finetuned LLM under realistic class imbalance. The results show that imbalance-aware finetuning lets TabPFN reach or surpass gradient boosting performance on F1 and ROC-AUC especially at longer horizons, while the LLM trails, and the same finetuned TabPFN also improves results on an external US dataset.

Core claim

V4FinBench supplies over one million company-year observations from the Visegrad Group economies (2006-2021) together with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion. Reference evaluations establish that imbalance-aware finetuning of TabPFN produces F1-scores and ROC-AUC values that match or exceed those of gradient boosting at longer horizons, whereas QLoRA-finetuned Llama-3-8B falls behind gradient boosting at every horizon with the gap increasing beyond the shortest one. The V4FinBench-finetuned TabPFN checkpoint further improves over the unfine-tuned model on the separate American Bankruptcy Dataset, indicating that the training,

What carries the argument

The V4FinBench dataset and its composite distress label, used for imbalance-aware finetuning of the TabPFN tabular foundation model.

If this is right

  • TabPFN with targeted finetuning becomes a practical option for multi-horizon financial distress tasks.
  • Gradient boosting remains competitive but is no longer clearly superior once tabular foundation models are adapted.
  • LLM-based approaches need substantial additional work to match specialized tabular methods on this type of imbalanced tabular data.
  • Regional financial benchmarks can produce checkpoints that improve performance on data from other jurisdictions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Releasing larger public financial datasets could speed testing of whether similar gains appear in related tasks such as credit-risk scoring.
  • The composite distress definition may prove useful as a standard label for other multi-factor corporate-health predictions.
  • Hybrid systems that route tabular data to foundation models before LLM reasoning steps could be tested on the same benchmark.

Load-bearing premise

The distress labels defined jointly from solvency, profitability, and liquidity deterioration in the V4 data reflect patterns that hold outside the specific economies and years studied.

What would settle it

No performance gain on the American Bankruptcy Dataset (or any other external dataset) when using the V4FinBench-finetuned TabPFN checkpoint versus the vanilla TabPFN would falsify the claim of transferable distress structure.

Figures

Figures reproduced from arXiv: 2605.10896 by Anna Poberezhna, Julia Farganus, Maciej Zi\k{e}ba, Marcin Kostrzewa, Micha{\l} Furga{\l}a, Oleksii Furman, Roman Furman, Sebastian Tomczak.

Figure 1
Figure 1. Figure 1: V4FinBench overview. Raw EMIS financial statements and company metadata are trans [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: TabPFN context construction under severe class imbalance. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: TabPFN context-construction ablation across prediction horizons. Prototype undersampling [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Finetuned TabPFN (prototype undersampling) against XGBoost and representative standard [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Transfer of V4FinBench-finetuned TabPFN to the American Bankruptcy Dataset [ [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
read the original abstract

Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegr\`ad Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces V4FinBench, a public benchmark of over one million company-year records from Visegrád Group economies (2006-2021) with 131 features, six prediction horizons, and a composite distress label capturing solvency, profitability, and liquidity issues. It reports reference evaluations of tabular baselines, imbalance-aware finetuned TabPFN, and QLoRA-finetuned Llama-3-8B, claiming that finetuned TabPFN matches or exceeds gradient boosting at longer horizons on F1 and ROC-AUC, that Llama-3-8B underperforms especially beyond the immediate horizon, and that the V4-finetuned TabPFN checkpoint improves over vanilla TabPFN on the external American Bankruptcy Dataset.

Significance. If the empirical comparisons hold, the work supplies a much-needed large public resource for evaluating tabular foundation models and LLMs on a realistic, severely imbalanced financial prediction task. The positive transfer result on the external dataset and the competitive long-horizon performance of finetuned TabPFN are notable strengths that could guide further development of foundation-model approaches in corporate-finance applications.

major comments (1)
  1. [Abstract] Abstract: the description of data-cleaning rules, exact construction of the composite positive label, and hyperparameter-search protocol is too brief to allow full assessment of possible selection effects or reproducibility of the reported positive rates (0.19–0.36 %).
minor comments (2)
  1. The manuscript should include a dedicated reproducibility section or appendix that lists the precise preprocessing steps, label thresholds, and search ranges used for all methods.
  2. Figure captions and table footnotes should explicitly state the number of runs or seeds underlying the reported means and standard deviations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the careful reading and the constructive suggestion regarding the abstract. We agree that the original abstract was too concise on the data-cleaning rules, composite label construction, and hyperparameter protocol, which could hinder reproducibility assessment. We have revised the abstract to incorporate these details while respecting length limits. The point-by-point response follows.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the description of data-cleaning rules, exact construction of the composite positive label, and hyperparameter-search protocol is too brief to allow full assessment of possible selection effects or reproducibility of the reported positive rates (0.19–0.36 %).

    Authors: We agree with the referee that the abstract provided insufficient detail on these elements. In the revised manuscript we have expanded the abstract to state: (i) data-cleaning rules consist of dropping observations with missing values in any of the 131 core financial ratios or with extreme outliers beyond three standard deviations in leverage or profitability variables; (ii) the composite positive label is triggered when a firm meets at least one of three conditions in the target year—solvency ratio below 0.5, negative net income for two consecutive years, or current ratio below 0.8; and (iii) hyperparameter search for all models used a 5-fold temporal cross-validation grid over learning rate, batch size, and (for TabPFN) the number of ensemble members. These additions preserve the abstract’s brevity while enabling readers to evaluate selection effects and to reproduce the reported positive rates of 0.19–0.36 %. Full algorithmic specifications and exact threshold values remain in Section 3.2 of the main text. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical benchmarking study that introduces the V4FinBench dataset and reports direct performance measurements (F1, ROC-AUC) of models including TabPFN, gradient boosting, and Llama-3-8B on held-out test splits plus an external American Bankruptcy Dataset transfer evaluation. All reported results follow standard train/test protocols with no fitted parameters redefined as predictions, no self-definitional equations, and no load-bearing self-citations or ansatzes that reduce the central claims to their own inputs. The composite distress labels and V4 distribution are explicitly tested for transfer rather than assumed by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is an empirical benchmark release; its claims rest on the assumption that the chosen distress criterion and V4 sample are representative of general corporate failure dynamics, with no new theoretical entities or fitted constants introduced by the authors themselves.

axioms (1)
  • domain assumption The composite distress criterion (solvency + profitability + liquidity deterioration) produces valid positive labels for bankruptcy prediction
    This labeling rule is used to define the rare positive class across all horizons.

pith-pipeline@v0.9.0 · 5651 in / 1430 out tokens · 51608 ms · 2026-05-14T20:40:56.369157+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

32 extracted references · 32 canonical work pages

  1. [1]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md

  2. [2]

    Benchmarking machine learning models to predict corporate bankruptcy.Journal of Credit Risk, 2023

    Emmanuel Alanis, Sudheer Chava, and Agam Shah. Benchmarking machine learning models to predict corporate bankruptcy.Journal of Credit Risk, 2023. ISSN 1755-9723. doi: 10.21314/ jcr.2023.002. URLhttp://dx.doi.org/10.21314/JCR.2023.002

  3. [3]

    Edward I. Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy.The Journal of Finance, 23(4):589–609, 1968

  4. [4]

    FinBERT: Financial sentiment analysis with pre-trained language models, 2019

    Dogu Tan Araci. FinBERT: Financial sentiment analysis with pre-trained language models, 2019

  5. [5]

    Machine learning models and bankruptcy prediction.Expert Systems with Applications, 83:405–417, 2017

    Flavio Barboza, Herbert Kimura, and Edward Altman. Machine learning models and bankruptcy prediction.Expert Systems with Applications, 83:405–417, 2017. doi: 10.1016/j.eswa.2017.04. 006

  6. [6]

    Chen and C

    Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 785–794, New York, NY , USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939785. URL https://doi.org/10.1145/ 2939672.2939785

  7. [7]

    Qlora: efficient finetuning of quantized llms

    Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: efficient finetuning of quantized llms. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc

  8. [8]

    Artificial neural network and decision tree- based modelling of non-prosperity of companies.Equilibrium

    Marek Durica, Jaroslav Frnda, and Lucia Svabova. Artificial neural network and decision tree- based modelling of non-prosperity of companies.Equilibrium. Quarterly Journal of Economics and Economic Policy, 18(4):1105–1131, 2023

  9. [9]

    Artificial intelligence in predicting the bankruptcy of non-financial corporations.Oeconomia Copernicana, 13(4):1215–1251, 2022

    Beata Gavurova, Sylvia Jencova, Radovan Bacik, Marta Miskufova, and Stanislav Letkovsky. Artificial intelligence in predicting the bankruptcy of non-financial corporations.Oeconomia Copernicana, 13(4):1215–1251, 2022

  10. [10]

    Corporate failure prediction models: A literature review and an empirical study of V4 firms.Equilibrium

    Andrzej Geise, Mariola Piłatowska, and Aneta Wlodarczyk. Corporate failure prediction models: A literature review and an empirical study of V4 firms.Equilibrium. Quarterly Journal of Economics and Economic Policy, 16(3):593–625, 2021

  11. [11]

    Tabllm: Few-shot classification of tabular data with large language models

    Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581. PMLR, 2023

  12. [12]

    Accurate predictions on small data with a tab- ular foundation model.Nature, 637(8045):319–326, 2025

    Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025. doi: 10.1038/s41586-024-08328-6

  13. [13]

    Lightgbm: a highly efficient gradient boosting decision tree

    Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: a highly efficient gradient boosting decision tree. InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 3149–3157, Red Hook, NY , USA, 2017. Curran Associates Inc. ISBN 9781510860964

  14. [14]

    Are foundation models useful for bankruptcy prediction?, 2025

    Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, and Maciej Zi˛ eba. Are foundation models useful for bankruptcy prediction?, 2025

  15. [15]

    Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study.European Journal of Operational Research, 252(2):561–572, 2016

    Deron Liang, Chih-Chuan Lu, Chih-Fong Tsai, and Guan-An Shih. Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study.European Journal of Operational Research, 252(2):561–572, 2016. doi: 10.1016/j.ejor.2016.01.012. 10

  16. [16]

    Pardalos, and Agostino Poggi

    Gianfranco Lombardo, Mattia Pellegrino, George Adosoglou, Stefano Cagnoni, Panos M. Pardalos, and Agostino Poggi. Machine learning for bankruptcy prediction in the american stock market: Dataset and benchmarks.Future Internet, 14(8), 2022. ISSN 1999-5903. doi: 10.3390/fi14080244. URLhttps://www.mdpi.com/1999-5903/14/8/244

  17. [17]

    Deep learning models for bankruptcy prediction using textual disclosures.European Journal of Operational Research, 274(2): 743–758, 2019

    Feng Mai, Shaonan Tian, Chihoon Lee, and Ling Ma. Deep learning models for bankruptcy prediction using textual disclosures.European Journal of Operational Research, 274(2): 743–758, 2019. doi: 10.1016/j.ejor.2018.10.024

  18. [18]

    Predicting distresses using deep learning of text segments in annual reports, 2019

    Rastin Matin, Casper Hansen, Christian Hansen, and Pia Mølgaard. Predicting distresses using deep learning of text segments in annual reports, 2019

  19. [19]

    James A. Ohlson. Financial ratios and the probabilistic prediction of bankruptcy.Journal of Accounting Research, 18(1):109–131, 1980

  20. [20]

    Platt and Marjorie B

    Harlan D. Platt and Marjorie B. Platt. Understanding differences between financial distress and bankruptcy.Review of Applied Economics, 2(2):141–157, 2006

  21. [21]

    Catboost: unbiased boosting with categorical features

    Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. InProceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 6639–6649, Red Hook, NY , USA, 2018. Curran Associates Inc

  22. [22]

    Selection of over time stability ratios using machine learning techniques.Operations Research and Decisions, 36(2):1–25, 2026

    Sebastian Klaudiusz Tomczak and Aleksander Denisiuk. Selection of over time stability ratios using machine learning techniques.Operations Research and Decisions, 36(2):1–25, 2026. doi: 10.37190/ord/215251

  23. [23]

    Sebastian Klaudiusz Tomczak, Michal Karas, Tamas Kristof, Lucia Duricova, and Marek Durica. Identification of key indicators, model development, and validation methods for bankruptcy prediction: A systematic review for the visegrad group.Forum Scientiae Oeconomia, 13(3): 156–196, 2025

  24. [24]

    Bankruptcy prediction in the post-pandemic period: A case study of visegrad group countries.Oeconomia Copernicana, 14 (1):253–293, 2023

    Katarina Valaskova, Dominika Gajdosikova, and Jaroslav Belas. Bankruptcy prediction in the post-pandemic period: A case study of visegrad group countries.Oeconomia Copernicana, 14 (1):253–293, 2023

  25. [25]

    Datasets for advanced bankruptcy prediction: A survey and taxonomy, 2024

    Xinlin Wang, Zsófia Kräussl, and Mats Brorsson. Datasets for advanced bankruptcy prediction: A survey and taxonomy, 2024

  26. [26]

    BloombergGPT: A large language model for finance, 2023

    Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. BloombergGPT: A large language model for finance, 2023

  27. [27]

    PIXIU: A large language model, instruction data and evaluation benchmark for finance, 2023

    Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. PIXIU: A large language model, instruction data and evaluation benchmark for finance, 2023

  28. [28]

    FinGPT: Open-source financial large language models, 2023

    Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. FinGPT: Open-source financial large language models, 2023

  29. [29]

    FinBERT: A pretrained language model for financial communications

    Yi Yang, Mark Christopher Siy Uy, and Allen Huang. FinBERT: A pretrained language model for financial communications. InProceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), 2020

  30. [30]

    InvestLM: A large language model for investment using financial domain instruction tuning, 2023

    Yi Yang, Yixuan Tang, and Kar Yan Tam. InvestLM: A large language model for investment using financial domain instruction tuning, 2023

  31. [31]

    enterprise in crisis

    Maciej Zi˛ eba, Sebastian K. Tomczak, and Jakub M. Tomczak. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction.Expert Systems with Applications, 58:93–101, 2016. doi: 10.1016/j.eswa.2016.04.001. Polish Companies Bankruptcy dataset, UCI Machine Learning Repository. 11 Table 3: TabPFN finetuning configuration. ...

  32. [32]

    Guidelines: • The answer [N/A] means that the paper does not involve crowdsourcing nor research with human subjects

    Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...