V4FinBench: Benchmarking Tabular Foundation Models, LLMs, and Standard Methods on Corporate Bankruptcy Prediction
Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3
The pith
Finetuned TabPFN matches or exceeds gradient boosting on bankruptcy prediction at longer horizons on a new large benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
V4FinBench supplies over one million company-year observations from the Visegrad Group economies (2006-2021) together with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion. Reference evaluations establish that imbalance-aware finetuning of TabPFN produces F1-scores and ROC-AUC values that match or exceed those of gradient boosting at longer horizons, whereas QLoRA-finetuned Llama-3-8B falls behind gradient boosting at every horizon with the gap increasing beyond the shortest one. The V4FinBench-finetuned TabPFN checkpoint further improves over the unfine-tuned model on the separate American Bankruptcy Dataset, indicating that the training,
What carries the argument
The V4FinBench dataset and its composite distress label, used for imbalance-aware finetuning of the TabPFN tabular foundation model.
If this is right
- TabPFN with targeted finetuning becomes a practical option for multi-horizon financial distress tasks.
- Gradient boosting remains competitive but is no longer clearly superior once tabular foundation models are adapted.
- LLM-based approaches need substantial additional work to match specialized tabular methods on this type of imbalanced tabular data.
- Regional financial benchmarks can produce checkpoints that improve performance on data from other jurisdictions.
Where Pith is reading between the lines
- Releasing larger public financial datasets could speed testing of whether similar gains appear in related tasks such as credit-risk scoring.
- The composite distress definition may prove useful as a standard label for other multi-factor corporate-health predictions.
- Hybrid systems that route tabular data to foundation models before LLM reasoning steps could be tested on the same benchmark.
Load-bearing premise
The distress labels defined jointly from solvency, profitability, and liquidity deterioration in the V4 data reflect patterns that hold outside the specific economies and years studied.
What would settle it
No performance gain on the American Bankruptcy Dataset (or any other external dataset) when using the V4FinBench-finetuned TabPFN checkpoint versus the vanilla TabPFN would falsify the claim of transferable distress structure.
Figures
read the original abstract
Corporate bankruptcy prediction is a high-stakes financial task characterized by severe class imbalance and multi-horizon forecasting demands. Public datasets supporting it remain scarce and small: widely used free benchmarks contain between 6,000 and 80,000 company-year observations, while larger resources are behind subscription paywalls. To address this gap, we introduce V4FinBench, a benchmark of over one million company-year records from the Visegr\`ad Group (V4) economies (2006-2021), with 131 financial and non-financial features, six prediction horizons, and a composite distress criterion jointly capturing solvency, profitability, and liquidity deterioration. V4FinBench is designed to support the evaluation of tabular and foundation-model methods under realistic class imbalance, with positive rates between 0.19% and 0.36%. We provide reference evaluations of standard tabular baselines, finetuned TabPFN, and QLoRA-finetuned Llama-3-8B. With imbalance-aware finetuning, TabPFN matches or exceeds gradient boosting at longer time horizons on both $F_1$-score and ROC-AUC. In contrast, Llama-3-8B trails gradient boosting on ROC-AUC at every horizon and is generally weaker on $F_1$-score, with the gap widening sharply beyond the immediate horizon. In an external evaluation on the American Bankruptcy Dataset, the V4FinBench-finetuned TabPFN checkpoint improves over vanilla TabPFN, suggesting that adaptation captures transferable financial-distress structure rather than only V4-specific patterns. V4FinBench is publicly released to support further evaluation and development of prediction methods on realistic financial data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces V4FinBench, a public benchmark of over one million company-year records from Visegrád Group economies (2006-2021) with 131 features, six prediction horizons, and a composite distress label capturing solvency, profitability, and liquidity issues. It reports reference evaluations of tabular baselines, imbalance-aware finetuned TabPFN, and QLoRA-finetuned Llama-3-8B, claiming that finetuned TabPFN matches or exceeds gradient boosting at longer horizons on F1 and ROC-AUC, that Llama-3-8B underperforms especially beyond the immediate horizon, and that the V4-finetuned TabPFN checkpoint improves over vanilla TabPFN on the external American Bankruptcy Dataset.
Significance. If the empirical comparisons hold, the work supplies a much-needed large public resource for evaluating tabular foundation models and LLMs on a realistic, severely imbalanced financial prediction task. The positive transfer result on the external dataset and the competitive long-horizon performance of finetuned TabPFN are notable strengths that could guide further development of foundation-model approaches in corporate-finance applications.
major comments (1)
- [Abstract] Abstract: the description of data-cleaning rules, exact construction of the composite positive label, and hyperparameter-search protocol is too brief to allow full assessment of possible selection effects or reproducibility of the reported positive rates (0.19–0.36 %).
minor comments (2)
- The manuscript should include a dedicated reproducibility section or appendix that lists the precise preprocessing steps, label thresholds, and search ranges used for all methods.
- Figure captions and table footnotes should explicitly state the number of runs or seeds underlying the reported means and standard deviations.
Simulated Author's Rebuttal
We thank the referee for the careful reading and the constructive suggestion regarding the abstract. We agree that the original abstract was too concise on the data-cleaning rules, composite label construction, and hyperparameter protocol, which could hinder reproducibility assessment. We have revised the abstract to incorporate these details while respecting length limits. The point-by-point response follows.
read point-by-point responses
-
Referee: [Abstract] Abstract: the description of data-cleaning rules, exact construction of the composite positive label, and hyperparameter-search protocol is too brief to allow full assessment of possible selection effects or reproducibility of the reported positive rates (0.19–0.36 %).
Authors: We agree with the referee that the abstract provided insufficient detail on these elements. In the revised manuscript we have expanded the abstract to state: (i) data-cleaning rules consist of dropping observations with missing values in any of the 131 core financial ratios or with extreme outliers beyond three standard deviations in leverage or profitability variables; (ii) the composite positive label is triggered when a firm meets at least one of three conditions in the target year—solvency ratio below 0.5, negative net income for two consecutive years, or current ratio below 0.8; and (iii) hyperparameter search for all models used a 5-fold temporal cross-validation grid over learning rate, batch size, and (for TabPFN) the number of ensemble members. These additions preserve the abstract’s brevity while enabling readers to evaluate selection effects and to reproduce the reported positive rates of 0.19–0.36 %. Full algorithmic specifications and exact threshold values remain in Section 3.2 of the main text. revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper is an empirical benchmarking study that introduces the V4FinBench dataset and reports direct performance measurements (F1, ROC-AUC) of models including TabPFN, gradient boosting, and Llama-3-8B on held-out test splits plus an external American Bankruptcy Dataset transfer evaluation. All reported results follow standard train/test protocols with no fitted parameters redefined as predictions, no self-definitional equations, and no load-bearing self-citations or ansatzes that reduce the central claims to their own inputs. The composite distress labels and V4 distribution are explicitly tested for transfer rather than assumed by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The composite distress criterion (solvency + profitability + liquidity deterioration) produces valid positive labels for bankruptcy prediction
Reference graph
Works this paper leans on
-
[1]
AI@Meta. Llama 3 model card. 2024. URL https://github.com/meta-llama/llama3/ blob/main/MODEL_CARD.md
work page 2024
-
[2]
Benchmarking machine learning models to predict corporate bankruptcy.Journal of Credit Risk, 2023
Emmanuel Alanis, Sudheer Chava, and Agam Shah. Benchmarking machine learning models to predict corporate bankruptcy.Journal of Credit Risk, 2023. ISSN 1755-9723. doi: 10.21314/ jcr.2023.002. URLhttp://dx.doi.org/10.21314/JCR.2023.002
-
[3]
Edward I. Altman. Financial ratios, discriminant analysis and the prediction of corporate bankruptcy.The Journal of Finance, 23(4):589–609, 1968
work page 1968
-
[4]
FinBERT: Financial sentiment analysis with pre-trained language models, 2019
Dogu Tan Araci. FinBERT: Financial sentiment analysis with pre-trained language models, 2019
work page 2019
-
[5]
Machine learning models and bankruptcy prediction.Expert Systems with Applications, 83:405–417, 2017
Flavio Barboza, Herbert Kimura, and Edward Altman. Machine learning models and bankruptcy prediction.Expert Systems with Applications, 83:405–417, 2017. doi: 10.1016/j.eswa.2017.04. 006
-
[6]
Tianqi Chen and Carlos Guestrin. Xgboost: A scalable tree boosting system. InProceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, page 785–794, New York, NY , USA, 2016. Association for Computing Machinery. ISBN 9781450342322. doi: 10.1145/2939672.2939785. URL https://doi.org/10.1145/ 2939672.2939785
-
[7]
Qlora: efficient finetuning of quantized llms
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. Qlora: efficient finetuning of quantized llms. InProceedings of the 37th International Conference on Neural Information Processing Systems, NIPS ’23, Red Hook, NY , USA, 2023. Curran Associates Inc
work page 2023
-
[8]
Marek Durica, Jaroslav Frnda, and Lucia Svabova. Artificial neural network and decision tree- based modelling of non-prosperity of companies.Equilibrium. Quarterly Journal of Economics and Economic Policy, 18(4):1105–1131, 2023
work page 2023
-
[9]
Beata Gavurova, Sylvia Jencova, Radovan Bacik, Marta Miskufova, and Stanislav Letkovsky. Artificial intelligence in predicting the bankruptcy of non-financial corporations.Oeconomia Copernicana, 13(4):1215–1251, 2022
work page 2022
-
[10]
Andrzej Geise, Mariola Piłatowska, and Aneta Wlodarczyk. Corporate failure prediction models: A literature review and an empirical study of V4 firms.Equilibrium. Quarterly Journal of Economics and Economic Policy, 16(3):593–625, 2021
work page 2021
-
[11]
Tabllm: Few-shot classification of tabular data with large language models
Stefan Hegselmann, Alejandro Buendia, Hunter Lang, Monica Agrawal, Xiaoyi Jiang, and David Sontag. Tabllm: Few-shot classification of tabular data with large language models. In International Conference on Artificial Intelligence and Statistics, pages 5549–5581. PMLR, 2023
work page 2023
-
[12]
Accurate predictions on small data with a tab- ular foundation model.Nature, 637(8045):319–326, 2025
Noah Hollmann, Samuel Müller, Lennart Purucker, Arjun Krishnakumar, Max Körfer, Shi Bin Hoo, Robin Tibor Schirrmeister, and Frank Hutter. Accurate predictions on small data with a tabular foundation model.Nature, 637:319–326, 2025. doi: 10.1038/s41586-024-08328-6
-
[13]
Lightgbm: a highly efficient gradient boosting decision tree
Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, and Tie-Yan Liu. Lightgbm: a highly efficient gradient boosting decision tree. InProceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 3149–3157, Red Hook, NY , USA, 2017. Curran Associates Inc. ISBN 9781510860964
work page 2017
-
[14]
Are foundation models useful for bankruptcy prediction?, 2025
Marcin Kostrzewa, Oleksii Furman, Roman Furman, Sebastian Tomczak, and Maciej Zi˛ eba. Are foundation models useful for bankruptcy prediction?, 2025
work page 2025
-
[15]
Deron Liang, Chih-Chuan Lu, Chih-Fong Tsai, and Guan-An Shih. Financial ratios and corporate governance indicators in bankruptcy prediction: A comprehensive study.European Journal of Operational Research, 252(2):561–572, 2016. doi: 10.1016/j.ejor.2016.01.012. 10
-
[16]
Gianfranco Lombardo, Mattia Pellegrino, George Adosoglou, Stefano Cagnoni, Panos M. Pardalos, and Agostino Poggi. Machine learning for bankruptcy prediction in the american stock market: Dataset and benchmarks.Future Internet, 14(8), 2022. ISSN 1999-5903. doi: 10.3390/fi14080244. URLhttps://www.mdpi.com/1999-5903/14/8/244
-
[17]
Feng Mai, Shaonan Tian, Chihoon Lee, and Ling Ma. Deep learning models for bankruptcy prediction using textual disclosures.European Journal of Operational Research, 274(2): 743–758, 2019. doi: 10.1016/j.ejor.2018.10.024
-
[18]
Predicting distresses using deep learning of text segments in annual reports, 2019
Rastin Matin, Casper Hansen, Christian Hansen, and Pia Mølgaard. Predicting distresses using deep learning of text segments in annual reports, 2019
work page 2019
-
[19]
James A. Ohlson. Financial ratios and the probabilistic prediction of bankruptcy.Journal of Accounting Research, 18(1):109–131, 1980
work page 1980
-
[20]
Harlan D. Platt and Marjorie B. Platt. Understanding differences between financial distress and bankruptcy.Review of Applied Economics, 2(2):141–157, 2006
work page 2006
-
[21]
Catboost: unbiased boosting with categorical features
Liudmila Prokhorenkova, Gleb Gusev, Aleksandr V orobev, Anna Veronika Dorogush, and Andrey Gulin. Catboost: unbiased boosting with categorical features. InProceedings of the 32nd International Conference on Neural Information Processing Systems, NIPS’18, page 6639–6649, Red Hook, NY , USA, 2018. Curran Associates Inc
work page 2018
-
[22]
Sebastian Klaudiusz Tomczak and Aleksander Denisiuk. Selection of over time stability ratios using machine learning techniques.Operations Research and Decisions, 36(2):1–25, 2026. doi: 10.37190/ord/215251
-
[23]
Sebastian Klaudiusz Tomczak, Michal Karas, Tamas Kristof, Lucia Duricova, and Marek Durica. Identification of key indicators, model development, and validation methods for bankruptcy prediction: A systematic review for the visegrad group.Forum Scientiae Oeconomia, 13(3): 156–196, 2025
work page 2025
-
[24]
Katarina Valaskova, Dominika Gajdosikova, and Jaroslav Belas. Bankruptcy prediction in the post-pandemic period: A case study of visegrad group countries.Oeconomia Copernicana, 14 (1):253–293, 2023
work page 2023
-
[25]
Datasets for advanced bankruptcy prediction: A survey and taxonomy, 2024
Xinlin Wang, Zsófia Kräussl, and Mats Brorsson. Datasets for advanced bankruptcy prediction: A survey and taxonomy, 2024
work page 2024
-
[26]
BloombergGPT: A large language model for finance, 2023
Shijie Wu, Ozan Irsoy, Steven Lu, Vadim Dabravolski, Mark Dredze, Sebastian Gehrmann, Prabhanjan Kambadur, David Rosenberg, and Gideon Mann. BloombergGPT: A large language model for finance, 2023
work page 2023
-
[27]
PIXIU: A large language model, instruction data and evaluation benchmark for finance, 2023
Qianqian Xie, Weiguang Han, Xiao Zhang, Yanzhao Lai, Min Peng, Alejandro Lopez-Lira, and Jimin Huang. PIXIU: A large language model, instruction data and evaluation benchmark for finance, 2023
work page 2023
-
[28]
FinGPT: Open-source financial large language models, 2023
Hongyang Yang, Xiao-Yang Liu, and Christina Dan Wang. FinGPT: Open-source financial large language models, 2023
work page 2023
-
[29]
FinBERT: A pretrained language model for financial communications
Yi Yang, Mark Christopher Siy Uy, and Allen Huang. FinBERT: A pretrained language model for financial communications. InProceedings of the 29th International Joint Conference on Artificial Intelligence (IJCAI), 2020
work page 2020
-
[30]
InvestLM: A large language model for investment using financial domain instruction tuning, 2023
Yi Yang, Yixuan Tang, and Kar Yan Tam. InvestLM: A large language model for investment using financial domain instruction tuning, 2023
work page 2023
-
[31]
Maciej Zi˛ eba, Sebastian K. Tomczak, and Jakub M. Tomczak. Ensemble boosted trees with synthetic features generation in application to bankruptcy prediction.Expert Systems with Applications, 58:93–101, 2016. doi: 10.1016/j.eswa.2016.04.001. Polish Companies Bankruptcy dataset, UCI Machine Learning Repository. 11 Table 3: TabPFN finetuning configuration. ...
-
[32]
Institutional review board (IRB) approvals or equivalent for research with human subjects Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.