
arxiv: 2603.24215 · v3 · pith:RKTZIMFSnew · submitted 2026-03-25 · 💱 q-fin.ST · stat.AP

Adapting Altman's bankruptcy prediction model to the compositional data methodology

Pith reviewed 2026-05-21 09:41 UTC · model grok-4.3

classification 💱 q-fin.ST stat.AP
keywords bankruptcy prediction · compositional data · financial ratios · Altman model · logistic regression · random forests · machine learning · Spanish firms

The pith

Compositional log-ratios improve sensitivity in bankruptcy prediction over standard financial ratios.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper adapts Altman's classical bankruptcy prediction model and extensions to the compositional data methodology by replacing standard financial ratios with pairwise log-ratios. It applies this approach to data from over 31,000 Spanish wholesale trade firms using logistic regression, k-nearest neighbours and random forests, after downsampling the training set for balance. Results show compositional versions deliver higher sensitivity in identifying bankrupt firms than conventional ratios, with compositional random forests and logistic regression performing best overall. This matters because standard ratios commonly produce outliers, asymmetry and non-normality that distort predictions. A reader would see value in methods that handle these issues without outlier removal while improving detection of failures.

Core claim

The paper establishes that adapting Altman's bankruptcy prediction model and some of its extensions to the compositional methodology with pairwise log-ratios and three common statistical and machine learning tools leads to better predictive performance than standard financial ratios, particularly in sensitivity, on a large sample of Spanish wholesale trade firms.

What carries the argument

Pairwise log-ratios computed from compositional data methodology, which transform financial statement components to address statistical problems like extreme outliers and non-normality in bankruptcy models.

If this is right

  • Compositional random forests and compositional logistic regression achieve the strongest results among the tested methods.
  • Compositional approaches maintain advantages without removing any outliers from the data.
  • Sensitivity gains help detect more actual bankrupt cases compared with models using standard ratios.
  • The methodology can be directly applied to other financial prediction problems that rely on ratio data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pairwise log-ratio approach could be tested on bankruptcy data from other countries or industries to assess broader applicability.
  • Adding more recent machine learning classifiers might produce even larger gains in recall.
  • The interpretability of the resulting log-ratio coefficients could be examined for practical use by credit analysts.
  • This framework might extend to related tasks such as predicting financial distress rather than outright bankruptcy.

Load-bearing premise

The assumption that downsampling the training dataset to a 1:1 ratio of healthy to bankrupt firms produces unbiased performance estimates that generalise to the original imbalanced population.

What would settle it

Re-running the performance evaluation on the full imbalanced validation set without downsampling, or on an independent dataset from another sector or time period, to check whether the sensitivity gains remain.
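As an illustration of the re-evaluation described above, the following is a minimal sketch of how sensitivity and specificity could be computed on an imbalanced validation set. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives; 1 = bankrupt, 0 = healthy."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, fp, tn, fn


def sensitivity_specificity(y_true, y_pred):
    """Sensitivity (recall on bankrupt firms) and specificity (recall on healthy firms)."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy imbalanced validation set (illustrative numbers only): 4 bankrupt, 996 healthy.
y_true = [1, 1, 1, 1] + [0] * 996
# A hypothetical classifier that catches 3 of 4 failures and raises 100 false alarms.
y_pred = [1, 1, 1, 0] + [1] * 100 + [0] * 896

sens, spec = sensitivity_specificity(y_true, y_pred)
```

Because both metrics are computed within a class, they do not depend on prevalence directly, but the decision threshold that produces them does, which is why the check on the imbalanced set matters.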

Original abstract

Using standard financial ratios as variables in statistical analyses has been related to several serious problems, such as extreme outliers, asymmetry, non-normality, and non-linearity. The compositional-data methodology has been successfully applied to solve these problems and has always yielded substantially different results when compared to standard financial ratios. An under-researched area is the use of financial log-ratios computed with the compositional-data methodology to predict bankruptcy or the related terms of business default, insolvency or failure. Another under-researched area is the use of machine learning methods in combination with compositional log-ratios. The present article adapts the classical Altman bankruptcy prediction model and some of its extensions to the compositional methodology with pairwise log-ratios and three common statistical and machine learning tools: logistic regression models, k-nearest neighbours, and random forests, and compares the results with standard financial ratios. Data from the sector in the Spanish economy with the largest number of bankrupt firms according to the first two digits of the NACE code (46XX "wholesale trade, except of motor vehicles and motorcycles") were obtained from the Iberian Balance sheet Analysis System. The sample size (31,131 firms, of which 97 were bankrupt) was divided into a training and a validation dataset. The training dataset was downsampled to one healthy firm to each bankrupt firm. No outliers were removed. Focusing on predictive performance, the results show that compositional methods are better than standard ratios in terms of sensitivity (recall), with mixed results regarding specificity, compositional random forests and compositional logistic regression behaving the best.
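The downsampling step described in the abstract (one healthy firm for each bankrupt firm in the training set) can be sketched as follows. This is an assumption-laden illustration: the record layout, the `bankrupt` key, the seed and the helper name are hypothetical, and the abstract does not say how the healthy firms were drawn.

```python
import random


def downsample_one_to_one(records, label_key="bankrupt", seed=0):
    """Keep every bankrupt firm plus an equally sized random sample of healthy firms.

    Assumes simple random sampling without replacement; the paper's exact
    sampling scheme is not stated in the abstract.
    """
    bankrupt = [r for r in records if r[label_key] == 1]
    healthy = [r for r in records if r[label_key] == 0]
    if len(healthy) < len(bankrupt):
        raise ValueError("not enough healthy firms to match the bankrupt ones")
    rng = random.Random(seed)
    balanced = bankrupt + rng.sample(healthy, len(bankrupt))
    rng.shuffle(balanced)  # avoid an ordering that groups the two classes
    return balanced


# Toy training set (illustrative only): 5 bankrupt firms among 1,000.
toy_training = [{"firm_id": i, "bankrupt": 1 if i < 5 else 0} for i in range(1000)]
balanced = downsample_one_to_one(toy_training)
```

Only the training data would be passed through such a function; the validation set keeps its original class proportions.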

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper adapts Altman's classical bankruptcy prediction model to the compositional data methodology by replacing standard financial ratios with pairwise log-ratios. It applies logistic regression, k-nearest neighbors, and random forests to both representations and compares their predictive performance on a sample of 31,131 Spanish wholesale-trade firms (NACE 46XX) containing 97 bankrupt cases. The training set is downsampled to a 1:1 healthy-to-bankrupt ratio with no outlier removal; the abstract reports that compositional variants yield higher sensitivity (recall) than standard ratios, with mixed specificity results and the best overall performance from compositional random forests and compositional logistic regression.

Significance. If the central performance comparison survives correction for the training-test prior mismatch, the work would usefully extend compositional data analysis to an under-researched bankruptcy-prediction setting and demonstrate the practical value of log-ratio transformations when financial ratios exhibit the usual statistical pathologies. The explicit head-to-head design on a stated sample and the inclusion of both classical and machine-learning classifiers are positive features.

major comments (1)
  1. The training procedure downsamples the training set to a 1:1 healthy:bankrupt ratio while leaving the validation set at the original ~0.3 % prevalence. Default decision thresholds (0.5 for logistic regression and random forests) or unweighted voting are then used. This mismatch can inflate sensitivity on the imbalanced validation data without any intrinsic advantage from the pairwise log-ratio transformation. The claim that compositional methods improve sensitivity therefore rests on a potentially confounded comparison. (Abstract and training/validation split description.)
minor comments (2)
  1. The abstract states the headline result but supplies no numerical sensitivity or specificity values, no cross-validation scheme, and no description of how the specific pairwise log-ratios were selected from the available financial statements.
  2. It would be helpful to report proper scoring rules (e.g., Brier score or AUC) in addition to threshold-dependent metrics, or to show calibration plots, so that the comparison is less sensitive to the arbitrary 0.5 threshold.
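One way to examine the concerns above is to rescale the predicted probabilities from the balanced training prior to the population prior before thresholding, and to report a proper scoring rule alongside. The sketch below uses the standard prior-shift (odds) correction and the Brier score. It assumes the classifier outputs approximate posterior probabilities, and it uses the whole-sample prevalence 97/31,131 only as an illustrative target prior, since the validation-set prevalence is not given. The function names are hypothetical, and nothing here is claimed to be the authors' procedure.

```python
def adjust_for_prior(p_train, train_prior, target_prior):
    """Rescale a probability estimated under train_prior to target_prior.

    Multiplies the odds by the ratio of target odds to training odds.
    """
    if not 0.0 <= p_train <= 1.0:
        raise ValueError("p_train must lie in [0, 1]")
    pos = p_train * target_prior / train_prior
    neg = (1.0 - p_train) * (1.0 - target_prior) / (1.0 - train_prior)
    return pos / (pos + neg)


def brier_score(probs, labels):
    """Mean squared difference between predicted probabilities and 0/1 outcomes."""
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


# Whole-sample prevalence from the abstract, used here only as an illustration.
SAMPLE_PREVALENCE = 97 / 31131

# A score of 0.5 under a 1:1 training prior maps back to the population prior.
adjusted = adjust_for_prior(0.5, 0.5, SAMPLE_PREVALENCE)
```

After this correction, a firm scored 0.5 by a model trained on balanced data is assigned a probability equal to the base rate, which shows how far the default 0.5 threshold sits from a calibrated one.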

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the major comment regarding the training-validation split and decision thresholds below, providing a point-by-point response.

Point-by-point responses
  1. Referee: The training procedure downsamples the training set to a 1:1 healthy:bankrupt ratio while leaving the validation set at the original ~0.3 % prevalence. Default decision thresholds (0.5 for logistic regression and random forests) or unweighted voting are then used. This mismatch can inflate sensitivity on the imbalanced validation data without any intrinsic advantage from the pairwise log-ratio transformation. The claim that compositional methods improve sensitivity therefore rests on a potentially confounded comparison. (Abstract and training/validation split description.)

    Authors: We agree that the combination of downsampling only the training set and applying default thresholds (0.5) to the original imbalanced validation distribution can influence absolute sensitivity values. However, the identical protocol (1:1 downsampling, no outlier removal, and fixed thresholds) is applied to both the standard-ratio and compositional log-ratio representations. Consequently, any inflation or bias in sensitivity arising from the prior mismatch affects both approaches equally and does not confound the relative performance comparison. Differences in sensitivity can therefore still be attributed to the compositional transformation. We will revise the manuscript to explicitly note this design choice and clarify that conclusions concern comparative rather than absolute performance under matched experimental conditions.

    revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical comparison on held-out data

Full rationale

The paper is an empirical study that adapts Altman's Z-score ratios to compositional log-ratios, applies logistic regression, k-nearest neighbours and random forests, downsamples the training set to 1:1 balance, and evaluates predictive performance (sensitivity, specificity) on a separate untouched validation set from the original imbalanced population. No mathematical derivation or first-principles result is claimed; performance metrics are computed directly from model outputs on held-out data. No fitted parameter is renamed as a prediction, no self-citation chain supports a uniqueness theorem, and no ansatz or known result is smuggled in. The central claims rest on a standard held-out validation comparison rather than any self-referential reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the premise that compositional log-ratios remove the statistical defects of ordinary ratios and that the chosen classifiers remain valid after downsampling; these are standard domain assumptions rather than new postulates.

axioms (2)
  • domain assumption Compositional data methodology resolves extreme outliers, asymmetry, non-normality and non-linearity in financial ratios.
    Invoked in the opening paragraph of the abstract as the motivation for switching to log-ratios.
  • domain assumption Downsampling to a 1:1 healthy-to-bankrupt ratio in training does not distort sensitivity and specificity estimates on the validation set.
    Described in the abstract's data-preparation sentence without further justification.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    The D-1=6 plr below follow both rules and can be interpreted according to common financial concepts: asset tangibility: log(NCA/CA), current-asset turnover: log(OR/CA), margin: log(OR/OE), current ratio: log(CA/CL), debt maturity: log(NCL/CL), retained earnings over non-current liabilities: log(RE/NCL).
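    The six pairwise log-ratios quoted above can be computed with a short sketch such as the one below. The expansions of the abbreviations in the comments (for example OR as operating revenue and OE as operating expenses) are assumptions inferred from the ratio names, and the dictionary keys, function name and toy figures are hypothetical. The sketch simply rejects non-positive parts; how the paper treats zero or negative accounting figures is not stated in this excerpt.

```python
import math

# (label, numerator part, denominator part), following the quoted passage.
# Assumed abbreviations: NCA/CA non-current/current assets, OR operating revenue,
# OE operating expenses, CL/NCL current/non-current liabilities, RE retained earnings.
PAIRWISE_LOG_RATIOS = [
    ("asset_tangibility", "NCA", "CA"),
    ("current_asset_turnover", "OR", "CA"),
    ("margin", "OR", "OE"),
    ("current_ratio", "CA", "CL"),
    ("debt_maturity", "NCL", "CL"),
    ("retained_earnings_over_ncl", "RE", "NCL"),
]


def pairwise_log_ratios(parts):
    """Return the six log-ratios for one firm, given strictly positive parts."""
    ratios = {}
    for label, num, den in PAIRWISE_LOG_RATIOS:
        if parts[num] <= 0 or parts[den] <= 0:
            raise ValueError(f"log({num}/{den}) needs strictly positive parts")
        ratios[label] = math.log(parts[num] / parts[den])
    return ratios


# Toy firm with made-up figures, for illustration only.
toy_firm = {"NCA": 400.0, "CA": 600.0, "OR": 1200.0, "OE": 1100.0,
            "CL": 300.0, "NCL": 200.0, "RE": 150.0}
ratios = pairwise_log_ratios(toy_firm)
```

    Each log-ratio is symmetric around zero (swapping numerator and denominator only flips the sign) and is unchanged when every part is multiplied by the same constant.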

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.