
arxiv: 2604.20742 · v1 · submitted 2026-04-22 · 💻 cs.SE

Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading

Pith reviewed 2026-05-09 23:39 UTC · model grok-4.3

classification 💻 cs.SE
keywords: software defect prediction · ROC curve · AUC · model evaluation · threshold analysis · true positive rate · false positive rate

The pith

A high AUC does not guarantee that a software defect prediction model outperforms random guessing at every threshold.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether ROC curves and AUC values reliably show that software defect prediction models are better than random. It finds that a model can have a high AUC while still producing worse true positive rates or false positive rates than random at some thresholds. The authors demonstrate this by marking specific threshold points on ROC curves and plotting the rates directly against the threshold. This matters because teams using these models to flag faulty code modules could reach incorrect conclusions about quality if they trust AUC alone.

Core claim

A high value of AUC does not guarantee that both the True Positive Rate and the False Positive Rate of a model are better than the random model's for all possible thresholds.

What carries the argument

Decorated ROC curves that mark the points for concrete threshold values, together with separate plots of True Positive Rate and False Positive Rate versus the threshold.
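As a rough illustration of what a decorated ROC curve needs, the sketch below computes the (FPR, TPR) point for each threshold 0.1, 0.2, ..., 0.9. It is not the authors' code: the function names, the toy data, and the convention that a module is flagged as faulty when its estimated fault-proneness is at least the threshold are all assumptions made for the example.

```python
def rates_at_threshold(scores, labels, t):
    """Return (TPR, FPR) when a module is flagged as faulty iff score >= t."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
    fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
    positives = sum(labels)
    negatives = len(labels) - positives
    return tp / positives, fp / negatives


def decorated_points(scores, labels, thresholds):
    """List of (threshold, FPR, TPR) triples: the points to mark on the ROC curve."""
    points = []
    for t in thresholds:
        tpr, fpr = rates_at_threshold(scores, labels, t)
        points.append((t, fpr, tpr))
    return points


# Hypothetical toy data (not from the paper): estimated fault-proneness and true labels.
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25, 0.15, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]
thresholds = [i / 10 for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9

for t, fpr, tpr in decorated_points(scores, labels, thresholds):
    print(f"t={t:.1f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
```

The same triples can feed either representation described above: plotting (FPR, TPR) with one marker per threshold gives the decorated ROC curve, while plotting TPR and FPR against the threshold gives the separate rate plots.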

If this is right

  • Commonly used ROC and AUC criteria can produce incorrect assessments of SDP model quality.
  • A model judged good by AUC may still classify faulty modules worse than random at some thresholds.
  • Evaluating only the aggregate AUC hides cases where the model is worse than random for one class.
  • Either decorated ROC curves or threshold-based rate plots are required to see all relevant performance aspects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams should inspect performance at the specific thresholds they will actually apply rather than trusting the single AUC number.
  • The same limitation likely appears in other fields that rely on ROC/AUC for binary classifiers with imbalanced outcomes.
  • New visualizations or metrics focused on practical thresholds could reduce the risk of selecting misleading models.

Load-bearing premise

That the diagonal line in ROC space is the right baseline for random performance and that models must be checked across every possible threshold rather than only the ones used in practice.
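A short derivation may help make the premise concrete. It assumes a random model whose score S is independent of the true class Y; the further assumption that S is uniform on [0, 1] is ours and may differ from the paper's exact definition of the random model.

```latex
% Assumption: the score S of the random model is independent of the true class Y.
\begin{align*}
\mathrm{TPR}(t) &= P(S \ge t \mid Y = 1) = P(S \ge t),\\
\mathrm{FPR}(t) &= P(S \ge t \mid Y = 0) = P(S \ge t),
\end{align*}
% Hence TPR(t) = FPR(t) for every t: each threshold maps to a point on the diagonal.
% With S uniform on [0, 1] (a further assumption):
\[
\mathrm{TPR}(t) = \mathrm{FPR}(t) = 1 - t .
\]
```

Under these assumptions every threshold of the random model corresponds to one specific point on the diagonal, which is what makes a threshold-by-threshold comparison possible.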

What would settle it

An SDP model with a high AUC for which either the true positive rate or the false positive rate is worse than the random baseline (a lower TPR or a higher FPR) at some threshold value that is actually used in deployment.
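A check of this kind could be sketched as follows. The baseline is an assumption, not the paper's stated definition: a random model with uniform scores in [0, 1] flags a module with probability 1 − t, so its expected TPR and FPR at threshold t are both 1 − t. The function name and the example numbers are hypothetical.

```python
def compare_with_random(tpr, fpr, t):
    """Compare one operating point with the assumed random baseline at threshold t.

    Assumption: the random model's expected TPR and FPR at threshold t are
    both 1 - t (uniform random scores in [0, 1]).
    """
    baseline = 1.0 - t
    return {
        "tpr_better": tpr > baseline,  # higher TPR is better
        "fpr_better": fpr < baseline,  # lower FPR is better
    }


# Hypothetical operating point at a deployment threshold of 0.4 (made-up numbers).
result = compare_with_random(tpr=0.95, fpr=0.70, t=0.4)
print(result)
```

Here the model beats the assumed baseline on TPR but not on FPR, which is the kind of mixed outcome a single AUC value cannot reveal.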

Figures

Figures reproduced from arXiv: 2604.20742 by Gabriele Rotoloni, Luigi Lavazza, Sandro Morasca.

Figure 1: The ROC curve of the random model. In the figures, points ◦, △, +, ×, ♦, ▽, ⊠, ∗, ⊞ are associated with thresholds 0.1, 0.2, ..., 0.9, respectively. The random model provides an easy-to-implement and inexpensive way to make predictions, which does not consider any feature of a module. This characteristic of the random model qualifies it as …
Figure 2: The ROC curve of the BLR model for the poi 3.0 project (AUC=0.803).
Figure 3: The ROC curve of the BLR model for the poi 3.0 project. Points representing specific values of the threshold are highlighted. … the BLR model has both greater TPR and smaller FPR than the random model. However, for values of 𝑡 ≤ 0.4 (e.g., points ◦, △, + and ×) the BLR model has worse FPR than the random model: it performs very well in classifying positive (faulty) modules, at the expense of a poor classific…
Figure 6: The ROC curve of the RF model for the ant 1.5 project (AUC=0.8). … is possible to see that for all the highlighted values of the threshold the model yields worse than random TPR
Figure 9: ROC curve of SVM model for ckjm. However, even such a model could provide sub-optimal performance, for some threshold values. As an example, take the SVM model obtained for project ckjm, whose ROC curve is in …
Figure 10: ROC curves of RF and BLR models for poi 2.5. Note that, when condition (2) holds, ROC curve A dominates ROC curve B, while the reverse is not true in general. Many published papers adopted dominance as the criterion used to conclude that the model proposed in the paper outperforms previously published models, both in Software Engineering and in other fields. However, it may be the case that dominance is s…
Figure 11: Comparison of RF and BLR models for poi 2.5. When 𝑡 = 0.6 (point ▽) the RF model is better than the BLR model with respect to both TPR and FPR. However, for multiple threshold values, it is not so: when 𝑡 = 0.5 (point ♦) the BLR model achieves better TPR and worse FPR, while with 𝑡 = 0.8 (point ∗) the BLR model achieves slightly better FPR and worse TPR. For several threshold values, the two models provide diff…
Figure 14: TPR and FPR of RF and BLR models for jedit 4.2 as functions of the threshold. [Section 5, The Effect of Imbalance:] In imbalanced datasets, probabilities are biased towards the most common event [6, 11]. Thus, a large prevalence of positive modules (𝜌 close to 1) leads to maximizing TPR, at the expense of having poor FPR, for most values of the threshold. Similarly, a very small prevalence (𝜌 close to 0) leads to max…
Figure 13: Figure 13 shows that the ROC curve of the BLR model (AUC=0.85) dominates the ROC curve of the RF model (AUC=0.8). However, Figure 14 clearly shows that both models yield TPR values that are generally largely worse than random, hence probably neither model would be considered usable by a practitioner.
Figure 15: TPR(𝑡) and FPR(𝑡), when prevalence 𝜌 is high (log4j 1.2, left) and low (e-learning, right).
Figure 16: ROC curves with thresholds highlighted, when prevalence …
Figure 17: Comparison when condition (2) does not hold.

Original abstract

Background: Receiver Operating Characteristic (ROC) curves are widely used to evaluate the performance of Software Defect Prediction (SDP) models that estimate module fault-proneness, i.e., the probability that a module is faulty. A ROC curve maps a model's performance in terms of True Positive Rate and False Positive Rate for any possible threshold set on fault-proneness. The Area Under the ROC Curve (AUC) summarizes the performance of a model across all possible thresholds. Traditionally, ROC curves completely above the bisector of the ROC space are considered better than random, and high AUC values are associated with good performance. Aim: We investigate whether these beliefs are correct, hence if SDP model evaluation based on ROC curves and AUC is reliable. Method: We decorate ROC curves by highlighting the points corresponding to threshold values. We also represent True Positive Rate and False Positive Rate as functions of the threshold. Thus, we can evaluate whether a model classifies both faulty and non-faulty modules better than the random model. Results: We show that commonly used evaluation criteria may lead to wrong conclusions. Conclusions: A high value of AUC does not guarantee that both the True Positive Rate and the False Positive Rate of a model are better than the random model's for all possible thresholds. Either decorated ROC curves or alternative representations are needed to appreciate all the relevant aspects of SDP models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that evaluating software defect prediction (SDP) models via ROC curves and the Area Under the ROC Curve (AUC) can be misleading. A high AUC does not guarantee that both the True Positive Rate (TPR) and False Positive Rate (FPR) are better than those of a random model for every threshold, because an ROC curve with AUC > 0.5 can still cross the diagonal. The authors propose decorating ROC plots with threshold points or plotting TPR/FPR explicitly against the threshold to expose such behavior and enable more reliable assessment.

Significance. If the result holds, this observation has moderate significance for empirical software engineering. It draws attention to a direct consequence of the integral definition of AUC that can produce incorrect conclusions when SDP researchers rely solely on the scalar AUC value, especially with the imbalanced data typical in defect prediction. The suggested visualization remedies are low-cost and directly actionable. The argument rests on standard ROC definitions with no free parameters or self-referential derivations.

minor comments (2)
  1. The Results section should explicitly name the SDP datasets, models, and the numerical AUC values plus crossing points for any illustrated curves so that readers can verify the claimed crossings.
  2. Figure captions and the Method section need to state precisely how threshold values are chosen for decoration and whether they correspond to practically used operating points in SDP.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive review and recommendation of minor revision. The referee's summary accurately reflects the central claim and proposed remedies in our manuscript. We address the points raised below.

  1. Referee: The paper claims that evaluating software defect prediction (SDP) models via ROC curves and the Area Under the ROC Curve (AUC) can be misleading. A high AUC does not guarantee that both the True Positive Rate (TPR) and False Positive Rate (FPR) are better than those of a random model for every threshold, because an ROC curve with AUC > 0.5 can still cross the diagonal. The authors propose decorating ROC plots with threshold points or plotting TPR/FPR explicitly against the threshold to expose such behavior and enable more reliable assessment.

    Authors: We appreciate the referee's precise and accurate summary of our contribution. The manuscript demonstrates through standard ROC properties that an AUC > 0.5 does not preclude the curve from crossing the diagonal, so that for some thresholds the TPR or the FPR can be worse than the random baseline. The visualizations we advocate (decorated ROC curves with explicit threshold markers and separate TPR/FPR-vs-threshold plots) are presented exactly to reveal such cases and support more reliable evaluation, especially under the class imbalance typical in SDP. No further elaboration is required. revision: no

Circularity Check

0 steps flagged

No significant circularity


The paper presents a methodological critique of AUC-based evaluation for software defect prediction models. It relies on the standard integral definition of AUC and the geometric properties of ROC curves in [0,1]×[0,1] space. The central observation, that an AUC > 0.5 does not entail the entire curve lying above the diagonal, is a direct consequence of the definition of the integral (area can be accumulated even if the curve crosses the line y=x). No equations are fitted to data, no parameters are estimated and then renamed as predictions, and no load-bearing steps invoke self-citations or uniqueness theorems. The proposed remedies (decorated ROC plots and TPR/FPR-vs-threshold plots) are straightforward visualizations of the same underlying definitions. The derivation chain is therefore self-contained against external mathematical benchmarks and contains no reductions to its own inputs.
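The point about the integral can be checked with a small constructed example. The ROC points below are invented for illustration (they are not taken from any model in the paper), and the trapezoidal rule is simply the usual way to integrate a piecewise-linear curve.

```python
def trapezoid_auc(points):
    """Area under a piecewise-linear ROC curve given as (FPR, TPR) points sorted by FPR."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical ROC curve that dips below the diagonal near the origin.
roc = [(0.0, 0.0), (0.1, 0.05), (0.2, 0.9), (1.0, 1.0)]

auc = trapezoid_auc(roc)
below_diagonal = [(fpr, tpr) for fpr, tpr in roc if tpr < fpr]

print(f"AUC = {auc:.2f}")
print("points below the diagonal:", below_diagonal)
```

The area comes out at about 0.81 even though the point (0.1, 0.05) lies below the line y = x, so a high AUC and a sub-diagonal operating point can coexist.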

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper relies entirely on established statistical definitions of ROC curves, TPR, FPR, and AUC from prior literature without introducing any new free parameters, axioms, or invented entities.



Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages

  1. Erik Arisholm, Lionel C Briand, and Magnus Fuglerud. 2007. Data mining techniques for building fault-proneness models in telecom Java software. In The 18th IEEE International Symposium on Software Reliability (ISSRE’07). IEEE, 215–224.
  2. Sarah Beecham, Tracy Hall, David Bowes, David Gray, Steve Counsell, and Sue Black. 2010. A systematic review of fault prediction approaches used in software engineering. Technical Report Lero-TR-2010-04, Lero.
  3. Cagatay Catal. 2012. Performance evaluation metrics for software fault prediction studies. Acta Polytechnica Hungarica 9, 4 (2012), 193–206.
  4. Cagatay Catal and Banu Diri. 2009. A systematic review of software fault prediction studies. Expert systems with applications 36, 4 (2009), 7346–7354.
  5. Davide Chicco, Matthijs J Warrens, and Giuseppe Jurman. 2021. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access 9 (2021), 78368–78381.
  6. Jan Salomon Cramer. 1999. Predictive performance of the binary logit model in unbalanced samples. Journal of the Royal Statistical Society: Series D (The Statistician) 48, 1 (1999), 85–94.
  7. Giuseppe Destefanis, Leila Yousefi, Martin Shepperd, Allan Tucker, Stephen Swift, Steve Counsell, and Mahir Arzoky. 2026. An audit of machine learning experiments on software defect prediction. Empirical Software Engineering 31, 4 (2026), 83.
  8. Tom Fawcett. 2006. An Introduction to ROC Analysis. Pattern Recogn. Lett. 27, 8 (June 2006), 861–874. doi:10.1016/j.patrec.2005.10.010
  9. Cesar Ferri, Peter Flach, José Hernández-Orallo, and Athmane Senad. 2005. Modifying ROC curves to incorporate predicted probabilities. In Proceedings of the second workshop on ROC analysis in machine learning, Vol. 4140. International Conference on Machine Learning, 33–40.
  10. Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the 2017 11th joint meeting on foundations of software engineering. 49–60.
  11. David W. Hosmer and Stanley Lemeshow. 2000. Applied logistic regression (Wiley Series in probability and statistics) (2 ed.). Wiley-Interscience Publication, Hoboken, NJ.
  12. David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. 2013. Applied logistic regression. John Wiley & Sons.
  13. Marian Jureczko and Lech Madeyski. 2010. Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering. 1–10.
  14. Luigi Lavazza and Sandro Morasca. 2022. Comparing 𝜙 and the F-measure as Performance Metrics for Software-related Classifications. EMSE 27, 7 (2022).
  15. Luigi Lavazza, Sandro Morasca, and Gabriele Rotoloni. 2023. On the Reliability of the Area Under the ROC Curve in Empirical Software Engineering. In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE). Association for Computing Machinery (ACM).
  16. Luigi Lavazza, Sandro Morasca, and Gabriele Rotoloni. 2025. Software Defect Prediction evaluation: New metrics based on the ROC curve. Information and Software Technology (2025), 107865.
  17. Sandro Morasca and Luigi Lavazza. 2020. On the assessment of software defect prediction models via ROC curves. Empirical Software Engineering 25, 5 (2020), 3977–4019.
  18. Rebecca Moussa and Federica Sarro. 2022. On the Use of Evaluation Measures for Defect Prediction Studies. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM.
  19. John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10, 3 (1999), 61–74.
  20. Martin Shepperd, Qinbao Song, Zhongbin Sun, and Carolyn Mair. 2013. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on software engineering 39, 9 (2013), 1208–1215.
  21. Yogesh Singh, Arvinder Kaur, and Ruchika Malhotra. 2010. Empirical validation of object-oriented metrics for predicting fault proneness models. Software quality journal 18, 1 (2010), 3.
  22. David L Streiner and John Cairney. 2007. What’s under the ROC? An introduction to receiver operating characteristics curves. The Canadian Journal of Psychiatry 52, 2 (2007), 121–128.
  23. Ben Van Calster, Ewout W Steyerberg, Ralph B D’Agostino Sr, and Michael J Pencina. 2014. Sensitivity and specificity can change in opposite directions when new predictive markers are added to risk models. Medical Decision Making 34, 4 (2014), 513–522.
  24. Jan Y Verbakel, Ewout W Steyerberg, Hajime Uno, Bavo De Cock, Laure Wynants, Gary S Collins, and Ben Van Calster. 2020. ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. Journal of Clinical Epidemiology 126 (2020), 207–216.