Evaluating Software Defect Prediction Models via the Area Under the ROC Curve Can Be Misleading
Pith reviewed 2026-05-09 23:39 UTC · model grok-4.3
The pith
A high AUC does not guarantee that a software defect prediction model outperforms random guessing at every threshold.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A high value of AUC does not guarantee that both the True Positive Rate and the False Positive Rate of a model are better than the random model's for all possible thresholds.
What carries the argument
Decorated ROC curves that mark the points for concrete threshold values, together with separate plots of True Positive Rate and False Positive Rate versus the threshold.
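The per-threshold view described above can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `rates_by_threshold`, the toy scores and labels, and the convention that a module is predicted faulty when its score is at or above the threshold are all assumptions made here.

```python
def rates_by_threshold(scores, labels, thresholds):
    """Return a list of (threshold, TPR, FPR) triples.

    Assumption: a module is predicted faulty when score >= threshold.
    labels use 1 for faulty modules and 0 for non-faulty modules.
    """
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one faulty and one non-faulty module")
    rows = []
    for t in thresholds:
        # Count faulty and non-faulty modules flagged at this threshold.
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        rows.append((t, tp / positives, fp / negatives))
    return rows


# Hypothetical toy data: estimated fault-proneness and actual faultiness.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for t, tpr, fpr in rates_by_threshold(scores, labels, [0.25, 0.5, 0.75]):
    print(f"threshold={t:.2f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

Plotting the TPR and FPR columns against the threshold column gives the threshold-based rate plot; labelling each (FPR, TPR) point with its threshold gives a decorated ROC curve.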
If this is right
- Commonly used ROC and AUC criteria can produce incorrect assessments of SDP model quality.
- A model judged good by AUC may still classify faulty modules worse than random at some thresholds.
- Evaluating only the aggregate AUC hides cases where the model is worse than random for one class.
- Either decorated ROC curves or threshold-based rate plots are required to see all relevant performance aspects.
Where Pith is reading between the lines
- Teams should inspect performance at the specific thresholds they will actually apply rather than trusting the single AUC number.
- The same limitation likely appears in other fields that rely on ROC/AUC for binary classifiers with imbalanced outcomes.
- New visualizations or metrics focused on practical thresholds could reduce the risk of selecting misleading models.
Load-bearing premise
That the diagonal line in ROC space is the right baseline for random performance and that models must be checked across every possible threshold rather than only the ones used in practice.
What would settle it
An SDP model with AUC above 0.5 for which either the true positive rate or the false positive rate falls below the random baseline at some threshold value that is actually used in deployment.
Original abstract
Background: Receiver Operating Characteristic (ROC) curves are widely used to evaluate the performance of Software Defect Prediction (SDP) models that estimate module fault-proneness, i.e., the probability that a module is faulty. A ROC curve maps a model's performance in terms of True Positive Rate and False Positive Rate for any possible threshold set on fault-proneness. The Area Under the ROC Curve (AUC) summarizes the performance of a model across all possible thresholds. Traditionally, ROC curves completely above the bisector of the ROC space are considered better than random, and high AUC values are associated with good performance. Aim: We investigate whether these beliefs are correct, hence if SDP model evaluation based on ROC curves and AUC is reliable. Method: We decorate ROC curves by highlighting the points corresponding to threshold values. We also represent True Positive Rate and False Positive Rate as functions of the threshold. Thus, we can evaluate whether a model classifies both faulty and non-faulty modules better than the random model. Results: We show that commonly used evaluation criteria may lead to wrong conclusions. Conclusions: A high value of AUC does not guarantee that both the True Positive Rate and the False Positive Rate of a model are better than the random model's for all possible thresholds. Either decorated ROC curves or alternative representations are needed to appreciate all the relevant aspects of SDP models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that evaluating software defect prediction (SDP) models via ROC curves and the Area Under the ROC Curve (AUC) can be misleading. A high AUC does not guarantee that both the True Positive Rate (TPR) and the False Positive Rate (FPR) are better than those of a random model at every threshold (that is, a higher TPR and a lower FPR), because an ROC curve with AUC > 0.5 can still cross the diagonal. The authors propose decorating ROC plots with threshold points or plotting TPR/FPR explicitly against the threshold to expose such behavior and enable more reliable assessment.
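A small synthetic case may make the crossing concrete. The sketch below is an illustration under stated assumptions, not an example from the paper: the data are invented, the helper names `roc_points` and `auc_trapezoid` are hypothetical, a module is predicted faulty when its score is at or above the threshold, and the diagonal TPR = FPR is taken as the random baseline.

```python
def roc_points(scores, labels):
    """Return [(threshold, FPR, TPR)] for each distinct score, highest first."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        fp = sum(1 for s, y in zip(scores, labels) if y == 0 and s >= t)
        points.append((t, fp / neg, tp / pos))
    return points


def auc_trapezoid(points):
    """Area under the ROC polyline, starting from the origin (0, 0)."""
    xy = [(0.0, 0.0)] + [(fpr, tpr) for _, fpr, tpr in points]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(xy, xy[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Invented data: the highest-scored module is actually non-faulty.
scores = [0.95, 0.85, 0.80, 0.75, 0.70, 0.30, 0.20, 0.10]
labels = [0, 1, 1, 1, 1, 0, 0, 0]

points = roc_points(scores, labels)
auc = auc_trapezoid(points)
# Thresholds whose ROC point lies strictly below the diagonal.
below = [(t, fpr, tpr) for t, fpr, tpr in points if tpr < fpr]

print(f"AUC = {auc:.2f}")
for t, fpr, tpr in below:
    print(f"threshold={t:.2f}: TPR={tpr:.2f} < FPR={fpr:.2f}")
```

Here the scalar AUC looks comfortable, yet the strictest threshold flags only a non-faulty module, so that operating point sits below the diagonal.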
Significance. If the result holds, this observation has moderate significance for empirical software engineering. It draws attention to a direct consequence of the integral definition of AUC that can produce incorrect conclusions when SDP researchers rely solely on the scalar AUC value, especially with the imbalanced data typical in defect prediction. The suggested visualization remedies are low-cost and directly actionable. The argument rests on standard ROC definitions with no free parameters or self-referential derivations.
Minor comments (2)
- The Results section should explicitly name the SDP datasets, models, and the numerical AUC values plus crossing points for any illustrated curves so that readers can verify the claimed crossings.
- Figure captions and the Method section need to state precisely how threshold values are chosen for decoration and whether they correspond to practically used operating points in SDP.
Simulated Author's Rebuttal
We thank the referee for the positive review and recommendation of minor revision. The referee's summary accurately reflects the central claim and proposed remedies in our manuscript. We address the points raised below.
Referee: The paper claims that evaluating software defect prediction (SDP) models via ROC curves and the Area Under the ROC Curve (AUC) can be misleading. A high AUC does not guarantee that both the True Positive Rate (TPR) and the False Positive Rate (FPR) are better than those of a random model at every threshold (that is, a higher TPR and a lower FPR), because an ROC curve with AUC > 0.5 can still cross the diagonal. The authors propose decorating ROC plots with threshold points or plotting TPR/FPR explicitly against the threshold to expose such behavior and enable more reliable assessment.
Authors: We appreciate the referee's precise and accurate summary of our contribution. The manuscript demonstrates through standard ROC properties that an AUC > 0.5 does not preclude the curve from crossing the diagonal, so that for some thresholds the model is no better than the random baseline on TPR, FPR, or both. The visualizations we advocate (decorated ROC curves with explicit threshold markers and separate TPR/FPR-vs-threshold plots) are presented exactly to reveal such cases and support more reliable evaluation, especially under the class imbalance typical in SDP. No further elaboration is required. Revision: no.
Circularity Check
No significant circularity
The paper presents a methodological critique of AUC-based evaluation for software defect prediction models. It relies on the standard integral definition of AUC and the geometric properties of ROC curves in the unit square [0,1] × [0,1]. The central observation, that an AUC > 0.5 does not entail the entire curve lying above the diagonal, is a direct consequence of that definition: the area gained where the curve lies above the line y = x can outweigh the area lost where it falls below. No equations are fitted to data, no parameters are estimated and then renamed as predictions, and no load-bearing steps invoke self-citations or uniqueness theorems. The proposed remedies (decorated ROC plots and TPR/FPR-vs-threshold plots) are straightforward visualizations of the same underlying definitions. The derivation chain is therefore self-contained against external mathematical benchmarks and contains no reductions to its own inputs.
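The integral argument can be checked on a worked example. The piecewise-linear ROC curve below, with vertices (0, 0), (0.2, 0.1), (0.4, 0.9), and (1, 1), is hypothetical and chosen only for easy arithmetic; it does not come from the paper.

```latex
% Write x = FPR. Since \int_0^1 x \, dx = 1/2,
\[
  \mathrm{AUC} - \tfrac{1}{2}
  = \int_0^1 \bigl( \mathrm{TPR}(x) - x \bigr) \, dx .
\]
% Trapezoid areas for the hypothetical vertices (0,0), (0.2,0.1), (0.4,0.9), (1,1):
\[
  \mathrm{AUC}
  = \frac{0.2\,(0 + 0.1)}{2} + \frac{0.2\,(0.1 + 0.9)}{2} + \frac{0.6\,(0.9 + 1)}{2}
  = 0.01 + 0.10 + 0.57 = 0.68 .
\]
% The second segment, \mathrm{TPR}(x) = 0.1 + 4\,(x - 0.2), meets the diagonal where
% 0.1 + 4\,(x - 0.2) = x, i.e. x = 7/30, so
\[
  \mathrm{TPR}(x) < x \quad \text{for } 0 < x < \tfrac{7}{30},
  \qquad \text{although } \mathrm{AUC} = 0.68 > 0.5 .
\]
```

The integrand is negative on the first stretch of the curve, but the positive contribution from the rest dominates, so the scalar AUC alone does not reveal the region below the diagonal.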
Reference graph
Works this paper leans on
- [1] Erik Arisholm, Lionel C Briand, and Magnus Fuglerud. 2007. Data mining techniques for building fault-proneness models in telecom Java software. In The 18th IEEE International Symposium on Software Reliability (ISSRE’07). IEEE, 215–224.
- [2] Sarah Beecham, Tracy Hall, David Bowes, David Gray, Steve Counsell, and Sue Black. 2010. A systematic review of fault prediction approaches used in software engineering. Technical Report Lero-TR-2010-04, Lero.
- [3] Cagatay Catal. 2012. Performance evaluation metrics for software fault prediction studies. Acta Polytechnica Hungarica 9, 4 (2012), 193–206.
- [4] Cagatay Catal and Banu Diri. 2009. A systematic review of software fault prediction studies. Expert systems with applications 36, 4 (2009), 7346–7354.
- [5] Davide Chicco, Matthijs J Warrens, and Giuseppe Jurman. 2021. The Matthews correlation coefficient (MCC) is more informative than Cohen’s Kappa and Brier score in binary classification assessment. IEEE Access 9 (2021), 78368–78381.
- [6] Jan Salomon Cramer. 1999. Predictive performance of the binary logit model in unbalanced samples. Journal of the Royal Statistical Society: Series D (The Statistician) 48, 1 (1999), 85–94.
- [7] Giuseppe Destefanis, Leila Yousefi, Martin Shepperd, Allan Tucker, Stephen Swift, Steve Counsell, and Mahir Arzoky. 2026. An audit of machine learning experiments on software defect prediction. Empirical Software Engineering 31, 4 (2026), 83.
- [8] Tom Fawcett. 2006. An Introduction to ROC Analysis. Pattern Recogn. Lett. 27, 8 (June 2006), 861–874. doi:10.1016/j.patrec.2005.10.010
- [9] Cesar Ferri, Peter Flach, José Hernández-Orallo, and Athmane Senad. 2005. Modifying ROC curves to incorporate predicted probabilities. In Proceedings of the second workshop on ROC analysis in machine learning, Vol. 4140. International Conference on Machine Learning, 33–40.
- [10] Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the 2017 11th joint meeting on foundations of software engineering. 49–60.
- [11] David W. Hosmer and Stanley Lemeshow. 2000. Applied logistic regression (Wiley Series in probability and statistics) (2 ed.). Wiley-Interscience Publication, Hoboken, NJ.
- [12] David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant. 2013. Applied logistic regression. John Wiley & Sons.
- [13] Marian Jureczko and Lech Madeyski. 2010. Towards identifying software project clusters with regard to defect prediction. In Proceedings of the 6th International Conference on Predictive Models in Software Engineering. 1–10.
- [14] Luigi Lavazza and Sandro Morasca. 2022. Comparing 𝜙 and the F-measure as Performance Metrics for Software-related Classifications. EMSE 27, 7 (2022).
- [15] Luigi Lavazza, Sandro Morasca, and Gabriele Rotoloni. 2023. On the Reliability of the Area Under the ROC Curve in Empirical Software Engineering. In Proceedings of the 24th International Conference on Evaluation and Assessment in Software Engineering (EASE). Association for Computing Machinery (ACM).
- [16] Luigi Lavazza, Sandro Morasca, and Gabriele Rotoloni. 2025. Software Defect Prediction evaluation: New metrics based on the ROC curve. Information and Software Technology (2025), 107865.
- [17] Sandro Morasca and Luigi Lavazza. 2020. On the assessment of software defect prediction models via ROC curves. Empirical Software Engineering 25, 5 (2020), 3977–4019.
- [18] Rebecca Moussa and Federica Sarro. 2022. On the Use of Evaluation Measures for Defect Prediction Studies. In Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA). ACM.
- [19] John Platt et al. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in large margin classifiers 10, 3 (1999), 61–74.
- [20] Martin Shepperd, Qinbao Song, Zhongbin Sun, and Carolyn Mair. 2013. Data quality: Some comments on the NASA software defect datasets. IEEE Transactions on software engineering 39, 9 (2013), 1208–1215.
- [21] Yogesh Singh, Arvinder Kaur, and Ruchika Malhotra. 2010. Empirical validation of object-oriented metrics for predicting fault proneness models. Software quality journal 18, 1 (2010), 3.
- [22] David L Streiner and John Cairney. 2007. What’s under the ROC? An introduction to receiver operating characteristics curves. The Canadian Journal of Psychiatry 52, 2 (2007), 121–128.
- [23] Ben Van Calster, Ewout W Steyerberg, Ralph B D’Agostino Sr, and Michael J Pencina. 2014. Sensitivity and specificity can change in opposite directions when new predictive markers are added to risk models. Medical Decision Making 34, 4 (2014), 513–522.
- [24] Jan Y Verbakel, Ewout W Steyerberg, Hajime Uno, Bavo De Cock, Laure Wynants, Gary S Collins, and Ben Van Calster. 2020. ROC curves for clinical prediction models part 1. ROC plots showed no added value above the AUC when evaluating the performance of clinical prediction models. Journal of Clinical Epidemiology 126 (2020), 207–216.