
arxiv: 2607.01842 · v1 · pith:7U7RTRA6new · submitted 2026-07-02 · 💻 cs.SE

Understanding Software Defect Prediction: A Large-scale Empirical Study Across Uncertainty Quantification and Performance Evaluation

Pith reviewed 2026-07-03 09:16 UTC · model grok-4.3

classification 💻 cs.SE
keywords software defect prediction · uncertainty quantification · cross-project defect prediction · calibration metrics · performance evaluation · empirical study · within-project prediction

The pith

Uncertainty quantification in software defect prediction correlates inconsistently with performance and fails to transfer reliably across projects.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a large-scale empirical analysis of how probability-based uncertainty quantification metrics relate to performance and calibration in software defect prediction classifiers. It compares these relationships in within-project settings using 36 datasets and in cross-project settings using 32 feature-compatible datasets, across 16 classifiers and multiple metrics. Results indicate that uncertainty metrics align more consistently with certain performance measures like false positive rate and AUC in within-project cases, but these links vary by classifier category and dataset. Performance and calibration prove related yet distinct, while many correlations weaken or reverse when predictions cross projects. The work concludes that uncertainty signals require task-specific evaluation and that transferred probabilities need revalidation.

Core claim

Through experiments on 16 classifiers, the study establishes that uncertainty quantification is highly context-dependent in software defect prediction. In within-project defect prediction, uncertainty metrics correlate more consistently with false positive rate and AUC than with MCC or F1 score, with these patterns also differing across classifier types and dataset collections. Classifiers can achieve strong discrimination while still showing large calibration error. In cross-project defect prediction, several uncertainty-performance and uncertainty-calibration correlations weaken or reverse, demonstrating that uncertainty signals do not reliably transfer across projects.

What carries the argument

Comparative correlation analysis of five uncertainty quantification metrics against six performance metrics and three calibration metrics under within-project and cross-project defect prediction settings.
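
To make this kind of analysis concrete, the following minimal Python sketch computes one dataset-level uncertainty value and correlates it with one performance metric across datasets. It is an illustration, not the authors' pipeline: predictive entropy is assumed as the UQ metric (this page does not name the paper's five), Spearman's rho is assumed as the correlation method, and the dataset names, probabilities, and AUC values are made-up toy inputs.

```python
import math

def binary_entropy(p):
    """Shannon entropy (in bits) of a predicted defect probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)

def dataset_uncertainty(probs):
    """Dataset-level UQ value: mean entropy over all predictions."""
    return sum(binary_entropy(p) for p in probs) / len(probs)

# Hypothetical toy inputs (NOT from the paper): predicted defect
# probabilities per dataset and one performance score per dataset.
toy_probs = {
    "A": [0.05, 0.10, 0.90, 0.95],
    "B": [0.20, 0.30, 0.70, 0.80],
    "C": [0.40, 0.45, 0.55, 0.60],
}
toy_auc = {"A": 0.90, "B": 0.75, "C": 0.60}

names = sorted(toy_probs)
uq_values = [dataset_uncertainty(toy_probs[n]) for n in names]
rho = spearman(uq_values, [toy_auc[n] for n in names])
print(f"Spearman rho between mean entropy and AUC: {rho:.3f}")
```

In the toy data, mean entropy rises from dataset A to C while AUC falls, so the rank correlation is strongly negative. The paper's point is that this sign and strength cannot be assumed to hold for other metrics, classifiers, or settings.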

If this is right

  • Uncertainty quantification must be assessed against specific performance objectives rather than assumed to indicate performance in general.
  • Calibration requires independent evaluation with multiple metrics because it is not interchangeable with discrimination performance.
  • Probabilities transferred from one project to another should be revalidated before guiding quality-assurance decisions.
  • Classifier category and dataset collection influence the strength of uncertainty-to-performance links.
  • Context-specific checks replace any assumption of universal properties for uncertainty measures.
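
The gap between discrimination and calibration can be reproduced with a few lines of Python. The sketch below is illustrative only. It assumes a binary, equal-width-bin ECE over the predicted defect probability (the paper's exact ECE definition is not given on this page), and the labels and probabilities are toy values chosen so that every defective module outranks every clean one while all probabilities sit close to 0.5.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)

def expected_calibration_error(probs, labels, n_bins=10):
    """Binary ECE: bin by predicted defect probability, then compare the
    mean prediction with the observed defect rate in each non-empty bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        rate = sum(y for _, y in members) / len(members)
        ece += len(members) / len(probs) * abs(mean_p - rate)
    return ece

def auc(probs, labels):
    """AUC as the share of (defective, clean) pairs ranked correctly;
    tied pairs count as half."""
    pos = [p for p, y in zip(probs, labels) if y == 1]
    neg = [p for p, y in zip(probs, labels) if y == 0]
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy predictions (NOT from the paper): perfect ranking,
# but every probability is compressed towards 0.5.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
probs = [0.46, 0.47, 0.48, 0.49, 0.51, 0.52, 0.53, 0.54]

toy_auc = auc(probs, labels)
toy_ece = expected_calibration_error(probs, labels)
toy_brier = brier_score(probs, labels)
print(f"AUC={toy_auc:.3f}  ECE={toy_ece:.3f}  Brier={toy_brier:.5f}")
```

On these toy inputs the ranking is perfect (AUC of 1.0) while ECE is 0.475, which is the pattern the second bullet warns about: a classifier can discriminate well and still report probabilities that should not be read as defect rates.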

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Teams could test adaptive recalibration techniques on target projects to see whether they restore reliable uncertainty signals.
  • The observed pattern suggests similar evaluation needs may exist for uncertainty measures in other transferred prediction tasks.
  • Practitioners might collect limited target-project data specifically to validate uncertainty metrics before model deployment.
  • Methods that detect when a new project diverges enough to break existing uncertainty correlations could be developed.
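
One simple way to try the recalibration idea in the first bullet is histogram binning, a standard post-hoc calibration technique that is not taken from the paper. The sketch assumes that a small labelled sample from the target project is available. The function names, the bin count, and the toy data are hypothetical.

```python
def fit_histogram_binning(probs, labels, n_bins=5):
    """Learn one calibrated value per equal-width bin: the observed defect
    rate of the labelled target-project samples that fall into that bin."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += y
        counts[idx] += 1
    # Bins without target samples fall back to their midpoint, which
    # leaves scores in that range roughly unchanged.
    return [sums[i] / counts[i] if counts[i] else (i + 0.5) / n_bins
            for i in range(n_bins)]

def apply_histogram_binning(bin_values, probs):
    """Replace each transferred probability by its bin's calibrated value."""
    n_bins = len(bin_values)
    return [bin_values[min(int(p * n_bins), n_bins - 1)] for p in probs]

# Hypothetical labelled sample from the target project (NOT from the paper):
# the source-trained model is overconfident at both ends of the scale.
target_probs = [0.90, 0.95, 0.85, 0.92, 0.10, 0.15, 0.05, 0.12]
target_labels = [1, 0, 1, 0, 0, 0, 1, 0]

bin_values = fit_histogram_binning(target_probs, target_labels)
recalibrated = apply_histogram_binning(bin_values, [0.88, 0.07])
print(bin_values)
print(recalibrated)
```

Whether such recalibration restores usable uncertainty signals is exactly the open question raised above. The sketch only shows the mechanics, and any real check would need held-out target data that was not used to fit the bins.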

Load-bearing premise

The chosen 36 benchmark datasets for within-project prediction, 32 feature-compatible datasets for cross-project prediction, and 16 representative classifiers capture sufficient real-world variability to support general claims about context-dependence.

What would settle it

A follow-up study that finds stable correlations between uncertainty quantification metrics and performance metrics on a fresh, diverse collection of projects, outside the 36 WPDP and 32 CPDP benchmark datasets used in the paper, would challenge the context-dependence finding.

Figures

Figures reproduced from arXiv: 2607.01842 by Ranjun Peng, Rubing Huang, WenBin He, Xuan Xie, Zhijie Wang.

Figure 1: Dendrogram of 40 candidate classifiers clustered using COD scores. Each leaf node at the bottom …
Figure 2: Correlations between UQ and performance by dataset collection and correlation method. Each heatmap shows the mean correlation across the classifiers between five UQ metrics (rows) and nine target metrics (columns). Color scale: blue (−1) to red (+1). Cell labels report absolute correlation magnitudes on the original correlation scale; the sign is encoded by color. Open circles mark cells where the 16 class…
Figure 3: Alignment scores between UQ and performance across six performance metrics. Each panel shows the alignment score (Spearman correlation between dataset-level UQ values and performance metric values, sign-adjusted so that higher is better) with the minimum, average, and maximum across the five UQ metrics displayed for each classifier. Classifiers are sorted by the average alignment score within each panel. f…
Figure 4: Uncertainty distribution by prediction granularity-aware dataset group. Each panel shows the dataset-level mean of one UQ metric for class (AEEEM and PROMISE), function/method (NASA), and file (ReLink) datasets. Box plots show median and interquartile range; half violin plots show density; points are individual datasets; diamonds mark means. The file-level ReLink group has the widest uncertainty spread. AU…
Figure 5: The correlation between performance and calibration across all metric pairs. Rows correspond to calibration metrics (ECE, NLL, Brier); columns correspond to performance metrics (MCC, Precision, Recall, FPR, AUC, F1 score). Each panel shows all classifier and dataset observations; color indicates local density; the red line is a linear fit; annotation reports Spearman 𝜌. The largest correlations are AUC vs.…
Figure 6: Calibration distribution by classifier category. Panels show ECE, NLL, and Brier Score. Each panel displays the distribution of dataset-level calibration errors for all valid classifier and dataset runs within a category: half violin density, box plot, individual points, and mean marker. The categories follow the order of mean ECE. Low ECE does not always coincide with low NLL: KNN and CART have low ECE bu…
Original abstract

Software defect prediction (SDP) classifiers produce probabilities used for inspection prioritization, threshold tuning, and risk communication. Probability-based uncertainty quantification (UQ) characterizes prediction confidence, but whether common UQ metrics reliably indicate performance and calibration remains unclear. We conducted a large-scale empirical study of probability-based UQ for SDP. We evaluated five UQ metrics, six performance metrics, and three calibration metrics for 16 representative classifiers. We analyzed these relationships under two prediction settings: within-project defect prediction (WPDP), using 36 benchmark datasets, and cross-project defect prediction (CPDP), using 32 feature-compatible datasets. Results showed that UQ was highly context-dependent. Under WPDP, UQ correlated more consistently with false positive rate and AUC than with MCC, F1 score, and other metrics; these correlations also varied across classifier categories and dataset collections. Performance and calibration were related but not interchangeable; classifiers with strong discrimination could still exhibit large calibration error. Under CPDP, several UQ-performance and UQ-calibration correlations weakened or reversed, indicating that uncertainty signals do not reliably transfer across projects. Thus, UQ should be evaluated against specific performance objectives. Calibration should be assessed independently using multiple metrics. Transferred probabilities should be revalidated before guiding quality-assurance decisions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts a large-scale empirical study of probability-based uncertainty quantification (UQ) in software defect prediction (SDP). It evaluates five UQ metrics against six performance and three calibration metrics across 16 classifiers on 36 within-project (WPDP) and 32 cross-project (CPDP) datasets. It concludes that UQ-performance and UQ-calibration correlations are highly context-dependent, varying by setting (WPDP vs. CPDP), classifier category, and metric. Performance and calibration are related but not interchangeable, and UQ signals weaken or reverse under CPDP, so they should be evaluated against specific objectives and revalidated when transferred.

Significance. If the results hold, the work provides a broad empirical basis for caution against assuming that UQ metrics reliably indicate performance or transfer across projects in SDP applications such as inspection prioritization. The experimental scale (36+32 datasets, 16 classifiers, multiple metric families, two settings) is a clear strength that supports observing systematic variation rather than isolated effects.

major comments (2)
  1. §3.1 (Datasets): The selection and diversity criteria for the 36 WPDP and 32 CPDP datasets (e.g., stratification by project size, language, defect density, or preprocessing steps) are not described in sufficient detail to evaluate whether the observed weakening/reversal of correlations under CPDP is a general property of uncertainty signals or specific to the chosen collections.
  2. §4.3 (CPDP results): The claim that several UQ-performance and UQ-calibration correlations 'weakened or reversed' lacks reporting of statistical significance tests, effect sizes, or multiple-comparison corrections; without these, it is unclear whether the changes are robust enough to support the non-transferability conclusion.
minor comments (2)
  1. Table 1: The mapping of the 16 classifiers to the four categories (e.g., which models fall under 'tree-based') should be listed explicitly for reproducibility.
  2. Figure 3: Axis labels and color legends for the correlation heatmaps are too small to read without magnification, hindering interpretation of the WPDP vs. CPDP differences.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the positive evaluation of the study's scale and implications. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: §3.1 (Datasets): The selection and diversity criteria for the 36 WPDP and 32 CPDP datasets (e.g., stratification by project size, language, defect density, or preprocessing steps) are not described in sufficient detail to evaluate whether the observed weakening/reversal of correlations under CPDP is a general property of uncertainty signals or specific to the chosen collections.

    Authors: The datasets were selected from established public SDP benchmarks to ensure comparability with prior literature and to support feature compatibility in the CPDP setting. We agree that greater transparency on selection and diversity criteria would help readers evaluate the scope of the findings. In the revised manuscript we will expand §3.1 with an explicit description of the sources, any filtering applied for feature alignment, and summary characteristics (size, defect density, languages) of the collections. revision: yes

  2. Referee: §4.3 (CPDP results): The claim that several UQ-performance and UQ-calibration correlations 'weakened or reversed' lacks reporting of statistical significance tests, effect sizes, or multiple-comparison corrections; without these, it is unclear whether the changes are robust enough to support the non-transferability conclusion.

    Authors: We accept that formal statistical support would strengthen the comparison between settings. The current results are based on descriptive patterns across a large number of dataset-classifier combinations. In the revision we will add paired statistical tests (e.g., Wilcoxon signed-rank) on the correlation values between WPDP and CPDP, report effect sizes, and apply suitable multiple-comparison corrections to substantiate the observed weakening or reversal. revision: yes
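
The paired comparison described in the second response could look like the sketch below, which is an assumption-laden illustration rather than the authors' planned analysis. It implements an exact two-sided Wilcoxon signed-rank test by enumerating sign assignments (practical only for small samples), a matched-pairs rank-biserial effect size, and Holm's step-down procedure as one possible multiple-comparison correction. The correlation values are invented placeholders, not results from the paper.

```python
from itertools import product

def signed_rank_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test for paired samples.
    Returns (W+, p-value, matched-pairs rank-biserial effect size)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0, 0.0
    # Rank absolute differences; ties share the mean rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    centre = total / 2
    observed = abs(w_plus - centre)
    # Under the null every sign pattern is equally likely: count the
    # patterns whose W+ is at least as far from the centre as observed.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - centre) >= observed - 1e-9:
            extreme += 1
    p_value = extreme / 2 ** n
    effect = (2 * w_plus - total) / total  # (W+ - W-) / (W+ + W-)
    return w_plus, p_value, effect

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

# Invented placeholder correlations for six classifiers (NOT paper results):
# the same UQ-metric/target-metric pair measured in two settings.
corr_wpdp = [0.62, 0.55, 0.48, 0.70, 0.51, 0.66]
corr_cpdp = [0.30, 0.41, 0.52, 0.35, 0.20, 0.45]

w_plus, p_value, effect = signed_rank_exact(corr_wpdp, corr_cpdp)
print(f"W+={w_plus}, p={p_value:.4f}, rank-biserial={effect:.3f}")
print(holm_adjust([0.01, 0.04, 0.03]))
```

With only six pairs the smallest attainable two-sided p-value is 2/64, so a design like this has little power per comparison. That is one reason to report effect sizes alongside corrected p-values, as the response proposes.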

Circularity Check

0 steps flagged

No circularity: purely empirical reporting of observed correlations on benchmark data

Full rationale

This is a purely empirical study that evaluates five UQ metrics, six performance metrics, and three calibration metrics across 16 classifiers on 36 WPDP and 32 CPDP benchmark datasets, then reports the observed correlations and their variation by setting, classifier category, and metric. No derivations, equations, fitted parameters, or self-citations are invoked to support the central claims; all results follow directly from the controlled experiments on the chosen data. The representativeness assumption noted by the skeptic is a standard external-validity concern for empirical work, not a circularity issue in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study relies on standard statistical assumptions for correlation analysis and the representativeness of existing benchmark datasets and classifiers drawn from prior literature; no new free parameters, ad-hoc axioms, or invented entities are introduced.

axioms (2)
  • [standard math] Statistical assumptions underlying the correlation measures used to link UQ metrics with performance and calibration scores
    The reported relationships depend on these standard assumptions for interpreting correlations.
  • [domain assumption] The selected benchmark datasets and classifiers adequately represent the space of SDP tasks
    Generalization of context-dependence findings rests on this premise.


