Will It Break in Production? Metric-Driven Prediction of Residual Defects in Python Systems
Pith reviewed 2026-05-07 10:48 UTC · model grok-4.3
The pith
Supervised metric models predict residual Python defects with 0.85-0.9 recall while LLMs fail.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that supervised classifiers trained on software metrics can distinguish residual from non-residual faults in Python systems far better than either large language models or unsupervised techniques, delivering recall between 0.85 and 0.9 and an order-of-magnitude drop in missed defects. The most useful signals come from process metrics (age, churn, developer activity) and size measures. In addition, the metrics and normalized code representations lie in distinct regions of feature space, indicating they capture complementary rather than redundant information.
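The link between recall and missed defects is plain arithmetic: false negatives equal (1 - recall) times the number of true residual faults. The sketch below illustrates this under stated assumptions. The count of 2,000 residual faults and the baseline recall of 0.0 are hypothetical values chosen for illustration; the summary above reports neither.

```python
def missed_defects(recall: float, positives: int) -> int:
    """False negatives implied by a recall value over `positives` true residual faults."""
    return round((1.0 - recall) * positives)


def fn_reduction_factor(baseline_recall: float, model_recall: float) -> float:
    """How many times fewer defects the model misses than the baseline (same test set)."""
    return (1.0 - baseline_recall) / (1.0 - model_recall)


# Hypothetical: half of a balanced set of roughly 4,000 faults are residual.
POSITIVES = 2000

missed_at_085 = missed_defects(0.85, POSITIVES)
missed_at_090 = missed_defects(0.90, POSITIVES)

# Hypothetical baseline that misses every residual fault (recall 0.0).
factor = fn_reduction_factor(baseline_recall=0.0, model_recall=0.90)

print(missed_at_085, missed_at_090, round(factor, 2))
```

Under this arithmetic, a tenfold drop at recall 0.9 would correspond to a baseline that misses nearly all residual faults. The summary does not give baseline recall values, so this is only an illustration of how the two reported figures relate.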
What carries the argument
Supervised ensemble classifiers (RandomForest, XGBoost, CatBoost) trained on a balanced collection of 4,000+ labeled faults using 83 product, process, statistical, and Python-specific metrics.
Load-bearing premise
The balanced set of over 4,000 labeled faults is representative of real Python systems and the cross-project experiments measure generalization without data leakage or label noise.
What would settle it
Training the same models on a fresh collection of Python projects and finding recall below 0.8 or loss of predictive strength for process metrics would falsify the central result.
Original abstract
Python's dynamic nature complicates testing and increases the possibility that some defects evade detection, so effective fault prediction becomes essential. We examine whether post-release faults can be predicted using modern ML and DL. Using a balanced dataset of over 4,000 labeled faults with 83 product, process, statistical, and Python-specific metrics plus normalized code representations, we conduct cross-project experiments. LLMs and unsupervised models fail to distinguish residual from non-residual faults, while supervised metric-based models (RandomForest, XGBoost, CatBoost) perform far better, yielding a recall of 0.85-0.9 and cutting false negatives by an order of magnitude. Process metrics, especially age, churn, and developer activity, alongside class and file size, consistently prove most predictive. Notably, Principal Component Analysis shows that metrics and code embeddings occupy distinct regions of the representation space, suggesting that they capture complementary rather than redundant information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical study on predicting post-release residual defects in Python systems. It uses a balanced dataset of over 4,000 labeled faults together with 83 product, process, statistical, and Python-specific metrics plus normalized code representations to run cross-project experiments. Supervised models (Random Forest, XGBoost, CatBoost) are claimed to reach 0.85-0.9 recall and reduce false negatives by an order of magnitude, while LLMs and unsupervised methods fail; process metrics (age, churn, developer activity) and class/file size are identified as most predictive, and PCA indicates that metrics and embeddings occupy distinct regions of feature space.
Significance. If the dataset construction and splitting procedures prove free of leakage and label noise, the work would strengthen the case for metric-driven supervised learning in defect prediction for dynamically typed languages. The explicit comparison against LLMs and the finding that process metrics dominate would be useful for practitioners and for guiding future hybrid metric-embedding models. The PCA result on complementary representations is a modest but concrete contribution to understanding feature spaces in SE ML.
major comments (4)
- [Methods / Dataset] Dataset construction (Methods section): The abstract and results claim a balanced dataset of over 4,000 labeled faults (residual and non-residual) yielding 0.85-0.9 recall, yet no description is given of the labeling source (post-release bug reports, test failures, etc.), the balancing procedure, or any controls for label noise or class imbalance artifacts. These omissions are load-bearing for the central performance claim.
- [Experimental setup] Cross-project splitting (Experimental setup): The paper states that experiments are cross-project but supplies no details on how folds were constructed to prevent leakage of process metrics (churn, age, developer activity) via shared modules, developers, or temporal overlap. Without such controls the reported generalization performance cannot be assessed.
- [Results] Evaluation reporting (Results): Only recall is highlighted; the manuscript must report precision, F1, AUC-ROC, and statistical significance or error bars for all models to substantiate the claim that supervised metric models cut false negatives by an order of magnitude relative to baselines.
- [Results] Feature importance (Results / Discussion): The assertion that process metrics and size metrics are consistently most predictive requires explicit support via ranked importance scores, ablation studies, or SHAP analysis; the current text leaves this claim qualitative.
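The third major comment asks for precision and F1 alongside recall. A minimal sketch of how these follow from one confusion matrix is given below. The counts are hypothetical and are not taken from the paper; they are chosen so that recall lands at 0.9.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts, with 'positive' meaning a residual fault."""
    tp: int
    fp: int
    fn: int
    tn: int


def precision(c: Confusion) -> float:
    denom = c.tp + c.fp
    return c.tp / denom if denom else 0.0


def recall(c: Confusion) -> float:
    denom = c.tp + c.fn
    return c.tp / denom if denom else 0.0


def f1(c: Confusion) -> float:
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Hypothetical test fold: 200 residual and 200 non-residual faults.
example = Confusion(tp=180, fp=60, fn=20, tn=140)

print(f"precision={precision(example):.3f} "
      f"recall={recall(example):.3f} f1={f1(example):.3f}")
```

The example shows why recall alone is not enough: the same recall of 0.9 is compatible with very different precision values, depending on the false-positive count. AUC-ROC cannot be computed from a single confusion matrix because it needs the ranked prediction scores.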
minor comments (3)
- [Abstract] Abstract: the phrase 'normalized code representations' is undefined; specify the embedding model, normalization procedure, and dimensionality.
- [Discussion] The manuscript would benefit from a dedicated Threats to Validity section addressing project-size confounds and label quality.
- [Figures] Figure legends for metric-importance or PCA plots should be enlarged and include axis labels with units or explained variance percentages.
Simulated Authors' Rebuttal
We thank the referee for the constructive and detailed feedback, which has strengthened the clarity and rigor of our work. We address each major comment below and have revised the manuscript to incorporate the requested details and analyses.
Point-by-point responses
- Referee: [Methods / Dataset] Dataset construction (Methods section): The abstract and results claim a balanced dataset of over 4,000 labeled faults (residual and non-residual) yielding 0.85-0.9 recall, yet no description is given of the labeling source (post-release bug reports, test failures, etc.), the balancing procedure, or any controls for label noise or class imbalance artifacts. These omissions are load-bearing for the central performance claim.
  Authors: We agree the original submission omitted key details on dataset construction. The labels were derived from post-release bug reports in project issue trackers, cross-referenced with commit histories and test failure logs from the studied Python repositories. Balancing was performed via stratified undersampling of the non-residual class to equal the number of residual faults (>4,000 total instances). Label noise was mitigated by requiring confirmation from at least two independent sources (e.g., bug report plus failing test) and excluding ambiguous cases. We have added a new subsection 'Dataset Construction, Labeling, and Balancing' in Methods with these procedures, including explicit discussion of imbalance controls.
  Revision: yes
- Referee: [Experimental setup] Cross-project splitting (Experimental setup): The paper states that experiments are cross-project but supplies no details on how folds were constructed to prevent leakage of process metrics (churn, age, developer activity) via shared modules, developers, or temporal overlap. Without such controls the reported generalization performance cannot be assessed.
  Authors: We acknowledge the need for explicit leakage controls. The revised Experimental setup section now details that cross-project folds were constructed by partitioning at the project level with no shared modules, files, or developers between training and test sets. Temporal ordering was enforced so that all training data precedes test data, and process metrics (churn, age, developer activity) were computed strictly within each fold's training window using only historical information available at prediction time. We added pseudocode, a diagram of the split, and a leakage audit summary.
  Revision: yes
- Referee: [Results] Evaluation reporting (Results): Only recall is highlighted; the manuscript must report precision, F1, AUC-ROC, and statistical significance or error bars for all models to substantiate the claim that supervised metric models cut false negatives by an order of magnitude relative to baselines.
  Authors: We have expanded the Results section to report precision, F1-score, and AUC-ROC for all models and baselines. All metrics are now presented as means with standard deviations across 10 repeated stratified cross-validation runs, accompanied by error bars in the figures. Statistical significance of the reduction in false negatives (order-of-magnitude improvement) was evaluated with paired Wilcoxon signed-rank tests (p < 0.01 for supervised models vs. baselines and LLMs). These additions provide the comprehensive evaluation requested.
  Revision: yes
- Referee: [Results] Feature importance (Results / Discussion): The assertion that process metrics and size metrics are consistently most predictive requires explicit support via ranked importance scores, ablation studies, or SHAP analysis; the current text leaves this claim qualitative.
  Authors: We have added quantitative support in the revised Results and Discussion. Feature importance rankings from Random Forest, XGBoost, and CatBoost are now shown in a table, confirming process metrics (age, churn, developer activity) and size metrics as the top predictors. We include SHAP summary plots and ablation experiments: removing the process+size group drops recall by 18-22% across models, while other groups have smaller effects. These results are reported with statistical tests and directly support the original claim.
  Revision: yes
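The balancing step mentioned in the first response (undersampling the larger class until it matches the smaller one) can be sketched as follows. This is a simplified illustration using plain random undersampling with a fixed seed, not the stratified procedure the rebuttal describes; the record layout and the `residual` field name are assumptions made for the example.

```python
import random
from collections import Counter


def undersample(rows, label_key="residual", seed=0):
    """Randomly drop majority-class rows until both classes have equal size.

    `rows` is a list of dicts; `label_key` holds 1 (residual) or 0 (non-residual).
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    pos = [r for r in rows if r[label_key] == 1]
    neg = [r for r in rows if r[label_key] == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.sample(majority, len(minority))
    balanced = minority + kept
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 30 residual and 100 non-residual faults.
rows = [{"id": i, "residual": 1} for i in range(30)]
rows += [{"id": i, "residual": 0} for i in range(30, 130)]

balanced = undersample(rows)
print(Counter(r["residual"] for r in balanced))
```

One design point worth noting: if balancing is applied before the train/test split, the test set no longer reflects the natural class ratio, which affects how precision should be interpreted.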
Circularity Check
No circularity: standard empirical ML evaluation on held-out data
The paper is a purely empirical supervised learning study. It constructs a balanced dataset of labeled faults, extracts metrics, trains RandomForest/XGBoost/CatBoost models, and reports recall on cross-project held-out folds. These performance numbers are computed from test-set predictions rather than being forced by construction, self-definition, or self-citation chains. No mathematical derivations, ansatzes, uniqueness theorems, or renamings of known results appear. Dataset construction choices (balancing, labeling source) affect validity but do not create circular reductions of the reported results to the inputs.
Axiom & Free-Parameter Ledger
free parameters (2)
- ML model hyperparameters
- Classification threshold
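The classification threshold is listed as a free parameter because recall depends directly on it. A small sketch of that dependence is shown below; the scores and labels are hypothetical and do not come from the paper.

```python
def recall_precision_at(scores, labels, threshold):
    """Recall and precision when predicting 'residual' for scores >= threshold."""
    tp = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


# Hypothetical model scores and true labels (1 = residual fault).
SCORES = [0.95, 0.80, 0.65, 0.55, 0.45, 0.40, 0.30, 0.10]
LABELS = [1, 1, 1, 0, 1, 0, 0, 0]

for t in (0.4, 0.5, 0.7):
    r, p = recall_precision_at(SCORES, LABELS, t)
    print(f"threshold={t:.1f} recall={r:.2f} precision={p:.2f}")
```

Lowering the threshold raises recall at the cost of precision, so a reported recall is only comparable across models when the threshold selection rule is stated.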
axioms (2)
- domain assumption: Fault labels in the dataset are accurate and unbiased
- domain assumption: Cross-project splits preserve independence between training and test data
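The second assumption can be checked mechanically. The sketch below builds leave-one-project-out folds and audits each fold for the three leakage routes the referee names: shared projects, shared developers, and temporal overlap. It is an illustration under assumed field names (`project`, `developer`, `committed`) and toy data, not the paper's actual splitting procedure.

```python
from datetime import date


def leave_one_project_out(samples):
    """Yield (held_out_project, train, test) with the split made at project level."""
    for held_out in sorted({s["project"] for s in samples}):
        train = [s for s in samples if s["project"] != held_out]
        test = [s for s in samples if s["project"] == held_out]
        yield held_out, train, test


def audit_fold(train, test):
    """Return a list of leakage problems found in one fold (empty if clean)."""
    problems = []
    if {s["project"] for s in train} & {s["project"] for s in test}:
        problems.append("shared projects")
    shared_devs = {s["developer"] for s in train} & {s["developer"] for s in test}
    if shared_devs:
        problems.append(f"shared developers: {sorted(shared_devs)}")
    if train and test:
        latest_train = max(s["committed"] for s in train)
        earliest_test = min(s["committed"] for s in test)
        if latest_train >= earliest_test:
            problems.append("training data not strictly earlier than test data")
    return problems


# Hypothetical samples; dev2 contributes to both alpha and beta.
samples = [
    {"project": "alpha", "developer": "dev1", "committed": date(2021, 1, 10)},
    {"project": "alpha", "developer": "dev2", "committed": date(2021, 3, 5)},
    {"project": "beta", "developer": "dev2", "committed": date(2022, 2, 1)},
    {"project": "beta", "developer": "dev3", "committed": date(2022, 6, 9)},
    {"project": "gamma", "developer": "dev4", "committed": date(2023, 4, 2)},
]

reports = {name: audit_fold(tr, te)
           for name, tr, te in leave_one_project_out(samples)}
for name, problems in reports.items():
    print(name, problems or "clean")
```

In this toy data only the fold that holds out `gamma` passes all three checks, which shows that a project-level split alone does not rule out developer or temporal leakage.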