Confident Learning-based Network for Detecting Bug-Inducing Commits on SZZ with Noisy Labels
Pith reviewed 2026-06-29 11:39 UTC · model grok-4.3
The pith
BIC-Hunter cleans noisy SZZ labels with confident learning and uses graph convolutions on homogeneous commit graphs to raise recall in bug-inducing commit detection.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The BIC-Hunter model mitigates inaccurate annotations and inconsistencies in SZZ-labeled data through a confident-learning denoising component and constructs homogeneous graphs processed by graph convolutional networks to capture semantic relationships among commits, thereby improving the accuracy of bug-inducing commit identification over prior methods.
What carries the argument
Confident learning applied to label correction, paired with homogeneous graph construction and graph convolutional networks for semantic context analysis.
If this is right
- Training sets for JIT defect prediction become more reliable after noise correction.
- Graph convolutional networks on commit graphs recover semantic links that earlier models missed.
- Overall robustness to real-world label noise increases.
- Recall@1, Recall@2, Recall@3 and MFR all rise by the reported margins on merged open-source data.
Where Pith is reading between the lines
- The same denoising-plus-graph pattern could be tested on other noisy-label tasks such as vulnerability or smell detection.
- Replacing the homogeneous graph with a heterogeneous one that also links files and developers might further improve context capture.
- The method supplies a concrete way to measure how much label noise currently limits existing SZZ-based detectors.
Load-bearing premise
Confident learning can reliably spot and fix inaccurate SZZ annotations without adding new biases that lower downstream detection performance.
What would settle it
On a hold-out set of commits whose inducing status has been independently verified by multiple developers, measure whether applying the confident-learning stage increases final Recall@K compared with training on the raw noisy labels.
Figures
read the original abstract
The Just-In-Time (JIT) defect prediction model serves as a critical tool for ensuring the quality of software development and enhancing software performance. It assists development teams in promptly identifying and addressing potential issues by predicting whether code submissions may introduce defects. However, due to the existence of data noise and insufficient semantic connections in real-world scenarios, existing approaches face challenges in accurately identifying the code commits that introduce defects and capturing the potential semantic relationships. To address these challenges, we propose the BIC- Hunter(Bug-Inducing Commits Hunter) model, which mitigates data noise and improves semantic understanding, thereby enhancing the accuracy of bug-inducing commit identification. BIC - Hunter model consists of two components: a data denoising component and a semantic relationship capturing component. Specifically, the data denoising component addresses the challenges posed by inaccurate annotations and inconsistencies in real-world data, enhancing the reliability of training data and improving overall model robustness. The semantic relation- ship capturing component constructs homogeneous graphs and applies graph convolutional networks to facilitate a more comprehensive analysis of code context, enabling the identification of defects caused by code commits and enhancing the confidence in pinpointing their root causes. Experimental studies on a large-scale dataset integrated from three open-source datasets show that BIC- Hunter exhibits outstanding performance. BIC- Hunter outperforms the state-of-the-art by 6.16%, 7.13%, and 5.53% on Recall@1, Recall@2, and Recall@3, respectively, while the MFR index increases by 8.43% to 32.82%. These results demonstrate the superior capability of our method in identifying bug-inducing commits.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes BIC-Hunter, a model for identifying bug-inducing commits (BICs) that combines a confident-learning-based data-denoising component to mitigate noisy SZZ labels with a semantic-relationship component that builds homogeneous graphs and applies graph convolutional networks. Experiments on a large-scale dataset formed by integrating three open-source projects report that BIC-Hunter outperforms prior work by 6.16%, 7.13%, and 5.53% on Recall@1, Recall@2, and Recall@3 respectively, while improving the MFR index by 8.43% to 32.82%.
Significance. If the reported gains are shown to be robust after proper controls for baselines, dataset construction, and component validation, the work would address two recognized difficulties in just-in-time defect prediction: label noise arising from the SZZ algorithm and limited modeling of semantic context among commits. The combination of confident learning for denoising and GCNs on homogeneous graphs is a plausible direction, but its practical impact cannot yet be judged from the supplied information.
major comments (1)
- [Abstract] Abstract: performance numbers are stated without any description of the baselines, the construction or size of the integrated dataset, statistical significance testing, or ablation/validation results for the denoising and graph components; consequently the central empirical claim cannot be assessed.
Simulated Author's Rebuttal
We thank the referee for their review. The single major comment concerns the abstract's lack of detail on baselines, dataset, significance testing, and ablations. We address this below.
read point-by-point responses
-
Referee: [Abstract] Abstract: performance numbers are stated without any description of the baselines, the construction or size of the integrated dataset, statistical significance testing, or ablation/validation results for the denoising and graph components; consequently the central empirical claim cannot be assessed.
Authors: We agree the abstract is concise and omits these specifics, which are instead provided in the body (dataset construction and size in Section 4.1, baselines in Section 4.2, significance testing in Section 5.3, and component ablations in Section 5.4). To make the central claim more self-contained, we will revise the abstract to briefly note the integrated dataset from three projects, the SOTA baselines compared, and that full validation details appear in the experiments section. revision: yes
Circularity Check
No significant circularity
full rationale
The abstract and available description present an empirical ML model (BIC-Hunter) with two components: confident learning for denoising SZZ labels and GCN on homogeneous graphs for semantic capture. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains are quoted or present. Performance metrics are reported as experimental outcomes on integrated datasets rather than derived results. Without load-bearing mathematical steps that reduce to inputs by construction, the derivation chain (such as it is) is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008
Sunghun Kim, E James Whitehead, and Yi Zhang. Classifying software changes: Clean or buggy?IEEE Transactions on software engineering, 34(2):181–196, 2008
2008
-
[2]
A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6):757–773, 2012
Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi. A large-scale empirical study of just-in-time quality assurance.IEEE Transactions on Software Engineering, 39(6):757–773, 2012
2012
-
[3]
Towards reliable online just-in-time software defect prediction.IEEE Transactions on Software Engineering, 49(3):1342–1358, 2022
George G Cabral and Leandro L Minku. Towards reliable online just-in-time software defect prediction.IEEE Transactions on Software Engineering, 49(3):1342–1358, 2022
2022
-
[4]
Bridging expert knowledge with deep learning techniques for just-in-time defect prediction.Empirical Software Engineering, 30(1):1–44, 2025
Xin Zhou, DongGyun Han, and David Lo. Bridging expert knowledge with deep learning techniques for just-in-time defect prediction.Empirical Software Engineering, 30(1):1–44, 2025
2025
-
[5]
The impact of mislabeled changes by szz on just-in-time defect prediction.IEEE transactions on software engineering, 47(8):1559–1586, 2019
Yuanrui Fan, Xin Xia, Daniel Alencar Da Costa, David Lo, Ahmed E Hassan, and Shanping Li. The impact of mislabeled changes by szz on just-in-time defect prediction.IEEE transactions on software engineering, 47(8):1559–1586, 2019
2019
-
[6]
Do developers introduce bugs when they do not communicate? the case of eclipse and mozilla
Mario Luca Bernardi, Gerardo Canfora, Giuseppe A Di Lucca, Massimiliano Di Penta, and Damiano Distante. Do developers introduce bugs when they do not communicate? the case of eclipse and mozilla. In2012 16th European Conference on Software Maintenance and Reengineering, pages 139–148. IEEE, 2012
2012
-
[7]
How long does a bug survive? an empirical study
Gerardo Canfora, Michele Ceccarelli, Luigi Cerulo, and Massimiliano Di Penta. How long does a bug survive? an empirical study. In2011 18th Working Conference on Reverse Engineering, pages 191–200. IEEE, 2011
2011
-
[8]
Identifying failure inducing developer pairs within developer networks
Jordan Ell. Identifying failure inducing developer pairs within developer networks. In2013 35th International Conference on Software Engineering (ICSE), pages 1471–1473. IEEE, 2013
2013
-
[9]
Bug introducing changes: A case study with android
Muhammad Asaduzzaman, Michael C Bullock, Chanchal K Roy, and Kevin A Schneider. Bug introducing changes: A case study with android. In2012 9th IEEE Working Conference on Mining Software Repositories (MSR), pages 116–119. IEEE, 2012
2012
-
[10]
When do changes induce fixes?ACM sigsoft software engineering notes, 30(4):1–5, 2005
Jacek Śliwerski, Thomas Zimmermann, and Andreas Zeller. When do changes induce fixes?ACM sigsoft software engineering notes, 30(4):1–5, 2005
2005
-
[11]
Evaluating szz implementations through a developer-informed oracle
Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. Evaluating szz implementations through a developer-informed oracle. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 436–447. IEEE, 2021
2021
-
[12]
Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022
Steffen Herbold, Alexander Trautsch, Fabian Trautsch, and Benjamin Ledel. Problems with szz and features: An empirical study of the state of practice of defect prediction data collection.Empirical Software Engineering, 27(2):42, 2022. ACM Trans. Softw. Eng. Methodol., Vol. 1, No. 2, Article 3. Publication date: May 2025. 3:24 Sun et al
2022
-
[13]
An empirical study on the use of szz for identifying inducing changes of non-functional bugs.Empirical Software Engineering, 26(4):71, 2021
Sophia Quach, Maxime Lamothe, Yasutaka Kamei, and Weiyi Shang. An empirical study on the use of szz for identifying inducing changes of non-functional bugs.Empirical Software Engineering, 26(4):71, 2021
2021
-
[14]
The impact of refactoring changes on the szz algorithm: An empirical study
Edmilson Campos Neto, Daniel Alencar Da Costa, and Uirá Kulesza. The impact of refactoring changes on the szz algorithm: An empirical study. In2018 IEEE 25th international conference on software analysis, evolution and reengineering (SANER), pages 380–390. IEEE, 2018
2018
-
[15]
Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 2024
Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. Evaluating szz implementations: An empirical study on the linux kernel.IEEE Transactions on Software Engineering, 2024
2024
-
[16]
Automatic identification of bug-introducing changes
Sunghun Kim, Thomas Zimmermann, Kai Pan, E James Jr, et al. Automatic identification of bug-introducing changes. In21st IEEE/ACM international conference on automated software engineering (ASE’06), pages 81–90. IEEE, 2006
2006
-
[17]
Revisiting and improving szz implementations
Edmilson Campos Neto, Daniel Alencar Da Costa, and Uirá Kulesza. Revisiting and improving szz implementations. In2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–12. IEEE, 2019
2019
-
[18]
Neural szz algorithm
Lingxiao Tang, Lingfeng Bao, Xin Xia, and Zhongdong Huang. Neural szz algorithm. In2023 38th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 1024–1035. IEEE, 2023
2023
-
[19]
A comprehensive evaluation of szz variants through a developer-informed oracle.Journal of Systems and Software, 202:111729, 2023
Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. A comprehensive evaluation of szz variants through a developer-informed oracle.Journal of Systems and Software, 202:111729, 2023
2023
-
[20]
Exploring and exploiting the correlations between bug-inducing and bug-fixing commits
Ming Wen, Rongxin Wu, Yepang Liu, Yongqiang Tian, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. Exploring and exploiting the correlations between bug-inducing and bug-fixing commits. InProceedings of the 2019 27th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 326–337, 2019
2019
-
[21]
Heterogeneous graph attention network
Xiao Wang, Houye Ji, Chuan Shi, Bai Wang, Yanfang Ye, Peng Cui, and Philip S Yu. Heterogeneous graph attention network. InThe world wide web conference, pages 2022–2032, 2019
2022
-
[22]
Semi-Supervised Classification with Graph Convolutional Networks
Thomas N Kipf and Max Welling. Semi-supervised classification with graph convolutional networks.arXiv preprint arXiv:1609.02907, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[23]
Confident learning: Estimating uncertainty in dataset labels.Journal of Artificial Intelligence Research, 70:1373–1411, 2021
Curtis Northcutt, Lu Jiang, and Isaac Chuang. Confident learning: Estimating uncertainty in dataset labels.Journal of Artificial Intelligence Research, 70:1373–1411, 2021
2021
-
[24]
https://github.com/Vandbs/BIC-Hunter
Our project package. https://github.com/Vandbs/BIC-Hunter
-
[25]
Predicting risk of software changes.Bell Labs Technical Journal, 5(2):169–180, 2000
Audris Mockus and David M Weiss. Predicting risk of software changes.Bell Labs Technical Journal, 5(2):169–180, 2000
2000
-
[26]
Class noise vs
Xingquan Zhu and Xindong Wu. Class noise vs. attribute noise: A quantitative study.Artificial intelligence review, 22:177–210, 2004
2004
-
[27]
Hipikat: Recommending pertinent software development artifacts
Davor Cubranic and Gail C Murphy. Hipikat: Recommending pertinent software development artifacts. In25th International Conference on Software Engineering, 2003. Proceedings., pages 408–418. IEEE, 2003
2003
-
[28]
Analyzing and relating bug report data for feature tracking
Michael Fischer, Martin Pinzger, and Harald Gall. Analyzing and relating bug report data for feature tracking. In WCRE, volume 3, page 90, 2003
2003
-
[29]
A framework for evaluating the results of the szz approach for identifying bug-introducing changes.IEEE Transactions on Software Engineering, 43(7):641–657, 2016
Daniel Alencar Da Costa, Shane McIntosh, Weiyi Shang, Uirá Kulesza, Roberta Coelho, and Ahmed E Hassan. A framework for evaluating the results of the szz approach for identifying bug-introducing changes.IEEE Transactions on Software Engineering, 43(7):641–657, 2016
2016
-
[30]
Identifying bug-inducing changes for code additions
Emre Sahal and Ayse Tosun. Identifying bug-inducing changes for code additions. InProceedings of the 12th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, pages 1–2, 2018
2018
-
[31]
V-szz: automatic identification of version ranges affected by cve vulnerabilities
Lingfeng Bao, Xin Xia, Ahmed E Hassan, and Xiaohu Yang. V-szz: automatic identification of version ranges affected by cve vulnerabilities. InProceedings of the 44th International Conference on Software Engineering, pages 2352–2364, 2022
2022
-
[32]
Pr-szz: How pull requests can support the tracing of defects in software repositories
Peter Bludau and Alexander Pretschner. Pr-szz: How pull requests can support the tracing of defects in software repositories. In2022 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER), pages 1–12. IEEE, 2022
2022
-
[33]
Identifying root cause of bugs by capturing changed code lines with relational graph neural networks
Jiaqi Zhang, Shikai Guo, Hui Li, Chenchen Li, Yu Chai, and Rong Chen. Identifying root cause of bugs by capturing changed code lines with relational graph neural networks. https://arxiv.org/abs/2505.00990,
-
[34]
Detecting the root cause code lines in bug-fixing commits by heterogeneous graph learning
Liguo Ji, Shikai Guo, Lehuan Zhang, Hui Li, Yu Chai, and Rong Chen. Detecting the root cause code lines in bug-fixing commits by heterogeneous graph learning. http://export.arxiv.org/abs/2505.01022
-
[35]
Getting defect prediction into industrial practice: the elff tool
David Bowes, Steve Counsell, Tracy Hall, Jean Petric, and Thomas Shippey. Getting defect prediction into industrial practice: the elff tool. In2017 IEEE International Symposium on Software Reliability Engineering Workshops (ISSREW), pages 44–47. IEEE, 2017
2017
-
[36]
CodeBERT: A Pre-Trained Model for Programming and Natural Languages
Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, et al. Codebert: A pre-trained model for programming and natural languages.arXiv preprint arXiv:2002.08155, 2020. ACM Trans. Softw. Eng. Methodol., Vol. 1, No. 2, Article 3. Publication date: May 2025. Confident Learning-based Network for...
work page internal anchor Pith review Pith/arXiv arXiv 2002
-
[37]
From ranknet to lambdarank to lambdamart: An overview.Learning, 11(23-581):81, 2010
Christopher JC Burges. From ranknet to lambdarank to lambdamart: An overview.Learning, 11(23-581):81, 2010
2010
-
[38]
Yahoo! learning to rank challenge overview
Olivier Chapelle and Yi Chang. Yahoo! learning to rank challenge overview. InProceedings of the learning to rank challenge, pages 1–24. PMLR, 2011
2011
-
[39]
Adapting deep ranknet for personalized search
Yang Song, Hongning Wang, and Xiaodong He. Adapting deep ranknet for personalized search. InProceedings of the 7th ACM international conference on Web search and data mining, pages 83–92, 2014
2014
-
[40]
On application of learning to rank for e-commerce search
Shubhra Kanti Karmaker Santu, Parikshit Sondhi, and ChengXiang Zhai. On application of learning to rank for e-commerce search. InProceedings of the 40th international ACM SIGIR conference on research and development in information retrieval, pages 475–484, 2017
2017
-
[41]
Regminer: towards constructing a large regression dataset from code evolution history
Xuezhi Song, Yun Lin, Siang Hwee Ng, Yijian Wu, Xin Peng, Jin Song Dong, and Hong Mei. Regminer: towards constructing a large regression dataset from code evolution history. InProceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 314–326, 2022
2022
-
[42]
Radial basis functions.Acta numerica, 9:1–38, 2000
Martin Dietrich Buhmann. Radial basis functions.Acta numerica, 9:1–38, 2000
2000
-
[43]
A weighted gcn with logical adjacency matrix for relation extraction
Li Zhou, Tingyu Wang, Hong Qu, Li Huang, and Yuguo Liu. A weighted gcn with logical adjacency matrix for relation extraction. InECAI 2020, pages 2314–2321. IOS Press, 2020
2020
-
[44]
Adam: A Method for Stochastic Optimization
Diederik P Kingma. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014
work page internal anchor Pith review Pith/arXiv arXiv 2014
-
[45]
Isolation forest
Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou. Isolation forest. In2008 eighth ieee international conference on data mining, pages 413–422. IEEE, 2008
2008
-
[46]
Addressing the curse of imbalanced training sets: one-sided selection
Miroslav Kubat, Stan Matwin, et al. Addressing the curse of imbalanced training sets: one-sided selection. InIcml, volume 97, page 179. Citeseer, 1997
1997
-
[47]
Dealing with noise in defect prediction
Sunghun Kim, Hongyu Zhang, Rongxin Wu, and Liang Gong. Dealing with noise in defect prediction. InProceedings of the 33rd International Conference on Software Engineering, pages 481–490, 2011
2011
-
[48]
Random space division sampling for label-noisy classification or imbalanced classification.IEEE Transactions on Cybernetics, 52(10):10444–10457, 2021
Shuyin Xia, Yong Zheng, Guoyin Wang, Ping He, Heng Li, and Zizhong Chen. Random space division sampling for label-noisy classification or imbalanced classification.IEEE Transactions on Cybernetics, 52(10):10444–10457, 2021
2021
-
[49]
An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification
Washington Cunha, Celso França, Guilherme Fonseca, Leonardo Rocha, and Marcos André Gonçalves. An effective, efficient, and scalable confidence-based instance selection framework for transformer-based text classification. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 665–674, 2023
2023
-
[50]
Granular ball sampling for noisy label classification or imbalanced classification.IEEE Transactions on Neural Networks and Learning Systems, 34(4):2144–2155, 2021
Shuyin Xia, Shaoyuan Zheng, Guoyin Wang, Xinbo Gao, and Binggui Wang. Granular ball sampling for noisy label classification or imbalanced classification.IEEE Transactions on Neural Networks and Learning Systems, 34(4):2144–2155, 2021
2021
-
[51]
Deepergcn: All you need to train deeper gcns.arXiv preprint arXiv:2006.07739, 2020
Guohao Li, Chenxin Xiong, Ali Thabet, and Bernard Ghanem. Deepergcn: All you need to train deeper gcns.arXiv preprint arXiv:2006.07739, 2020
-
[52]
Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. Graph attention networks.arXiv preprint arXiv:1710.10903, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[53]
Relational Graph Attention Networks
Dan Busbridge, Dane Sherburn, Pietro Cavallo, and Nils Y Hammerla. Relational graph attention networks.arXiv preprint arXiv:1904.05811, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[54]
Topology Adaptive Graph Convolutional Networks
Jian Du, Shanghang Zhang, Guanhang Wu, José MF Moura, and Soummya Kar. Topology adaptive graph convolutional networks.arXiv preprint arXiv:1710.10370, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[55]
Modeling relational data with graph convolutional networks
Michael Schlichtkrull, Thomas N Kipf, Peter Bloem, Rianne Van Den Berg, Ivan Titov, and Max Welling. Modeling relational data with graph convolutional networks. InThe semantic web: 15th international conference, ESWC 2018, Heraklion, Crete, Greece, June 3–7, 2018, proceedings 15, pages 593–607. Springer, 2018. ACM Trans. Softw. Eng. Methodol., Vol. 1, No....
2018
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.