Duplicate Bug Report Detection: How Far Are We?
Pith reviewed 2026-05-24 09:59 UTC · model grok-4.3
The pith
Simpler duplicate bug report detection techniques outperform recent sophisticated ones on a corrected benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
After preparing a benchmark that corrects for data age and issue-tracking-system choice, evaluation shows that a simpler technique outperforms recently proposed sophisticated DBRD techniques on most projects, while a simple technique already adopted in practice achieves comparable results to a recently proposed research tool.
What carries the argument
The new benchmark constructed by correcting for data age and issue-tracking-system choice, which is then used to re-evaluate and rank DBRD techniques under conditions closer to current deployment.
If this is right
- New DBRD proposals should be compared against both recent research tools and the simple methods already used in practice.
- Benchmark construction for DBRD must explicitly control for data age to avoid inflated accuracy numbers.
- Industry tools provide a useful baseline that research should match or exceed before claiming improvement.
- Reported gains from sophisticated features may shrink or disappear once age and platform biases are removed.
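The point about controlling for data age can be illustrated with a minimal sketch. The records, field layout, and cutoff date below are hypothetical and are not taken from the paper's benchmark; the sketch only shows what restricting an evaluation set to recent reports might look like.

```python
from datetime import date

# Hypothetical records: (report_id, project, opened_on). Not data from the paper.
REPORTS = [
    ("r1", "proj-a", date(2012, 5, 1)),
    ("r2", "proj-a", date(2019, 3, 14)),
    ("r3", "proj-b", date(2020, 11, 2)),
    ("r4", "proj-b", date(2009, 7, 30)),
]


def recent_reports(reports, cutoff):
    """Keep only reports opened on or after `cutoff`."""
    return [r for r in reports if r[2] >= cutoff]


# The cutoff is an illustrative assumption, not the paper's threshold.
recent = recent_reports(REPORTS, cutoff=date(2018, 1, 1))
print([r[0] for r in recent])
```

Evaluating the same technique on `REPORTS` and on `recent` would give the kind of before/after comparison that exposes an age effect.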
Where Pith is reading between the lines
- Similar age and platform biases may affect other detection tasks such as duplicate question detection or code-clone detection.
- Future work could test whether the simple techniques remain competitive when the task is extended to cross-project or cross-language settings.
- The finding suggests that research effort might usefully shift from accuracy gains to other properties such as speed or ease of integration.
Load-bearing premise
The corrected benchmark gives a realistic estimate of how the compared techniques would perform if used on today's projects.
What would settle it
Re-running the same set of techniques on bug reports opened after the benchmark's latest date and checking whether the simple method still ranks highest on most projects.
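As a rough sketch of that check, the snippet below counts how often each technique has the top score across projects. The project names, technique labels, and scores are invented for illustration, and "most projects" is assumed here to mean a strict majority.

```python
from collections import Counter


def top_technique_counts(scores):
    """Count, per technique, the projects on which it has the highest score.

    `scores` maps project -> {technique: metric value}. Ties go to the
    alphabetically first technique, purely to keep the result deterministic.
    """
    winners = Counter()
    for by_technique in scores.values():
        best = max(sorted(by_technique), key=by_technique.get)
        winners[best] += 1
    return winners


def ranks_highest_on_most(scores, technique):
    """True if `technique` wins on a strict majority of projects."""
    return top_technique_counts(scores)[technique] > len(scores) / 2


# Hypothetical scores on reports opened after the benchmark's latest date.
NEW_SCORES = {
    "proj-a": {"simple": 0.61, "sophisticated": 0.55},
    "proj-b": {"simple": 0.48, "sophisticated": 0.52},
    "proj-c": {"simple": 0.70, "sophisticated": 0.66},
}
print(top_technique_counts(NEW_SCORES))
print(ranks_highest_on_most(NEW_SCORES, "simple"))
```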
Original abstract
Many Duplicate Bug Report Detection (DBRD) techniques have been proposed in the research literature. Industry uses other techniques. Unfortunately, there is insufficient comparison among them, and it is unclear how far we have come. This work fills this gap by comparing the aforementioned techniques. To compare them, we first need a benchmark that can estimate how a tool would perform if applied in a realistic setting today. Thus, we first investigated potential biases that affect the fair comparison of the accuracy of DBRD techniques. Our experiments suggest that data age and issue tracking system choice cause a significant difference. Based on these findings, we prepared a new benchmark. We then used it to evaluate DBRD techniques to better estimate how far we have come. Surprisingly, a simpler technique outperforms recently proposed sophisticated techniques on most projects in our benchmark. In addition, we compared the DBRD techniques proposed in research with those used in Mozilla and VSCode. Surprisingly, we observe that a simple technique already adopted in practice can achieve results comparable to those of a recently proposed research tool. Our study gives reflections on the current state of DBRD, and we share our insights to benefit future DBRD research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper identifies two sources of bias (data age and issue-tracking-system choice) that affect fair evaluation of duplicate bug report detection (DBRD) techniques, constructs a new benchmark that corrects for these biases, and uses it to compare research-proposed DBRD methods against each other and against techniques already deployed in Mozilla and VSCode. The central empirical finding is that simpler techniques outperform recently proposed sophisticated ones on most projects in the benchmark, and that a simple technique already in industrial use achieves results comparable to a recent research tool.
Significance. If the benchmark construction is accepted as yielding realistic performance estimates, the result directly challenges the incremental value of complex DBRD methods and supplies a reusable, bias-corrected resource for future work. The explicit comparison against deployed industrial techniques is a strength that grounds the claims in practice.
major comments (2)
- [Benchmark preparation] Benchmark construction section: the claim that correcting only for data age and ITS choice produces a realistic estimate of current-day performance rests on the assumption that these are the dominant biases; the paper should report the quantitative impact of each correction (e.g., change in MAP or recall@K before/after) and any sensitivity analysis on the age threshold chosen.
- [Evaluation] Evaluation results: the statement that a simpler technique outperforms sophisticated ones “on most projects” requires the per-project breakdown (including number of projects, effect sizes, and statistical significance tests) to be shown; without these, the aggregate claim cannot be assessed for robustness.
minor comments (2)
- [Abstract] Abstract and introduction should cite the specific prior DBRD papers whose techniques are re-evaluated so readers can immediately locate the baselines.
- [Evaluation] Notation for the performance metrics (MAP, recall@K, etc.) should be defined at first use and kept consistent across tables and text.
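For readers unfamiliar with the metrics named in these comments, here is a minimal sketch of one common convention. It assumes recall@K is computed as a hit rate (a query counts as a hit if any true duplicate appears in the top K) and MAP as the mean of per-query average precision. The paper's exact definitions may differ, and the queries below are made up.

```python
def hit_at_k(ranked, relevant, k):
    """1.0 if any relevant report is among the top-k candidates, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0


def average_precision(ranked, relevant):
    """Average of precision values at each position holding a relevant report."""
    hits, total = 0, 0.0
    for position, candidate in enumerate(ranked, start=1):
        if candidate in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant) if relevant else 0.0


# Hypothetical queries: (ranked candidate ids, set of true duplicate ids).
QUERIES = [
    (["b3", "b1", "b7"], {"b1"}),
    (["b2", "b9", "b4"], {"b2"}),
]

map_score = sum(average_precision(r, rel) for r, rel in QUERIES) / len(QUERIES)
recall_rate_at_1 = sum(hit_at_k(r, rel, 1) for r, rel in QUERIES) / len(QUERIES)
print(map_score, recall_rate_at_1)
```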
Simulated Author's Rebuttal
We thank the referee for the positive assessment and constructive comments on our paper. We address the major comments below and will incorporate the suggested changes in the revised manuscript.
- Referee: Benchmark construction section: the claim that correcting only for data age and ITS choice produces a realistic estimate of current-day performance rests on the assumption that these are the dominant biases; the paper should report the quantitative impact of each correction (e.g., change in MAP or recall@K before/after) and any sensitivity analysis on the age threshold chosen.
  Authors: We agree that reporting the quantitative impact of each correction and including a sensitivity analysis would increase transparency. Our prior experiments identified data age and ITS choice as the biases producing statistically significant differences, but we will add explicit before/after metric tables (MAP, recall@K) for each correction step and a sensitivity analysis varying the age threshold in the revised benchmark-preparation section. Revision: yes.
- Referee: Evaluation results: the statement that a simpler technique outperforms sophisticated ones “on most projects” requires the per-project breakdown (including number of projects, effect sizes, and statistical significance tests) to be shown; without these, the aggregate claim cannot be assessed for robustness.
  Authors: We will expand the evaluation section to present the full per-project breakdown. The revision will report the exact number of projects, per-project performance values, effect sizes, and the results of statistical significance tests (e.g., paired Wilcoxon tests) so that the claim of outperformance “on most projects” can be directly verified. Revision: yes.
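To make the proposed paired test concrete, the sketch below computes the Wilcoxon signed-rank sums for two techniques scored on the same projects. The scores are hypothetical and are given as integer percentage points so the arithmetic is exact. Only the rank sums are computed here; a p-value would come from a statistics library such as `scipy.stats.wilcoxon` or from published tables.

```python
def wilcoxon_signed_rank(x, y):
    """Return (W_plus, W_minus) for paired samples x and y.

    Zero differences are dropped, and tied absolute differences share
    their average rank.
    """
    diffs = sorted((a - b for a, b in zip(x, y) if a != b), key=abs)
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(diffs):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(diffs) and abs(diffs[j + 1]) == abs(diffs[i]):
            j += 1
        average_rank = (i + j) / 2 + 1  # ranks are 1-based
        for t in range(i, j + 1):
            ranks[t] = average_rank
        i = j + 1
    w_plus = sum(r for d, r in zip(diffs, ranks) if d > 0)
    w_minus = sum(r for d, r in zip(diffs, ranks) if d < 0)
    return w_plus, w_minus


# Hypothetical per-project scores (percentage points), one entry per project.
SIMPLE = [61, 48, 70, 55, 66]
SOPHISTICATED = [55, 52, 66, 54, 58]
print(wilcoxon_signed_rank(SIMPLE, SOPHISTICATED))
```

A large gap between the two sums suggests a consistent direction of difference, though with only a handful of projects the test has little power.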
Circularity Check
No significant circularity; empirical evaluation on external data
The paper performs an empirical study: it identifies biases (data age, issue-tracking system) via experiments on project data, constructs a benchmark accordingly, then measures accuracy of existing DBRD techniques (research and industry) on that benchmark. No derivation chain, equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear; the central claims are direct observations from performance metrics on real bug-report corpora. The structure is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: standard information-retrieval metrics (precision, recall, etc.) and similarity measures are appropriate for evaluating DBRD techniques.
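As an illustration of what a similarity measure does in this setting, the sketch below ranks candidate reports against a new report by Jaccard token overlap. This is not the technique evaluated in the paper; the function names and report texts are invented for the example.

```python
def tokens(text):
    """Lowercase whitespace tokenization into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_candidates(query, candidates):
    """Return candidate ids ordered from most to least similar to `query`."""
    q = tokens(query)
    scored = [(jaccard(q, tokens(text)), cid) for cid, text in candidates.items()]
    # Sort by descending score, then by id to break ties deterministically.
    return [cid for _, cid in sorted(scored, key=lambda p: (-p[0], p[1]))]


# Hypothetical existing reports and a new incoming report.
CANDIDATES = {
    "b1": "editor crashes when opening large file",
    "b2": "dark theme colors wrong in settings",
    "b3": "crash on opening file",
}
ranking = rank_candidates("editor crash when opening a large file", CANDIDATES)
print(ranking)
```

The resulting ranking is the kind of output that metrics such as recall@K and MAP are then computed over.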