Duplicate Bug Report Detection: How Far Are We?

Bowen Xu; David Lo; DongGyun Han; Ferdian Thung; Ivana Clairine Irsan; Lingxiao Jiang; Ting Zhang; Venkatesh Vinayakarao

arxiv: 2212.00548 · v1 · submitted 2022-12-01 · 💻 cs.SE

Ting Zhang, DongGyun Han, Venkatesh Vinayakarao, Ivana Clairine Irsan, Bowen Xu, Ferdian Thung, David Lo, Lingxiao Jiang

Pith reviewed 2026-05-24 09:59 UTC · model grok-4.3

classification 💻 cs.SE

keywords duplicate bug report detection · benchmark construction · empirical evaluation · issue tracking systems · software maintenance · performance comparison · data bias

The pith

Simpler duplicate bug report detection techniques outperform recent sophisticated ones on a corrected benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares research and industry Duplicate Bug Report Detection (DBRD) techniques after identifying two major biases that had skewed prior evaluations: data age and the choice of issue tracking system both produce significant differences in measured accuracy. The authors build a new benchmark that removes those biases and then run the comparison again. On this benchmark a simpler technique beats the sophisticated methods proposed in recent papers on most projects, and a simple method already used in practice achieves results comparable to those of one recent research tool. The result indicates that reported progress in the area rests partly on unrealistic test conditions.

Core claim

After preparing a benchmark that corrects for data age and issue-tracking-system choice, evaluation shows that a simpler technique outperforms recently proposed sophisticated DBRD techniques on most projects, while a simple technique already adopted in practice achieves comparable results to a recently proposed research tool.

What carries the argument

The new benchmark constructed by correcting for data age and issue-tracking-system choice, which is then used to re-evaluate and rank DBRD techniques under conditions closer to current deployment.

If this is right

New DBRD proposals should be compared against both recent research tools and the simple methods already used in practice.
Benchmark construction for DBRD must explicitly control for data age to avoid inflated accuracy numbers.
Industry tools provide a useful baseline that research should match or exceed before claiming improvement.
Reported gains from sophisticated features may shrink or disappear once age and platform biases are removed.
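
As a rough illustration of what controlling for data age could look like, the sketch below keeps only reports created inside a recent window and then splits them chronologically, so that every test report is at least as new as every training report. The `BugReport` class, the window dates, and the 25% test fraction are assumptions made for this example and are not taken from the paper.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Tuple


@dataclass(frozen=True)
class BugReport:
    """Hypothetical minimal record; real reports carry many more fields."""
    report_id: str
    created: date
    its: str  # issue tracking system, e.g. "Bugzilla", "Jira", "GitHub"


def recent_window(reports: List[BugReport], start: date, end: date) -> List[BugReport]:
    """Keep reports created in [start, end], ordered from oldest to newest."""
    kept = [r for r in reports if start <= r.created <= end]
    return sorted(kept, key=lambda r: r.created)


def chronological_split(
    reports: List[BugReport], test_fraction: float = 0.25
) -> Tuple[List[BugReport], List[BugReport]]:
    """Put the newest reports in the test set; never shuffle across time."""
    ordered = sorted(reports, key=lambda r: r.created)
    cut = len(ordered) - int(len(ordered) * test_fraction)
    return ordered[:cut], ordered[cut:]


# Illustrative data only: one old report and four recent ones.
example_reports = [
    BugReport("r1", date(2012, 3, 1), "Bugzilla"),
    BugReport("r2", date(2018, 5, 1), "Bugzilla"),
    BugReport("r3", date(2019, 1, 15), "Bugzilla"),
    BugReport("r4", date(2019, 7, 1), "Bugzilla"),
    BugReport("r5", date(2020, 2, 1), "Bugzilla"),
]
recent = recent_window(example_reports, date(2018, 1, 1), date(2020, 12, 31))
train, test = chronological_split(recent, test_fraction=0.25)
```

Running the same pipeline separately per issue tracking system (the `its` field) would be one way to keep the second bias visible rather than averaged away.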

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar age and platform biases may affect other detection tasks such as duplicate question detection or code-clone detection.
Future work could test whether the simple techniques remain competitive when the task is extended to cross-project or cross-language settings.
The finding suggests that research effort might usefully shift from accuracy gains to other properties such as speed or ease of integration.

Load-bearing premise

The corrected benchmark gives a realistic estimate of how the compared techniques would perform if used on today's projects.

What would settle it

Re-running the same set of techniques on bug reports opened after the benchmark's latest date and checking whether the simple method still ranks highest on most projects.
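
The check described above amounts to counting, per project, which technique scores best, and then asking whether one technique leads on more than half of the projects. A minimal sketch follows, assuming a nested mapping from project to technique to a metric value; the project names, technique names, and numbers are invented placeholders, not results from the paper.

```python
from typing import Dict


def wins_by_technique(scores: Dict[str, Dict[str, float]]) -> Dict[str, int]:
    """Count the projects on which each technique is the sole top scorer."""
    wins: Dict[str, int] = {}
    for per_technique in scores.values():
        best = max(per_technique.values())
        leaders = [name for name, value in per_technique.items() if value == best]
        if len(leaders) == 1:  # ties are credited to nobody
            wins[leaders[0]] = wins.get(leaders[0], 0) + 1
    return wins


def ranks_highest_on_most(scores: Dict[str, Dict[str, float]], technique: str) -> bool:
    """True if the technique is the sole leader on more than half of the projects."""
    return wins_by_technique(scores).get(technique, 0) > len(scores) / 2


# Placeholder numbers for illustration only.
example_scores = {
    "project_a": {"simple": 0.60, "complex": 0.55},
    "project_b": {"simple": 0.50, "complex": 0.52},
    "project_c": {"simple": 0.70, "complex": 0.65},
}
simple_leads = ranks_highest_on_most(example_scores, "simple")
```

A raw win count says nothing about how large the gaps are, which is why the referee report below also asks for effect sizes and significance tests.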

Figures

Figures reproduced from arXiv: 2212.00548 by Bowen Xu, David Lo, DongGyun Han, Ferdian Thung, Ivana Clairine Irsan, Lingxiao Jiang, Ting Zhang, Venkatesh Vinayakarao.

**Figure 2.** An example of VSCodeBot duplicate issue recommendation (issue 75817) from the Microsoft/VSCode repository.

**Figure 3.** Examples of the predictions in the top-10 positions for 4 test BRs.

**Figure 4.** The workflow of retrieving the correct bucket.

**Figure 5.** Recall Rate@k in the test data of Eclipse, Mozilla, Hadoop, Spark, Kibana, and VSCode. (Source page footer: J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018.)

**Figure 6.** REP compared to the other four approaches in terms of successful predictions.

**Figure 7.** Recall Rate@k comparing the tools in research and in practice on the VSCode data.

**Figure 8.** Recall Rate@k in the test data of Eclipse-Old, Mozilla-Old

Original abstract

Many Duplicate Bug Report Detection (DBRD) techniques have been proposed in the research literature. The industry uses some other techniques. Unfortunately, there is insufficient comparison among them, and it is unclear how far we have been. This work fills this gap by comparing the aforementioned techniques. To compare them, we first need a benchmark that can estimate how a tool would perform if applied in a realistic setting today. Thus, we first investigated potential biases that affect the fair comparison of the accuracy of DBRD techniques. Our experiments suggest that data age and issue tracking system choice cause a significant difference. Based on these findings, we prepared a new benchmark. We then used it to evaluate DBRD techniques to estimate better how far we have been. Surprisingly, a simpler technique outperforms recently proposed sophisticated techniques on most projects in our benchmark. In addition, we compared the DBRD techniques proposed in research with those used in Mozilla and VSCode. Surprisingly, we observe that a simple technique already adopted in practice can achieve comparable results as a recently proposed research tool. Our study gives reflections on the current state of DBRD, and we share our insights to benefit future DBRD research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's main point is that, on a bias-corrected benchmark, a simpler DBRD technique beats recent research proposals on most projects, and a simple technique already used in industry matches a recent research tool.

The new element here is the benchmark, which tries to correct for data age and issue-tracking-system choice after the authors ran experiments showing that those factors shift results. They then run a head-to-head comparison of research methods plus the ones actually used at Mozilla and VSCode. The headline finding is that a basic approach does better than the fancier recent proposals on most projects, and one deployed simple method comes close to a research tool. That direct industry comparison is useful and not common in this corner of the literature. The work also surfaces concrete reflections on why over-engineering may not pay off. Credit for doing the bias checks first instead of assuming the data is clean.

The argument holds together on its own terms: they target the biases they measured rather than ignoring them. The main soft spot is that the benchmark's claim to realism still hinges on how well the age and system corrections generalize beyond the projects they picked; small changes in selection could move the numbers. The abstract gives no raw counts or significance tests, so the size of the gaps is hard to judge without the full tables. No code or data release is mentioned, which limits quick checks.

This is a solid empirical study for people who build or evaluate duplicate detection in bug trackers. It will interest SE researchers who want to know what actually ships versus what gets published. It is worth sending to referees because the setup is straightforward, the industry angle adds value, and the results push back on a common assumption in the subfield.

Referee Report

2 major / 2 minor

Summary. The paper identifies two sources of bias (data age and issue-tracking-system choice) that affect fair evaluation of duplicate bug report detection (DBRD) techniques, constructs a new benchmark that corrects for these biases, and uses it to compare research-proposed DBRD methods against each other and against techniques already deployed in Mozilla and VSCode. The central empirical finding is that simpler techniques outperform recently proposed sophisticated ones on most projects in the benchmark, and that a simple technique already in industrial use achieves results comparable to a recent research tool.

Significance. If the benchmark construction is accepted as yielding realistic performance estimates, the result directly challenges the incremental value of complex DBRD methods and supplies a reusable, bias-corrected resource for future work. The explicit comparison against deployed industrial techniques is a strength that grounds the claims in practice.

major comments (2)

[Benchmark preparation] Benchmark construction section: the claim that correcting only for data age and ITS choice produces a realistic estimate of current-day performance rests on the assumption that these are the dominant biases; the paper should report the quantitative impact of each correction (e.g., change in MAP or recall@K before/after) and any sensitivity analysis on the age threshold chosen.
[Evaluation] Evaluation results: the statement that a simpler technique outperforms sophisticated ones “on most projects” requires the per-project breakdown (including number of projects, effect sizes, and statistical significance tests) to be shown; without these, the aggregate claim cannot be assessed for robustness.

minor comments (2)

[Abstract] Abstract and introduction should cite the specific prior DBRD papers whose techniques are re-evaluated so readers can immediately locate the baselines.
[Evaluation] Notation for the performance metrics (MAP, recall@K, etc.) should be defined at first use and kept consistent across tables and text.
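
For readers unfamiliar with the notation, the sketch below shows one common way to compute Recall Rate@k and MAP for a ranked retrieval task. The function names and the toy data are assumptions for illustration, and the paper's exact definitions may differ in detail. When each query has exactly one correct answer, average precision reduces to the reciprocal of that answer's rank.

```python
from typing import Dict, List, Set


def recall_rate_at_k(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]], k: int
) -> float:
    """Fraction of queries with at least one correct candidate in the top k."""
    if not ranked:
        return 0.0
    hits = sum(1 for q, cands in ranked.items() if relevant[q] & set(cands[:k]))
    return hits / len(ranked)


def mean_average_precision(
    ranked: Dict[str, List[str]], relevant: Dict[str, Set[str]]
) -> float:
    """Mean over queries of the average precision of each ranked list."""
    if not ranked:
        return 0.0
    total = 0.0
    for q, cands in ranked.items():
        found = 0
        precision_sum = 0.0
        for position, cand in enumerate(cands, start=1):
            if cand in relevant[q]:
                found += 1
                precision_sum += found / position
        if relevant[q]:
            total += precision_sum / len(relevant[q])
    return total / len(ranked)


# Toy data: three query reports, each with one true duplicate target.
toy_ranked = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"], "q3": ["g", "h", "i"]}
toy_relevant = {"q1": {"a"}, "q2": {"f"}, "q3": {"z"}}

rr_at_1 = recall_rate_at_k(toy_ranked, toy_relevant, 1)   # only q1 hits at rank 1
rr_at_3 = recall_rate_at_k(toy_ranked, toy_relevant, 3)   # q1 and q2 hit within 3
toy_map = mean_average_precision(toy_ranked, toy_relevant)  # (1 + 1/3 + 0) / 3
```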

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments on our paper. We address the major comments below and will incorporate the suggested changes in the revised manuscript.

Referee: Benchmark construction section: the claim that correcting only for data age and ITS choice produces a realistic estimate of current-day performance rests on the assumption that these are the dominant biases; the paper should report the quantitative impact of each correction (e.g., change in MAP or recall@K before/after) and any sensitivity analysis on the age threshold chosen.

Authors: We agree that reporting the quantitative impact of each correction and including a sensitivity analysis would increase transparency. Our prior experiments identified data age and ITS choice as the biases producing statistically significant differences, but we will add explicit before/after metric tables (MAP, recall@K) for each correction step and a sensitivity analysis varying the age threshold in the revised benchmark-preparation section. revision: yes
Referee: Evaluation results: the statement that a simpler technique outperforms sophisticated ones “on most projects” requires the per-project breakdown (including number of projects, effect sizes, and statistical significance tests) to be shown; without these, the aggregate claim cannot be assessed for robustness.

Authors: We will expand the evaluation section to present the full per-project breakdown. The revision will report the exact number of projects, per-project performance values, effect sizes, and the results of statistical significance tests (e.g., paired Wilcoxon tests) so that the claim of outperformance “on most projects” can be directly verified. revision: yes
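
The paired test mentioned in the response can be illustrated with a small, dependency-free sketch of an exact two-sided Wilcoxon signed-rank test. It assumes one score per project for each of two techniques and enumerates every sign assignment, so it only suits a small number of projects. The function names and the sample scores are hypothetical and are not the authors' procedure or data.

```python
from itertools import product
from typing import List, Sequence, Tuple


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks of the values, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = mean_rank
        i = j + 1
    return ranks


def wilcoxon_signed_rank_exact(
    a: Sequence[float], b: Sequence[float]
) -> Tuple[float, float]:
    """Return (statistic, two-sided exact p-value) for paired samples a and b."""
    diffs = [x - y for x, y in zip(a, b) if x != y]  # zero differences are dropped
    if not diffs:
        return 0.0, 1.0
    ranks = average_ranks([abs(d) for d in diffs])
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    observed = min(w_plus, total - w_plus)
    extreme = 0
    # Under the null hypothesis every sign pattern is equally likely.
    for signs in product((0, 1), repeat=len(ranks)):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= observed:
            extreme += 1
    return observed, extreme / 2 ** len(ranks)


# Hypothetical integer scores for six projects; technique A is higher on all six.
scores_a = [6, 7, 8, 9, 10, 12]
scores_b = [5, 5, 5, 5, 5, 6]
statistic, p_value = wilcoxon_signed_rank_exact(scores_a, scores_b)
```

With only six paired observations, the smallest two-sided p-value this test can produce is 2/64, which is one reason effect sizes matter alongside significance for a benchmark with few projects.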

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation on external data

The paper performs an empirical study: it identifies biases (data age, issue-tracking system) via experiments on project data, constructs a benchmark accordingly, then measures accuracy of existing DBRD techniques (research and industry) on that benchmark. No derivation chain, equations, fitted parameters renamed as predictions, self-definitional steps, or load-bearing self-citations appear; the central claims are direct observations from performance metrics on real bug-report corpora. The structure is self-contained against external benchmarks and does not reduce any result to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work is an empirical benchmarking study and introduces no new free parameters, ad-hoc axioms, or invented entities beyond standard assumptions of information-retrieval evaluation.

axioms (1)

Domain assumption: Standard information-retrieval metrics (precision, recall, etc.) and similarity measures are appropriate for evaluating DBRD techniques.
The abstract relies on these metrics to declare one technique superior without additional justification.


[1] [1]

2022. GraphQL. https://docs.github.com/en/graphql. (Accessed on 02/10/2022)

work page 2022

[2] [2]

2022. hadoop. https://issues.apache.org/jira/projects/HADOOP/issues. (Accessed on 02/10/2022)

work page 2022

[3] [3]

2022. Jira. https://www.atlassian.com/software/jira. (Accessed on 02/10/2022)

work page 2022

[4] [4]

2022. kibana. https://github.com/elastic/kibana. (Accessed on 02/10/2022)

work page 2022

[5] [5]

2022. spark. https://issues.apache.org/jira/projects/SPARK/issues. (Accessed on 02/10/2022)

work page 2022

[6] [6]

vscode repository on GitHub

2022. vscode repository on GitHub. https://github.com/microsoft/vscode. (Accessed on 02/10/2022)

work page 2022

[7] [7]

vscodebot on GitHub

2022. vscodebot on GitHub. https://github.com/apps/vscodebot. (Accessed on 02/10/2022)

work page 2022

[8] [8]

Anahita Alipour, Abram Hindle, and Eleni Stroulia. 2013. A contextual approach towards more accurate duplicate bug report detection. In 2013 10th Working Conference on Mining Software Repositories (MSR) . IEEE, 183–192

work page 2013

[9] [9]

Mehdi Amoui, Nilam Kaushik, Abraham Al-Dabbagh, Ladan Tahvildari, Shimin Li, and Weining Liu. 2013. Search-based duplicate defect detection: an industrial experience. In 2013 10th Working Conference on Mining Software Repositories (MSR). IEEE, 173–182

work page 2013

[10] [10]

Jude Arokiam and Jeremy S Bradbury. 2020. Automatically predicting bug severity early in the development process. In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering: New Ideas and Emerging Results . 17–20. J. ACM, Vol. 37, No. 4, Article 111. Publication date: August 2018. 111:30 Zhang et al

work page 2020

[11] [11]

Atlassian. 2022. Open Source Project License Request | Atlassian. https://www.atlassian.com/software/views/open- source-license-request. (Accessed on 02/10/2022)

work page 2022

[12] [12]

Aladdin Baarah, Ahmad Aloqaily, Zaher Salah, Mannam Zamzeer, and Mohammad Sallam. 2019. Machine learning approaches for predicting the severity level of software bug reports in closed source projects. International Journal of Advanced Computer Science and Applications 10, 10.14569 (2019)

work page 2019

[13] [13]

Victor R Basili. 2007. The role of controlled experiments in software engineering research. In Empirical Software Engineering Issues. Critical Assessment and Future Directions . Springer, 33–37

work page 2007

[14] [14]

Nicolas Bettenburg, Rahul Premraj, Thomas Zimmermann, and Sunghun Kim. 2008. Duplicate bug reports considered harmful. . . really?. In2008 IEEE International Conference on Software Maintenance . IEEE, 337–345

work page 2008

[15] [15]

Jake D Brutlag, Hilary Hutchinson, and Maria Stone. 2008. User preference and search engine latency. (2008)

work page 2008

[16] [16]

Bugzilla. 2022. Eclipse. https://bugs.eclipse.org/bugs/. (Accessed on 02/10/2022)

[17]

Bugzilla. 2022. Mozilla. https://bugzilla.mozilla.org/home. (Accessed on 02/10/2022)

[18]

Jayati Deshmukh, KM Annervaz, Sanjay Podder, Shubhashis Sengupta, and Neville Dubash. 2017. Towards accurate duplicate bug retrieval using deep learning techniques. In 2017 IEEE International conference on software maintenance and evolution (ICSME). IEEE, 115–124

[19]

Bugzilla Wiki FAQ. 2022. How to Mark a Bug Report as a Duplicate? https://wiki.documentfoundation.org/QA/Bugzilla/FAQ#How_to_Mark_a_Bug_Report_as_a_Duplicate. (Accessed on 02/10/2022)

[20]

Tao-yang Fu, Wang-Chien Lee, and Zhen Lei. 2017. Hin2vec: Explore meta-paths in heterogeneous information networks for representation learning. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. 1797–1806

[21]

Wei Fu and Tim Menzies. 2017. Easy over hard: A case study on deep learning. In Proceedings of the 2017 11th joint meeting on foundations of software engineering. 49–60

[22]

Vahid Garousi, Michael Felderer, Mika V. Mäntylä, and Austen Rainer. 2020. Benefitting from the Grey Literature in Software Engineering Research. Springer International Publishing, Cham, 385–413. https://doi.org/10.1007/978-3-030-32489-6_14

[23]

Edmund A Gehan. 1965. A generalized Wilcoxon test for comparing arbitrarily singly-censored samples. Biometrika 52, 1-2 (1965), 203–224

[24]

GitHub. [n. d.]. Marking issues or pull requests as a duplicate - GitHub Docs. https://docs.github.com/en/issues/tracking-your-work-with-issues/marking-issues-or-pull-requests-as-a-duplicate. (Accessed on 02/10/2022)

[25]

Mehdi Golzadeh, Alexandre Decan, Eleni Constantinou, and Tom Mens. 2021. Identifying bot activity in GitHub pull request and issue comments. In 2021 IEEE/ACM Third International Workshop on Bots in Software Engineering (BotSE). IEEE, 21–25

[26]

Georgios Gousios. 2013. The GHTorrent dataset and tool suite. In Proceedings of the 10th Working Conference on Mining Software Repositories (San Francisco, CA, USA) (MSR ’13). IEEE Press, Piscataway, NJ, USA, 233–236. http://dl.acm.org/citation.cfm?id=2487085.2487132

[27]

Jianjun He, Ling Xu, Meng Yan, Xin Xia, and Yan Lei. 2020. Duplicate bug report detection using dual-channel convolutional neural networks. In Proceedings of the 28th International Conference on Program Comprehension. 117–127

[28]

Nilam Kaushik and Ladan Tahvildari. 2012. A comparative study of the performance of IR models on duplicate bug detection. In 2012 16th European Conference on Software Maintenance and Reengineering. IEEE, 159–168

[29]

Misoo Kim and Eunseok Lee. 2021. Are datasets for information retrieval-based bug localization techniques trustworthy? Empirical Software Engineering 26, 3 (2021), 1–66

[30]

Pavneet Singh Kochhar, Xin Xia, David Lo, and Shanping Li. 2016. Practitioners’ expectations on automated fault localization. In Proceedings of the 25th International Symposium on Software Testing and Analysis. 165–176

[31]

Berfin Kucuk and Eray Tuzun. 2021. Characterizing duplicate bugs: An empirical analysis. In 2021 IEEE International Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 661–668

[32]

Alina Lazar, Sarah Ritchey, and Bonita Sharif. 2014. Generating duplicate bug datasets. In Proceedings of the 11th working conference on mining software repositories. 392–395

[33]

Joseph Lilleberg, Yun Zhu, and Yanqing Zhang. 2015. Support vector machines and word2vec for text classification with semantic features. In 2015 IEEE 14th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). IEEE, 136–140

[34]

Wenjie Liu, Shanshan Wang, Xin Chen, and He Jiang. 2018. Predicting the severity of bug reports based on feature selection. International Journal of Software Engineering and Knowledge Engineering 28, 04 (2018), 537–558

[35]

Pablo Loyola, Kugamoorthy Gajananan, and Fumiko Satoh. 2018. Bug localization by learning to rank and represent bug inducing changes. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 657–665

[36]

Tim Menzies, Suvodeep Majumder, Nikhila Balaji, Katie Brey, and Wei Fu. 2018. 500+ times faster than deep learning: (a case study exploring faster methods for text mining stackoverflow). In 2018 IEEE/ACM 15th International Conference on ...

[37]

Lloyd Montgomery, Clara Lüders, and Walid Maalej. 2022. An Alternative Issue Tracking Dataset of Public Jira Repositories. In 2022 IEEE/ACM 19th International Conference on Mining Software Repositories (MSR). IEEE, 73–77

[38]

Jakob Nielsen. 1994. Usability engineering. Morgan Kaufmann

[39]

Michael Pradel, Vijayaraghavan Murali, Rebecca Qian, Mateusz Machalica, Erik Meijer, and Satish Chandra. 2020. Scaffle: Bug Localization on Millions of Files. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis (Virtual Event, USA) (ISSTA 2020). Association for Computing Machinery, New York, NY, USA, 225–236. ht...

[40]

Mohamed Sami Rakha, Cor-Paul Bezemer, and Ahmed E Hassan. 2018. Revisiting the performance of automated approaches for the retrieval of duplicate reports in issue tracking systems that perform just-in-time duplicate retrieval. Empirical Software Engineering 23, 5 (2018), 2597–2621

[41]

Mohamed Sami Rakha, Weiyi Shang, and Ahmed E Hassan. 2016. Studying the needed effort for identifying duplicate issues. Empirical Software Engineering 21, 5 (2016), 1960–1989

[42]

Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. 42–49

[43]

Irving Muller Rodrigues, Daniel Aloise, Eraldo Rezende Fernandes, and Michel Dagenais. 2020. A Soft Alignment Model for Bug Deduplication. In Proceedings of the 17th International Conference on Mining Software Repositories. 43–53

[44]

Jeanine Romano, Jeffrey D Kromrey, Jesse Coraggio, and Jeff Skowronek. 2006. Appropriate statistics for ordinal level data: Should we really be using t-test and Cohen’s d for evaluating group differences on the NSSE and other surveys. In annual meeting of the Florida Association of Institutional Research, Vol. 13

[45]

Xin Rong. 2014. word2vec parameter learning explained. arXiv preprint arXiv:1411.2738 (2014)

[46]

Per Runeson, Magnus Alexandersson, and Oskar Nyholm. 2007. Detection of duplicate defect reports using natural language processing. In 29th International Conference on Software Engineering (ICSE’07). IEEE, 499–510

[47]

Marcos Salganicoff. 1997. Tolerating Concept and Sampling Shift in Lazy Learning Using Prediction Error Context Switching. Springer Netherlands, Dordrecht, 133–155. https://doi.org/10.1007/978-94-017-2053-3_5

[48]

Hinrich Schütze, Christopher D Manning, and Prabhakar Raghavan. 2008. Introduction to information retrieval. Vol. 39. Cambridge University Press Cambridge

[49]

Yanqi Su, Zhenchang Xing, Xin Peng, Xin Xia, Chong Wang, Xiwei Xu, and Liming Zhu. 2021. Reducing Bug Triaging Confusion by Learning from Mistakes with a Bug Tossing Knowledge Graph. In 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 191–202

[50]

Chengnian Sun, David Lo, Siau-Cheng Khoo, and Jing Jiang. 2011. Towards more accurate retrieval of duplicate bug reports. In 2011 26th IEEE/ACM International Conference on Automated Software Engineering (ASE 2011). IEEE, 253–262

[51]

Chengnian Sun, David Lo, Xiaoyin Wang, Jing Jiang, and Siau-Cheng Khoo. 2010. A discriminative model approach for accurate duplicate bug report retrieval. In Proceedings of the 32nd ACM/IEEE International Conference on Software Engineering-Volume 1. 45–54

[52]

Feifei Tu, Jiaxin Zhu, Qimu Zheng, and Minghui Zhou. 2018. Be Careful of When: An Empirical Study on Time-Related Misuse of Issue Tracking Data. In Proceedings of the 2018 26th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (Lake Buena Vista, FL, USA) (ESEC/FSE 2018). Association for ...

[53]

Xiaoyin Wang, Lu Zhang, Tao Xie, John Anvik, and Jiasu Sun. 2008. An approach to detecting duplicate bug reports using natural language and execution information. In Proceedings of the 30th international conference on Software engineering. 461–470

[54]

Cathrin Weiss, Rahul Premraj, Thomas Zimmermann, and Andreas Zeller. 2007. How long will it take to fix this bug?. In Fourth International Workshop on Mining Software Repositories (MSR’07: ICSE Workshops 2007). IEEE, 1–1

[55]

Eric W Weisstein. 2004. Bonferroni correction. https://mathworld.wolfram.com/ (2004)

[56]

Xin Xia, David Lo, Ming Wen, Emad Shihab, and Bo Zhou. 2014. An empirical study of bug report field reassignment. In 2014 Software Evolution Week-IEEE Conference on Software Maintenance, Reengineering, and Reverse Engineering (CSMR-WCRE). IEEE, 174–183

[57]

Guanping Xiao, Xiaoting Du, Yulei Sui, and Tao Yue. 2020. HINDBR: Heterogeneous information network based duplicate bug report prediction. In 2020 IEEE 31st International Symposium on Software Reliability Engineering (ISSRE). IEEE, 195–206

[58]

Qi Xie, Zhiyuan Wen, Jieming Zhu, Cuiyun Gao, and Zibin Zheng. 2018. Detecting duplicate bug reports with convolutional neural networks. In 2018 25th Asia-Pacific Software Engineering Conference (APSEC). IEEE, 416–425

[59]

Jifeng Xuan, He Jiang, Zhilei Ren, Jun Yan, and Zhongxuan Luo. 2017. Automatic bug triage using semi-supervised text classification. arXiv preprint arXiv:1704.04769 (2017)

[60]

Zhengran Zeng, Yuqun Zhang, Haotian Zhang, and Lingming Zhang. 2021. Deep just-in-time defect prediction: how far are we?. In Proceedings of the 30th ACM SIGSOFT International Symposium on Software Testing and Analysis. 427–438

[61]

Jian Zhou and Hongyu Zhang. 2012. Learning to rank duplicate bug reports. In Proceedings of the 21st ACM international conference on Information and knowledge management. 852–861

[62]

Indrė Žliobaitė, Mykola Pechenizkiy, and Joao Gama. 2016. An overview of concept drift applications. Big data analysis: new algorithms for a new society (2016), 91–114

[63]

Weiqin Zou, David Lo, Zhenyu Chen, Xin Xia, Yang Feng, and Baowen Xu. 2018. How practitioners perceive automated bug report management techniques. IEEE Transactions on Software Engineering 46, 8 (2018), 836–862