Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval
Pith reviewed 2026-05-09 19:08 UTC · model grok-4.3
The pith
Dense retrieval outperforms sparse methods for linking issues to commits, while traditional machine learning reranking beats large language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper first compares sparse retrieval (BM25) with dense alternatives such as SBERT and ANNOY for shrinking the candidate set, then reranks the surviving candidates with traditional machine learning models, a cross-encoder, and LLMs including ChatGPT. It reports three findings: dense retrieval identifies relevant commits better than sparse retrieval, hybrid retrieval further improves recall, and traditional machine learning reranking outperforms LLM-based reranking.
What carries the argument
A two-stage retrieve-then-rerank pipeline that first narrows the candidate commits with dense or sparse search and then reorders the remaining candidates using machine learning models or large language models.
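As a rough illustration of the two-stage idea, the sketch below retrieves candidates with a plain Okapi BM25 scorer and then reorders them with a second scoring function. It is not the paper's implementation: the tokenizer, the parameter values `k1` and `b`, the toy commit messages, and the Jaccard-overlap `jaccard` reranker (a stand-in for a trained model) are all assumptions made for this example.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive whitespace tokenizer (assumption; real pipelines preprocess more).
    return text.lower().split()


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Okapi BM25 score of every document for the query (sparse retrieval)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    doc_freq = Counter(term for t in tokenized for term in set(t))
    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def retrieve(query, docs, top_k):
    """Stage 1: keep the indices of the top_k highest-scoring documents."""
    scores = bm25_scores(query, docs)
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return ranked[:top_k]


def rerank(query, docs, candidates, score_fn):
    """Stage 2: reorder only the retrieved candidates with a second scorer."""
    return sorted(candidates, key=lambda i: score_fn(query, docs[i]), reverse=True)


def jaccard(query, doc):
    # Hypothetical stand-in for a trained reranker: token-set overlap.
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q | d) if q | d else 0.0


commits = [
    "fix null pointer in login handler",
    "update readme badges",
    "refactor login session cache",
    "bump dependency versions",
]
issue = "login crashes with null pointer"
candidates = retrieve(issue, commits, top_k=2)
ranking = rerank(issue, commits, candidates, jaccard)
print(candidates, ranking)
```

The point of the split is cost: the cheap first stage touches every commit, while the more expensive second stage only sees the `top_k` survivors.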
If this is right
- Dense retrieval methods identify relevant commits more effectively than sparse retrieval approaches.
- Combining dense and sparse retrieval increases recall of true issue-commit links.
- Traditional machine learning reranking achieves higher performance than LLM-based reranking.
- Retrieval-based pipelines remain practical for large-scale issue-commit linking without LLMs.
- Simpler models should be considered before using computationally expensive LLM approaches.
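One common way to combine a sparse and a dense ranking is reciprocal rank fusion (RRF). The sketch below shows that technique on made-up commit IDs; whether the paper fuses rankings this way is not stated in this summary, so both the method and the constant `k=60` are assumptions chosen for illustration.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists: each item scores sum(1 / (k + rank))."""
    fused = {}
    for ranking in rankings:
        for rank, commit_id in enumerate(ranking, start=1):
            fused[commit_id] = fused.get(commit_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score (descending), breaking ties by commit ID.
    return sorted(fused, key=lambda c: (-fused[c], c))


# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse = ["c1", "c2", "c3"]
dense = ["c4", "c1", "c5"]
fused = reciprocal_rank_fusion([sparse, dense])
print(fused)
```

Because the fused list is the union of both inputs, a commit found by only one retriever still reaches the reranker, which is the mechanism by which a hybrid can raise recall.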
Where Pith is reading between the lines
- Developers could gain efficiency by embedding hybrid retrieval directly into issue tracking systems to speed up change understanding.
- Further prompt engineering tailored to this task might reduce the observed gap between LLMs and traditional rerankers.
- Resource-limited teams may prefer avoiding LLMs in reranking to maintain accuracy at lower computational cost.
- The same staged pipeline could be adapted and tested for related traceability problems such as linking requirements to code.
Load-bearing premise
The selected datasets and metrics such as precision and recall represent real developer needs for issue-commit links, and the prompts for large language models were optimized equally across models.
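For concreteness, precision and recall over a ranked list are often computed at a cutoff k. The sketch below assumes that cutoff-based form, which this summary does not confirm as the paper's exact metric definition; the commit IDs are invented.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked commits that are true links."""
    if k <= 0:
        return 0.0
    return sum(1 for c in ranked[:k] if c in relevant) / k


def recall_at_k(ranked, relevant, k):
    """Fraction of all true links that appear in the top-k ranked commits."""
    if not relevant:
        return 0.0
    return sum(1 for c in ranked[:k] if c in relevant) / len(relevant)


# Hypothetical ranking for one issue with two true resolving commits.
ranked = ["a", "b", "c", "d"]
relevant = {"a", "d"}
print(precision_at_k(ranked, relevant, 2), recall_at_k(ranked, relevant, 2))
```

Here one of the two true links sits in the top 2, so both values are 0.5; widening the cutoff to 4 lifts recall to 1.0 while precision stays at 0.5.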
What would settle it
A new evaluation on different open-source project data where an LLM reranker reaches higher precision and recall than the best traditional machine learning reranker would disprove the main performance claims.
Original abstract
Linking issue reports to the commits that resolve them is essential for software traceability, maintenance, and evolution. Accurate issue-commit links help developers to understand system changes and the rationale behind them. While numerous automated techniques have been proposed, ranging from heuristic and feature-based approaches to modern deep learning and large language model approaches, our goal is to evaluate these techniques to determine which are most effective and efficient. In this study, we revisit several established issue-commit link recovery techniques, including BTLink, EasyLink, FRLink, RCLinker, and Hybrid-Linker, and assess their performance for reranking issue-commit links. We first evaluate different retrieval methods (BM25, BM25L, SBERT-Semantic Search, ANNOY, LSH, HNSW) for their ability to efficiently retrieve relevant commits, reducing the candidate set that must be considered by more computationally expensive models. Using the best retrieval methods, we then investigate the reranking effectiveness of different machine learning-based techniques, including traditional machine learning models, a cross-encoder, and large language models (ChatGPT, Qwen, Gemma, Llama), to refine the reranking of candidate commits and improve precision. Finally, we compare the effectiveness of these techniques. Our results show that dense retrieval methods outperform sparse retrieval approaches in identifying relevant commits and that combining dense and sparse retrieval can improve recall. Additionally, we find that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches. Our results highlight that retrieval-based pipelines remain a practical and effective solution for large-scale issue-commit linking, and that simpler models should be carefully considered before adopting computationally expensive LLM-based approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript empirically evaluates retrieval methods (BM25, BM25L, SBERT, ANNOY, LSH, HNSW) and reranking techniques (traditional ML models, cross-encoder, and LLMs including ChatGPT, Qwen, Gemma, Llama) for recovering links between issue reports and resolving commits. It claims dense retrieval outperforms sparse retrieval, hybrid dense-sparse retrieval improves recall, and traditional ML reranking outperforms LLM-based reranking, concluding that simpler retrieval-based pipelines remain practical and that computationally expensive LLMs should be considered only after simpler alternatives.
Significance. If the comparative results hold under fair conditions, the work offers actionable guidance for software traceability tools by showing that established dense retrieval and supervised ML rerankers can deliver strong performance without the overhead of LLMs. This could help practitioners prioritize efficiency in large-scale repositories and temper enthusiasm for LLM adoption in link recovery tasks.
major comments (2)
- [Reranking experiments] Reranking experiments: the central claim that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches (ChatGPT, Qwen, Gemma, Llama) is load-bearing for the practical recommendation to prefer simpler models. However, the manuscript provides no details on the prompts used for the LLMs, whether they were zero-shot or few-shot, or any optimization steps, while the ML models appear to have been trained on labeled data. This raises the possibility that the observed gap reflects prompt design rather than inherent model capability.
- [Evaluation methodology] Evaluation methodology: the abstract and results report comparative precision/recall figures across retrieval and reranking methods but supply no information on the datasets (size, selection criteria, or public availability), statistical significance tests, or error bars/variance. Without these, the reliability of the dense-vs-sparse and ML-vs-LLM comparisons cannot be assessed.
minor comments (1)
- [Abstract] The abstract lists specific retrieval methods (BM25, SBERT-Semantic Search, ANNOY, etc.) but does not indicate which ones were ultimately selected as 'best' before the reranking stage; a table or explicit statement would improve clarity.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We value the opportunity to clarify and strengthen our work based on these comments. Below, we address each major comment point by point.
Referee: [Reranking experiments] Reranking experiments: the central claim that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches (ChatGPT, Qwen, Gemma, Llama) is load-bearing for the practical recommendation to prefer simpler models. However, the manuscript provides no details on the prompts used for the LLMs, whether they were zero-shot or few-shot, or any optimization steps, while the ML models appear to have been trained on labeled data. This raises the possibility that the observed gap reflects prompt design rather than inherent model capability.
Authors: We agree that providing details on the LLM prompting is essential for interpreting the results. In our experiments, we employed zero-shot prompts for all LLMs, using a standardized template that asked the model to assess the relevance of each candidate commit to the given issue report and output a ranked list. No few-shot examples were used, as this would have required project-specific examples and could introduce bias; we aimed for a generalizable approach. No additional optimization or fine-tuning was performed on the LLMs beyond their default configurations. The traditional ML rerankers were trained on the labeled issue-commit pairs from the training splits of our datasets using standard supervised learning. We acknowledge that advanced prompt engineering might narrow the performance gap, but our comparison reflects practical, straightforward application of these models. In the revised manuscript, we will include the exact prompt templates used, the model versions, and a discussion of this aspect in the methodology section. revision: yes
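To make the zero-shot setup described in this response concrete, here is a minimal sketch of a prompt builder and a reply parser. The template wording, the function names `build_zero_shot_prompt` and `parse_ranking`, and the comma-separated answer format are all hypothetical; the exact templates are only promised for the revised manuscript and are not reproduced here.

```python
import re

# Hypothetical template: NOT the authors' prompt, only an illustration.
TEMPLATE = (
    "You are given an issue report and a numbered list of candidate commits.\n"
    "Rank the commits from most to least likely to resolve the issue.\n"
    "Answer with the commit numbers only, separated by commas.\n\n"
    "Issue:\n{issue}\n\nCandidate commits:\n{commits}\n"
)


def build_zero_shot_prompt(issue, commit_messages):
    """Fill the template with the issue text and 1-based numbered commits."""
    numbered = "\n".join(
        f"[{i}] {msg}" for i, msg in enumerate(commit_messages, start=1)
    )
    return TEMPLATE.format(issue=issue, commits=numbered)


def parse_ranking(reply, n_candidates):
    """Turn a free-text reply into 0-based candidate indices.

    Out-of-range and repeated numbers are ignored; candidates the model
    did not mention are appended in their original order.
    """
    order = []
    for token in re.findall(r"\d+", reply):
        idx = int(token) - 1
        if 0 <= idx < n_candidates and idx not in order:
            order.append(idx)
    order += [i for i in range(n_candidates) if i not in order]
    return order


prompt = build_zero_shot_prompt(
    "Login crashes", ["fix login", "update docs", "bump deps"]
)
print(prompt)
print(parse_ranking("2, 1", 3))
```

The parser is deliberately tolerant, since a zero-shot model may return a partial or noisy list; how such replies were handled in the actual experiments is not described in this response.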
Referee: [Evaluation methodology] Evaluation methodology: the abstract and results report comparative precision/recall figures across retrieval and reranking methods but supply no information on the datasets (size, selection criteria, or public availability), statistical significance tests, or error bars/variance. Without these, the reliability of the dense-vs-sparse and ML-vs-LLM comparisons cannot be assessed.
Authors: We apologize for not making this information more prominent. The full manuscript details the datasets in the experimental setup section, including the specific software projects used (with sizes ranging from hundreds to thousands of issues and commits), selection criteria based on projects with available ground-truth links from prior studies, and links to public repositories. However, to improve clarity, we will revise the abstract to briefly mention the dataset characteristics and add a dedicated subsection on statistical analysis. We performed paired statistical significance tests (Wilcoxon signed-rank test) between methods, and the revised version will report p-values along with error bars representing standard deviation across the projects in the figures. This will allow better assessment of the reliability of our comparisons. revision: yes
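The paired test mentioned in this response can be sketched in pure Python as follows. This is a simplified Wilcoxon signed-rank test under stated assumptions: zero differences are dropped, tied absolute differences receive average ranks, and the two-sided p-value uses a normal approximation without tie or continuity correction, which is coarse for a handful of projects. The per-project scores are invented placeholders, not results from the paper.

```python
import math


def wilcoxon_signed_rank(x, y):
    """Return (W+, W-, approximate two-sided p-value) for paired samples."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give it the mean rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mean = n * (n + 1) / 4
    sd = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mean) / sd  # z <= 0 because w is the smaller rank sum
    p_value = min(1.0, 1 + math.erf(z / math.sqrt(2)))  # equals 2 * Phi(z)
    return w_plus, w_minus, p_value


# Made-up per-project scores for two hypothetical methods.
method_a = [0.80, 0.75, 0.90, 0.60, 0.85]
method_b = [0.70, 0.55, 0.60, 0.65, 0.45]
w_plus, w_minus, p_value = wilcoxon_signed_rank(method_a, method_b)
print(w_plus, w_minus, round(p_value, 3))
```

In practice a library routine with exact small-sample handling would be preferable; the hand-rolled version is only meant to show what the test compares, namely the signed ranks of per-project differences.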
Circularity Check
No circularity: pure empirical comparison on public benchmarks
The paper performs an empirical evaluation of existing retrieval methods (BM25 variants, dense retrievers) and reranking approaches (traditional ML, cross-encoders, LLMs) on standard issue-commit linking datasets. No derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains are present. All claims rest on direct experimental measurements of precision/recall against public benchmarks, with no equations or theoretical steps that reduce to inputs by construction. Self-citations to prior techniques (BTLink, FRLink, etc.) serve only as baselines and do not bear the load of any new result.