
arxiv: 2605.00447 · v1 · submitted 2026-05-01 · 💻 cs.SE

Think Harder and Don't Overlook Your Options: Revisiting Issue-Commit Linking with LLM-Assisted Retrieval

Pith reviewed 2026-05-09 19:08 UTC · model grok-4.3

classification 💻 cs.SE
keywords issue-commit linking, software traceability, dense retrieval, sparse retrieval, reranking, large language models, machine learning

The pith

Dense retrieval outperforms sparse methods for linking issues to commits, while traditional machine learning reranking beats large language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper evaluates automated ways to connect software issue reports with the code commits that fix them. It first tests retrieval methods to narrow down candidate commits from large histories and then applies reranking to improve accuracy. Dense retrieval techniques prove better at finding relevant commits than sparse ones, and blending the two raises recall. For the reranking step, conventional machine learning models deliver stronger results than large language model approaches. The work concludes that straightforward retrieval pipelines can manage large-scale traceability without the expense of LLMs.

Core claim

The paper compares sparse retrieval (BM25) with dense alternatives (SBERT, ANNOY) for shrinking the candidate set, then reranks the surviving candidates with traditional machine learning models, a cross-encoder, and LLMs including ChatGPT. It reports three findings: dense retrieval identifies relevant commits better than sparse retrieval, hybrid retrieval further improves recall, and traditional machine learning reranking outperforms LLM-based reranking.
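For readers unfamiliar with the sparse baseline, the following is a minimal sketch of BM25 scoring over tokenized commit messages. It assumes a common formulation (k1 = 1.5, b = 0.75, and a smoothed IDF); the paper's exact variant, parameters, and preprocessing are not stated in this summary, and the function name `bm25_scores` and the toy commit messages are hypothetical.

```python
import math
from collections import Counter

def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document against the query with a common BM25 formulation.

    Assumes docs is non-empty. k1 and b are conventional defaults,
    not values taken from the paper.
    """
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents each term appears.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF that stays non-negative for frequent terms.
            idf = math.log((n_docs - df[term] + 0.5) / (df[term] + 0.5) + 1.0)
            # Length normalization penalizes longer-than-average documents.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy commit messages (hypothetical), whitespace-tokenized.
docs = [
    "fix parser crash on empty input".split(),
    "update readme with build steps".split(),
    "refactor logging module".split(),
]
scores = bm25_scores("parser crash".split(), docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

Only the first commit shares terms with the query, so it is the only one with a non-zero score. This exact-term dependence is the usual motivation for comparing sparse scoring against dense embeddings.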

What carries the argument

A two-stage retrieve-then-rerank pipeline that first narrows the candidate commits with dense or sparse search and then reorders the remaining candidates using traditional machine learning models, a cross-encoder, or large language models.
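The shape of such a pipeline can be sketched as follows. This is an illustration under loud assumptions: token overlap stands in for the first-stage retriever and Jaccard similarity stands in for the reranker, and neither is a method from the paper. The names `retrieve`, `rerank`, and `link_issue` and the toy data are hypothetical.

```python
def tokens(text: str) -> set[str]:
    """Lowercase whitespace tokenization; a real system would do far more."""
    return set(text.lower().split())

def retrieve(issue: str, commits: dict[str, str], k: int) -> list[str]:
    """Stage 1 (stand-in): keep the k commits sharing the most tokens with the issue."""
    q = tokens(issue)
    ordered = sorted(commits, key=lambda cid: (-len(q & tokens(commits[cid])), cid))
    return ordered[:k]

def jaccard(a: str, b: str) -> float:
    """Stage 2 scorer (stand-in): token-set overlap divided by token-set union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0

def rerank(issue: str, candidates: list[str], commits: dict[str, str]) -> list[str]:
    """Stage 2: reorder only the retrieved candidates with the costlier scorer."""
    return sorted(candidates, key=lambda cid: (-jaccard(issue, commits[cid]), cid))

def link_issue(issue: str, commits: dict[str, str], k: int = 2) -> list[str]:
    """Run both stages and return commit ids, most likely link first."""
    return rerank(issue, retrieve(issue, commits, k), commits)

# Toy data (hypothetical).
commits = {
    "c1": "fix null pointer in parser when input is empty",
    "c2": "update readme",
    "c3": "fix parser crash",
}
issue = "parser crash on empty input"
first_stage = retrieve(issue, commits, k=2)
ranked = link_issue(issue, commits, k=2)
```

In this toy run the first stage keeps `c1` and `c3` and drops `c2`, and the second stage then moves `c3` ahead of `c1`. The point of the split is that the expensive scorer only ever sees the k retrieved candidates rather than the full commit history.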

If this is right

  • Dense retrieval methods identify relevant commits more effectively than sparse retrieval approaches.
  • Combining dense and sparse retrieval increases recall of true issue-commit links.
  • Traditional machine learning reranking achieves higher precision than LLM-based reranking.
  • Retrieval-based pipelines remain practical for large-scale issue-commit linking without LLMs.
  • Simpler models should be considered before using computationally expensive LLM approaches.
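This summary does not say how the paper combines its dense and sparse result lists. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than as the authors' method. The constant k = 60 is a conventional default, and the two input rankings are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists: each item earns 1 / (k + rank) per list it appears in."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda item: (-scores[item], item))

# Hypothetical top-3 lists from a sparse and a dense retriever.
sparse = ["c1", "c2", "c3"]
dense = ["c4", "c1", "c5"]
fused = reciprocal_rank_fusion([sparse, dense])
```

The fused list contains every commit found by either retriever, which is one way a hybrid can raise recall: a true link missed by one method can still enter the candidate set through the other.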

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could gain efficiency by embedding hybrid retrieval directly into issue tracking systems to speed up change understanding.
  • Further prompt engineering tailored to this task might reduce the observed gap between LLMs and traditional rerankers.
  • Resource-limited teams may prefer avoiding LLMs in reranking to maintain accuracy at lower computational cost.
  • The same staged pipeline could be adapted and tested for related traceability problems such as linking requirements to code.

Load-bearing premise

The selected datasets and metrics such as precision and recall represent real developer needs for issue-commit links, and the prompts for large language models were optimized equally across models.
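As a concrete reading of the metrics named above, here is a minimal sketch of precision and recall at a cutoff k for a single issue. The paper's exact metric definitions and cutoffs are not given in this summary, so the function `precision_recall_at_k` and the example ranking are hypothetical.

```python
def precision_recall_at_k(ranked: list[str], true_links: set[str], k: int) -> tuple[float, float]:
    """Return (precision, recall) over the top-k ranked commit ids for one issue."""
    top_k = ranked[:k]
    hits = sum(1 for cid in top_k if cid in true_links)
    precision = hits / len(top_k) if top_k else 0.0
    recall = hits / len(true_links) if true_links else 0.0
    return precision, recall

# Hypothetical ranking and ground truth for one issue.
ranked = ["c3", "c1", "c7", "c2"]
true_links = {"c1", "c2"}
p2, r2 = precision_recall_at_k(ranked, true_links, k=2)
p4, r4 = precision_recall_at_k(ranked, true_links, k=4)
```

Widening the cutoff from 2 to 4 recovers the second true link, so recall rises while precision stays flat in this example. That trade-off is why the retrieval stage is typically judged on recall and the reranking stage on precision.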

What would settle it

A new evaluation on different open-source project data where an LLM reranker reaches higher precision and recall than the best traditional machine learning reranker would disprove the main performance claims.

Figures

Figures reproduced from arXiv: 2605.00447 by Cole Morgan, Muhammad Asaduzzaman, Shaiful Chowdhury, Shaowei Wang.

Figure 1: An example of a retrieved commit that is similar
Figure 2: An example of an information retrieval pipeline.
Figure 3: Distribution Of True Links Captured by Time Frames on Our Datasets
Original abstract

Linking issue reports to the commits that resolve them is essential for software traceability, maintenance, and evolution. Accurate issue-commit links help developers to understand system changes and the rationale behind them. While numerous automated techniques have been proposed, ranging from heuristic and feature-based approaches to modern deep learning and large language model approaches, our goal is to evaluate these techniques to determine which are most effective and efficient. In this study, we revisit several established issue-commit link recovery techniques, including BTLink, EasyLink, FRLink, RCLinker, and Hybrid-Linker, and assess their performance for reranking issue-commit links. We first evaluate different retrieval methods (BM25, BM25L, SBERT-Semantic Search, ANNOY, LSH, HNSW) for their ability to efficiently retrieve relevant commits, reducing the candidate set that must be considered by more computationally expensive models. Using the best retrieval methods, we then investigate the reranking effectiveness of different machine learning-based techniques, including traditional machine learning models, a cross-encoder, and large language models (ChatGPT, Qwen, Gemma, Llama), to refine the reranking of candidate commits and improve precision. Finally, we compare the effectiveness of these techniques. Our results show that dense retrieval methods outperform sparse retrieval approaches in identifying relevant commits and that combining dense and sparse retrieval can improve recall. Additionally, we find that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches. Our results highlight that retrieval-based pipelines remain a practical and effective solution for large-scale issue-commit linking, and that simpler models should be carefully considered before adopting computationally expensive LLM-based approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript empirically evaluates retrieval methods (BM25, BM25L, SBERT, ANNOY, LSH, HNSW) and reranking techniques (traditional ML models, cross-encoder, and LLMs including ChatGPT, Qwen, Gemma, Llama) for recovering links between issue reports and resolving commits. It claims dense retrieval outperforms sparse retrieval, hybrid dense-sparse retrieval improves recall, and traditional ML reranking outperforms LLM-based reranking, concluding that simpler retrieval-based pipelines remain practical and that computationally expensive LLMs should be considered only after simpler alternatives.

Significance. If the comparative results hold under fair conditions, the work offers actionable guidance for software traceability tools by showing that established dense retrieval and supervised ML rerankers can deliver strong performance without the overhead of LLMs. This could help practitioners prioritize efficiency in large-scale repositories and temper enthusiasm for LLM adoption in link recovery tasks.

major comments (2)
  1. [Reranking experiments] Reranking experiments: the central claim that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches (ChatGPT, Qwen, Gemma, Llama) is load-bearing for the practical recommendation to prefer simpler models. However, the manuscript provides no details on the prompts used for the LLMs, whether they were zero-shot or few-shot, or any optimization steps, while the ML models appear to have been trained on labeled data. This raises the possibility that the observed gap reflects prompt design rather than inherent model capability.
  2. [Evaluation methodology] Evaluation methodology: the abstract and results report comparative precision/recall figures across retrieval and reranking methods but supply no information on the datasets (size, selection criteria, or public availability), statistical significance tests, or error bars/variance. Without these, the reliability of the dense-vs-sparse and ML-vs-LLM comparisons cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract lists specific retrieval methods (BM25, SBERT-Semantic Search, ANNOY, etc.) but does not indicate which ones were ultimately selected as 'best' before the reranking stage; a table or explicit statement would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We value the opportunity to clarify and strengthen our work based on these comments. Below, we address each major comment point by point.

  1. Referee: [Reranking experiments] Reranking experiments: the central claim that traditional machine learning-based reranking techniques achieve higher performance than LLM-based approaches (ChatGPT, Qwen, Gemma, Llama) is load-bearing for the practical recommendation to prefer simpler models. However, the manuscript provides no details on the prompts used for the LLMs, whether they were zero-shot or few-shot, or any optimization steps, while the ML models appear to have been trained on labeled data. This raises the possibility that the observed gap reflects prompt design rather than inherent model capability.

    Authors: We agree that providing details on the LLM prompting is essential for interpreting the results. In our experiments, we employed zero-shot prompts for all LLMs, using a standardized template that asked the model to assess the relevance of each candidate commit to the given issue report and output a ranked list. No few-shot examples were used, as this would have required project-specific examples and could introduce bias; we aimed for a generalizable approach. No additional optimization or fine-tuning was performed on the LLMs beyond their default configurations. The traditional ML rerankers were trained on the labeled issue-commit pairs from the training splits of our datasets using standard supervised learning. We acknowledge that advanced prompt engineering might narrow the performance gap, but our comparison reflects practical, straightforward application of these models. In the revised manuscript, we will include the exact prompt templates used, the model versions, and a discussion of this aspect in the methodology section. revision: yes

  2. Referee: [Evaluation methodology] Evaluation methodology: the abstract and results report comparative precision/recall figures across retrieval and reranking methods but supply no information on the datasets (size, selection criteria, or public availability), statistical significance tests, or error bars/variance. Without these, the reliability of the dense-vs-sparse and ML-vs-LLM comparisons cannot be assessed.

    Authors: We apologize for not making this information more prominent. The full manuscript details the datasets in the experimental setup section, including the specific software projects used (with sizes ranging from hundreds to thousands of issues and commits), selection criteria based on projects with available ground-truth links from prior studies, and links to public repositories. However, to improve clarity, we will revise the abstract to briefly mention the dataset characteristics and add a dedicated subsection on statistical analysis. We performed paired statistical significance tests (Wilcoxon signed-rank test) between methods, and the revised version will report p-values along with error bars representing standard deviation across the projects in the figures. This will allow better assessment of the reliability of our comparisons. revision: yes
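The paired test mentioned in the second response can be illustrated with a small standard-library sketch of the Wilcoxon signed-rank test. It computes an exact two-sided p-value by enumerating sign assignments, which is only practical for a small number of projects; in practice a library routine such as `scipy.stats.wilcoxon` would be the usual choice. The per-project scores below are invented placeholders and are not results from the paper.

```python
from itertools import product

def wilcoxon_signed_rank(x: list[float], y: list[float]) -> tuple[float, float]:
    """Return (W, exact two-sided p) for paired samples; zero differences are dropped."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Under the null hypothesis each rank is positive or negative with probability 1/2.
    total = sum(ranks)
    extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            extreme += 1
    return w, extreme / 2 ** n

# Placeholder per-project scores (percent) for two hypothetical methods.
method_a = [82, 75, 90, 68, 77, 85]
method_b = [70, 71, 80, 66, 69, 70]
w, p = wilcoxon_signed_rank(method_a, method_b)
```

With six projects and every difference favoring the same method, the smallest achievable two-sided p-value is 2/64, which shows how few projects it takes before a rank-based test loses resolution.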

Circularity Check

0 steps flagged

No circularity: pure empirical comparison on public benchmarks


The paper performs an empirical evaluation of existing retrieval methods (BM25 variants, dense retrievers) and reranking approaches (traditional ML, cross-encoders, LLMs) on standard issue-commit linking datasets. No derivations, first-principles predictions, fitted parameters renamed as outputs, or self-citation chains are present. All claims rest on direct experimental measurements of precision/recall against public benchmarks, with no equations or theoretical steps that reduce to inputs by construction. Self-citations to prior techniques (BTLink, FRLink, etc.) serve only as baselines and do not bear the load of any new result.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations, free parameters, axioms, or invented entities; the work is an empirical benchmarking study.


