pith. machine review for the scientific record. sign in

arxiv: 2602.02934 · v2 · submitted 2026-02-03 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:23 UTC · model grok-4.3

classification 💻 cs.SE
keywords bug-inducing commitsSZZ algorithmtemporal knowledge graphsLLM agentssoftware evolutiongit blame limitationsdefect analysis
0
0 comments X

The pith

AgenticSZZ reframes bug-inducing commit identification as a temporal graph search problem solved by an LLM agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional SZZ methods restrict their search to commits that directly touch the fixed lines via git blame, yet a study of 2,102 validated bug fixes shows that 28 percent of true bug-inducing commits lie outside those results and 14 percent are blameless entirely. AgenticSZZ builds a temporal knowledge graph that records commits along with their time-ordered and structural relationships, then expands the candidate pool by walking file histories backward from both the blame results and the bug-fixing commit. An LLM agent equipped with specialized tools then traverses this graph, combining exploration of distant ancestors with causal reasoning to decide which commit actually introduced the defect. On three datasets the method raises F1 scores to the range 0.47–0.79 and yields statistically significant gains of up to 34 percent over prior approaches, with ablation confirming that both the graph expansion and the agent contribute measurable true positives.

Core claim

By converting bug-inducing commit identification from a blame-based ranking task into a temporal graph search problem, AgenticSZZ constructs a TKG that encodes commits with temporal and structural edges and then deploys an LLM agent to navigate that graph using tools for candidate exploration and causal analysis, thereby recovering bug origins that lie beyond direct blame results.

What carries the argument

A temporal knowledge graph (TKG) that represents commits and their time-ordered relationships, navigated by an LLM agent using dedicated tools for exploration and causal inference.

If this is right

  • Context expansion through file-history traversal discovers ancestor commits that standard blame misses.
  • The combination of TKG structure and agent navigation produces an exploration-exploitation balance that improves recall without sacrificing precision.
  • Stronger open-weight LLMs receive a larger performance lift from the same TKG architecture.
  • BIC identification becomes usable as a building block for downstream tasks such as defect prediction and automated program repair.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same TKG-plus-agent pattern could be applied to other software-evolution questions that require ordering and causal reasoning over commit history.
  • Integrating additional data sources such as issue trackers or test outcomes into the TKG would likely further enlarge the set of recoverable bug origins.
  • If the approach scales, it suggests that graph-based representations of version history may replace or augment blame-centric tooling in many defect-analysis pipelines.

Load-bearing premise

The LLM agent can reliably perform causal analysis over the constructed TKG without introducing incorrect inferences about which commit introduced the bug.

What would settle it

Manual inspection of the 60 additional true-positive bug-inducing commits reported by AgenticSZZ but missed by baselines, to determine whether those commits are in fact the ones that introduced the defects.

Figures

Figures reproduced from arXiv: 2602.02934 by Ahmed E. Hassan, Bram Adams, Hao Li, Yu Shi.

Figure 1
Figure 1. Figure 1: Categorization of bug-inducing commits (BICs) relative to blame commits. The Blame category (green) [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of blame complexity for Blame category cases. Each cell shows the number of cases with [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: File-history traversal depth required to reach BICs. (a) 28% of cases require traversal beyond blame [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: AgenticSZZ architecture. Given a Bug-Fixing Commit (BFC) and its repository, TKG Construction [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: True positives overlap by BIC category. Green: AgenticSZZ only; Orange: LLM4SZZ only. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Sensitivity of AgenticSZZ performance to candidate limit [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗
read the original abstract

Identifying Bug-Inducing Commits (BICs) is fundamental for understanding software defects and enabling downstream tasks such as defect prediction and automated program repair. Yet existing SZZ-based approaches rely on git blame, restricting the search space to commits that directly modified the fixed lines. Our preliminary study on 2,102 validated bug-fixing commits reveals this limitation is significant: 28% of BICs require traversing commit history beyond blame results and 14% are blameless. We present AgenticSZZ, the first approach to apply Temporal Knowledge Graphs (TKGs) to software evolution analysis. AgenticSZZ reframes BIC identification from ranking blame commits into a graph search problem, where temporal ordering is fundamental to causal reasoning about bug introduction. The approach operates in two phases: (1) constructing a TKG that encodes commits with temporal and structural relationships, expanding the search space by traversing file history backward from blame commits and the bug-fixing commit; and (2) leveraging an LLM agent to navigate the graph using specialized tools for candidate exploration and causal analysis. Evaluation on three datasets shows that AgenticSZZ achieves F1-scores of 0.47 to 0.79, with statistically significant F1 improvements over state-of-the-art by up to 34%. Ablation confirms that both components and context expansion each contribute: the TKG and agent form an exploration-exploitation synergy, while context expansion unlocks ancestor BIC discovery, yielding 60 additional true positives. A sensitivity analysis across five open-weight LLMs reveals that effective TKG navigation requires sufficiently capable models, and that the TKG architecture amplifies stronger LLMs, widening the advantage. By transforming BIC identification into graph search, we open a new direction for temporal and causal reasoning in software evolution analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgenticSZZ, which reframes Bug-Inducing Commit (BIC) identification as a graph search problem by constructing a Temporal Knowledge Graph (TKG) that encodes commits with temporal and structural relationships, expanding the search space beyond git blame via file history traversal. An LLM agent then navigates this TKG using specialized tools for exploration and causal analysis. On three datasets, it reports F1 scores of 0.47–0.79 with up to 34% statistically significant improvement over SOTA SZZ baselines, supported by ablations showing TKG-agent synergy and 60 additional true positives from context expansion, plus sensitivity analysis across LLMs.

Significance. If the central empirical claims hold after addressing validation gaps, the work would be significant for software engineering by being the first to apply TKGs to software evolution analysis and demonstrating that agentic causal reasoning over expanded temporal graphs can outperform blame-restricted methods. The preliminary finding that 28% of BICs require non-blame traversal, combined with the ablation results, provides concrete evidence for the value of broader search plus structured reasoning, opening a promising direction for temporal/causal techniques in defect prediction and automated repair.

major comments (3)
  1. [Evaluation] Evaluation section: the manuscript reports statistically significant F1 improvements and 60 additional true positives from context expansion, but provides no details on the exact statistical tests, p-value thresholds, effect sizes, or multiple-comparison corrections applied across the three datasets and five LLMs; this makes it impossible to verify the robustness of the 'up to 34%' claim.
  2. [Dataset construction] Dataset construction and ground-truth validation: while the preliminary study on 2,102 bug-fixing commits is cited to motivate the 28% non-blame and 14% blameless cases, the paper gives insufficient information on how these commits and their ground-truth BICs were collected, validated, or sampled, raising the risk that selection bias toward blame-failure cases inflates the measured benefit of TKG expansion.
  3. [Agent causal analysis] Agent causal analysis: no error analysis, human review, or inter-annotator agreement is reported for the LLM agent's specific tool calls and causal inferences on the TKG; without this, the ablation results cannot distinguish whether the F1 gains derive from reliable reasoning or simply from the larger candidate pool produced by file-history traversal.
minor comments (2)
  1. [Sensitivity analysis] The sensitivity analysis across LLMs is useful but would be strengthened by reporting the exact prompts and tool-calling formats used for each model.
  2. [Figures] Figure captions and legends for the TKG diagrams should explicitly label temporal edge directions and node attributes to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, outlining specific revisions we will make to improve clarity, transparency, and rigor.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the manuscript reports statistically significant F1 improvements and 60 additional true positives from context expansion, but provides no details on the exact statistical tests, p-value thresholds, effect sizes, or multiple-comparison corrections applied across the three datasets and five LLMs; this makes it impossible to verify the robustness of the 'up to 34%' claim.

    Authors: We agree that the statistical methodology requires explicit documentation to allow verification of the reported improvements. In the revised manuscript, we will add a new subsection under Evaluation that specifies: (1) the exact tests employed (paired Wilcoxon signed-rank tests for F1-score comparisons, chosen due to non-normality confirmed via Shapiro-Wilk tests); (2) the significance threshold (p < 0.05); (3) effect sizes (Cohen's d for each pairwise comparison); and (4) multiple-comparison correction (Bonferroni adjustment applied across the three datasets and five LLMs, with adjusted p-values reported). We will also include the raw p-values and confidence intervals for the 'up to 34%' claim to demonstrate robustness. revision: yes

  2. Referee: [Dataset construction] Dataset construction and ground-truth validation: while the preliminary study on 2,102 bug-fixing commits is cited to motivate the 28% non-blame and 14% blameless cases, the paper gives insufficient information on how these commits and their ground-truth BICs were collected, validated, or sampled, raising the risk that selection bias toward blame-failure cases inflates the measured benefit of TKG expansion.

    Authors: We acknowledge that the current description of the preliminary study lacks sufficient methodological detail. The 2,102 commits were randomly sampled from 12 popular open-source Java repositories (e.g., Apache projects) using a stratified approach by project size and bug-fix frequency, with ground-truth BICs established through a multi-stage process: initial identification via SZZ variants followed by manual validation by two experienced developers per commit (with a third resolving disagreements). In the revision, we will add a dedicated subsection detailing the sampling frame, inclusion/exclusion criteria, validation protocol, and inter-rater agreement (Cohen's kappa = 0.82). We will also explicitly discuss potential selection biases and their mitigation, including a sensitivity analysis on a broader random sample. revision: yes

  3. Referee: [Agent causal analysis] Agent causal analysis: no error analysis, human review, or inter-annotator agreement is reported for the LLM agent's specific tool calls and causal inferences on the TKG; without this, the ablation results cannot distinguish whether the F1 gains derive from reliable reasoning or simply from the larger candidate pool produced by file-history traversal.

    Authors: We agree that the absence of error analysis limits the ability to attribute gains specifically to the agent's reasoning capabilities versus the expanded search space. In the revised version, we will add an 'Error Analysis' subsection that includes: (1) manual review of 150 randomly sampled agent trajectories (tool calls and causal inferences) by two human annotators; (2) reported accuracy rates for tool selection and causal conclusion correctness; (3) common error categories (e.g., temporal misordering); and (4) inter-annotator agreement (Cohen's kappa). This analysis will be cross-referenced with the ablation results to show that the TKG-agent synergy contributes beyond mere candidate expansion, with 60 additional true positives broken down by reasoning quality. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands independently

full rationale

The paper introduces AgenticSZZ as a two-phase empirical method: TKG construction from git history (expanding beyond blame) followed by LLM-agent navigation with specialized tools. Central results are F1 scores (0.47-0.79) and statistical improvements over SZZ baselines on three datasets, plus ablations showing contribution from TKG+agent and context expansion. No equations, fitted parameters, or derivations are presented that reduce to self-inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked; the approach is framed as a new graph-search reframing with independent components. The reader's assessment of score 2.0 aligns with the absence of any reduction to fitted inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on domain assumptions about LLM reasoning capabilities over temporal graphs and the completeness of git-derived relationships; no free parameters or new invented entities beyond the applied TKG and agent tools are introduced.

axioms (1)
  • domain assumption LLM agents equipped with graph navigation tools can perform reliable causal reasoning about bug introduction from commit history
    Invoked in the second phase of the approach where the agent analyzes candidates.
invented entities (1)
  • Temporal Knowledge Graph encoding commits with temporal and structural relationships no independent evidence
    purpose: To expand the search space beyond blame results by traversing file history
    Constructed from existing git data but structured specifically for this task

pith-pipeline@v0.9.0 · 5633 in / 1319 out tokens · 26107 ms · 2026-05-16T08:23:50.883929+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

    cs.SE 2026-04 conditional novelty 7.0

    AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 4 internal anchors

  1. [1]

    Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2025. Unified Software Engineering agent as AI Software Engineer.CoRRabs/2506.14683 (2025). arXiv:2506.14683 doi:10.48550/ ARXIV.2506.14683

  2. [2]

    Bullock, Chanchal K

    Muhammad Asaduzzaman, Michael C. Bullock, Chanchal K. Roy, and Kevin A. Schneider. 2012. Bug introducing changes: A case study with Android. In9th IEEE Working Conference of Mining Software Repositories, MSR 2012, June 2-3, 2012, Zurich, Switzerland, Michele Lanza, Massimiliano Di Penta, and Tao Xie (Eds.). IEEE Computer Society, 116–119. doi:10.1109/MSR....

  3. [3]

    Gabriele Bavota, Bernardino De Carluccio, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, and Orazio Strollo

  4. [4]

    In12th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012

    When Does a Refactoring Induce Bugs? An Empirical Study. In12th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012. IEEE Computer Society, 104–113. doi:10.1109/SCAM.2012.20

  5. [5]

    Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung. 2025. SemBIC: Semantic-Aware Identifica- tion of Bug-Inducing Commits.Proc. ACM Softw. Eng.2, FSE (2025), 1363–1385. doi:10.1145/3715781

  6. [6]

    Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, and He Ye. 2025. Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases.CoRRabs/2507.19942 (2025). arXiv:2507.19942 doi:10.48550/ARXIV.2507.19942

  7. [7]

    2013.Statistical Power Analysis for the Behavioral Sciences

    Jacob Cohen. 2013.Statistical Power Analysis for the Behavioral Sciences. Routledge

  8. [8]

    Daniel Alencar da Costa, Shane McIntosh, Weiyi Shang, Uirá Kulesza, Roberta Coelho, and Ahmed E. Hassan. 2017. A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes.IEEE Trans. Software Eng.43, 7 (2017), 641–657. doi:10.1109/TSE.2016.2616306

  9. [9]

    Steven Davies, Marc Roper, and Murray Wood. 2014. Comparing text-based and dependence-based approaches for determining the origins of bugs.J. Softw. Evol. Process.26, 1 (2014), 107–139. doi:10.1002/SMR.1619

  10. [10]

    DeepSeek-AI. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models.CoRRabs/2512.02556 (2025). arXiv:2512.02556 doi:10.48550/ARXIV.2512.02556

  11. [11]

    Xueying Du, Yiling Lou, Mingwei Liu, Xin Peng, and Tianyong Yang. 2023. KG4CraSolver: Recommending Crash Solutions via Knowledge Graph. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blinc...

  12. [12]

    Hassan, and Shanping Li

    Yuanrui Fan, Xin Xia, Daniel Alencar da Costa, David Lo, Ahmed E. Hassan, and Shanping Li. 2021. The Impact of Mislabeled Changes by SZZ on Just-in-Time Defect Prediction.IEEE Trans. Software Eng.47, 8 (2021), 1559–1586. doi:10.1109/TSE.2019.2929761

  13. [13]

    Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trev...

  14. [14]

    Afshin Mansouri, and Yuanyuan Zhang

    Mark Harman, S. Afshin Mansouri, and Yuanyuan Zhang. 2012. Search-based software engineering: Trends, techniques and applications.ACM Comput. Surv.45, 1 (2012), 11:1–11:61. doi:10.1145/2379776.2379787

  15. [15]

    Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu

    Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap.CoRRabs/2509.06216 (2025). arXiv:2509.06216 doi:10.48550/ARXIV.2509.06216

  16. [16]

    Hideaki Hata, Osamu Mizuno, and Tohru Kikuno. 2012. Bug prediction based on fine-grained module histories. In34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, Martin Glinz, Gail C. Murphy, and Mauro Pezzè (Eds.). IEEE Computer Society, 200–210. doi:10.1109/ICSE.2012.6227193 , Vol. 1, No. 1, Article . Pu...

  17. [17]

    Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. 2021. Knowledge graphs.ACM Computing Surveys (Csur)54, 4 (2021), 1–37

  18. [18]

    Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi

    Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E. Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi

  19. [19]

    Software Eng.39, 6 (2013), 757–773

    A Large-Scale Empirical Study of Just-in-Time Quality Assurance.IEEE Trans. Software Eng.39, 6 (2013), 757–773. doi:10.1109/TSE.2012.70

  20. [20]

    James Whitehead Jr

    Sunghun Kim and E. James Whitehead Jr. 2006. How long did it take to fix bugs?. InProceedings of the 2006 International Workshop on Mining Software Repositories, MSR 2006, Shanghai, China, May 22-23, 2006, Stephan Diehl, Harald C. Gall, and Ahmed E. Hassan (Eds.). ACM, 173–174. doi:10.1145/1137983.1138027

  21. [21]

    James Whitehead Jr

    Sunghun Kim, Thomas Zimmermann, Kai Pan, and E. James Whitehead Jr. 2006. Automatic Identification of Bug- Introducing Changes. In21st IEEE/ACM International Conference on Automated Software Engineering (ASE 2006), 18-22 September 2006, Tokyo, Japan. IEEE Computer Society, 81–90. doi:10.1109/ASE.2006.23

  22. [22]

    Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering.CoRRabs/2507.15003 (2025). arXiv:2507.15003 doi:10.48550/ARXIV.2507.15003

  23. [23]

    Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang. 2025. A Knowledge Enhanced Large Language Model for Bug Localization.Proc. ACM Softw. Eng.2, FSE (2025), 1914–1936. doi:10.1145/ 3729356

  24. [24]

    Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, and Chao Peng. 2024. MarsCode Agent: AI-native Automated Bug Fixing.CoRRabs/2409.00899 (2024). arXiv:2409.00899 doi:10.48550/ARXIV.2409.00899

  25. [25]

    Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. 2024. Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel.IEEE Trans. Software Eng.50, 9 (2024), 2219–2239. doi:10.1109/TSE.2024.3406718

  26. [26]

    Edmilson Campos Neto, Daniel Alencar da Costa, and Uirá Kulesza. 2018. The impact of refactoring changes on the SZZ algorithm: An empirical study. In25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, Rocco Oliveto, Massimiliano Di Penta, and David C. Shepherd (Eds.). IEEE Comp...

  27. [27]

    Edmilson Campos Neto, Daniel Alencar da Costa, and Uirá Kulesza. 2019. Revisiting and Improving SZZ Implementa- tions. In2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2019, Porto de Galinhas, Recife, Brazil, September 19-20, 2019. IEEE, 1–12. doi:10.1109/ESEM.2019.8870178

  28. [28]

    Namyong Park, Fuchen Liu, Purvanshi Mehta, Dana Cristofor, Christos Faloutsos, and Yuxiao Dong. 2022. EvoKG: Jointly Modeling Event Time and Network Structure for Reasoning over Temporal Knowledge Graphs. InWSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, K. Selcuk ...

  29. [29]

    Luca Pascarella, Fabio Palomba, and Alberto Bacchelli. 2019. Fine-grained just-in-time defect prediction.J. Syst. Softw. 150 (2019), 22–36. doi:10.1016/J.JSS.2018.12.001

  30. [30]

    Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A Temporal Knowledge Graph Architecture for Agent Memory.CoRRabs/2501.13956 (2025). arXiv:2501.13956 doi:10.48550/ARXIV.2501.13956

  31. [31]

    Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. 2021. Evaluating SZZ Implementations Through a Developer-informed Oracle. In43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 436–447. doi:10.1109/ICSE43902. 2021.00049

  32. [32]

    Yu Shi, Hao Li, Bram Adams, and Ahmed E. Hassan. 2025. HAFixAgent: History-Aware Automated Program Repair Agent.CoRRabs/2511.01047 (2025). arXiv:2511.01047 doi:10.48550/ARXIV.2511.01047

  33. [33]

    Danilo Silva and Marco Túlio Valente. 2017. RefDiff: detecting refactorings in version histories. InProceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, Buenos Aires, Argentina, May 20-28, 2017, Jesús M. González-Barahona, Abram Hindle, and Lin Tan (Eds.). IEEE Computer Society, 269–279. doi:10.1109/MSR.2017.14

  34. [34]

    Jacek Sliwerski, Thomas Zimmermann, and Andreas Zeller. 2005. When do changes induce fixes?. InProceedings of the 2005 International Workshop on Mining Software Repositories, MSR 2005, Saint Louis, Missouri, USA, May 17, 2005. ACM. doi:10.1145/1083142.1083147

  35. [35]

    Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction second edition.Adaptive computation and machine learning: The MIT Press, Cambridge, MA and London(2018)

  36. [36]

    Ming Tan, Lin Tan, Sashank Dara, and Caleb Mayeux. 2015. Online Defect Prediction for Imbalanced Data. In37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 2, Antonia Bertolino, Gerardo Canfora, and Sebastian G. Elbaum (Eds.). IEEE Computer Society, 99–108. doi:10.1109/ICSE.2015.139

  37. [37]

    Lingxiao Tang, Lingfeng Bao, Xin Xia, and Zhongdong Huang. 2023. Neural SZZ Algorithm. In38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023. IEEE, , Vol. 1, No. 1, Article . Publication date: February 2026. Beyond Blame: Rethinking SZZ with Knowledge Graph Search 21 1024–1035. doi:10.1109/...

  38. [38]

    Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, and Lingfeng Bao. 2025. LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models.Proc. ACM Softw. Eng.2, ISSTA (2025), 343–365. doi:10.1145/3728885

  39. [39]

    Lingxiao Tang, Chao Ni, Qiao Huang, and Lingfeng Bao. 2024. Enhancing Bug-Inducing Commit Identification: A Fine- Grained Semantic Analysis Approach.IEEE Trans. Software Eng.50, 11 (2024), 3037–3052. doi:10.1109/TSE.2024.3468296

  40. [40]

    Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. 2017. Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs. InProceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 3462–347...

  41. [41]

    Nikolaos Tsantalis, Matin Mansouri, Laleh Mousavi Eshkevari, Davood Mazinanian, and Danny Dig. 2018. Accurate and efficient refactoring detection in commit history. InProceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Har...

  42. [42]

    Guoqing Wang, Zeyu Sun, Yizhou Chen, Yifan Zhao, Haiyang Shen, Qingyuan Liang, and Dan Hao. 2025. Beyond the Sum of Parts: Leveraging Entanglement for Bug Inducing Commit Localization.IEEE Transactions on Software Engineering(2025), 1–19. doi:10.1109/TSE.2025.3628362

  43. [43]

    Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

    Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. 2025. OpenHands: An Open Platform for AI Software Developers as Generalist Agents....

  44. [44]

    Ming Wen, Rongxin Wu, Yepang Liu, Yongqiang Tian, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. 2019. Exploring and exploiting the correlations between bug-inducing and bug-fixing commits. InProceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Talli...

  45. [45]

    Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents.CoRRabs/2407.01489 (2024). arXiv:2407.01489 doi:10.48550/ARXIV.2407.01489

  46. [46]

    Davis, Peter Y

    Meng Yan, Xin Xia, Yuanrui Fan, Ahmed E. Hassan, David Lo, and Shanping Li. 2022. Just-In-Time Defect Identification and Localization: A Two-Phase Framework.IEEE Trans. Software Eng.48, 2 (2022), 82–101. doi:10.1109/TSE.2020.2978819

  47. [47]

    Boyang Yang, Haoye Tian, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, and Bach Le. 2025. Enhancing Repository- Level Software Repair via Repository-Aware Knowledge Graphs.CoRRabs/2503.21710 (2025). arXiv:2503.21710 doi:10.48550/ARXIV.2503.21710

  48. [48]

    Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

    John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

  49. [49]

    SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and...

  50. [50]

    Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, Maria Christakis and Michael Pradel (Eds.). ACM, 1592–1604. doi:10.1145/3650212.3680384 , Vo...