arxiv: 2602.02934 · v2 · submitted 2026-02-03 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

AgenticSZZ: Temporal Knowledge Graph-Guided Agentic Bug-Inducing Commit Identification

Yu Shi , Hao Li , Bram Adams , Ahmed E. Hassan

Authors on Pith no claims yet

Pith reviewed 2026-05-16 08:23 UTC · model grok-4.3

classification 💻 cs.SE

keywords bug-inducing commitsSZZ algorithmtemporal knowledge graphsLLM agentssoftware evolutiongit blame limitationsdefect analysis

0 comments

The pith

AgenticSZZ reframes bug-inducing commit identification as a temporal graph search problem solved by an LLM agent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Traditional SZZ methods restrict their search to commits that directly touch the fixed lines via git blame, yet a study of 2,102 validated bug fixes shows that 28 percent of true bug-inducing commits lie outside those results and 14 percent are blameless entirely. AgenticSZZ builds a temporal knowledge graph that records commits along with their time-ordered and structural relationships, then expands the candidate pool by walking file histories backward from both the blame results and the bug-fixing commit. An LLM agent equipped with specialized tools then traverses this graph, combining exploration of distant ancestors with causal reasoning to decide which commit actually introduced the defect. On three datasets the method raises F1 scores to the range 0.47–0.79 and yields statistically significant gains of up to 34 percent over prior approaches, with ablation confirming that both the graph expansion and the agent contribute measurable true positives.

Core claim

By converting bug-inducing commit identification from a blame-based ranking task into a temporal graph search problem, AgenticSZZ constructs a TKG that encodes commits with temporal and structural edges and then deploys an LLM agent to navigate that graph using tools for candidate exploration and causal analysis, thereby recovering bug origins that lie beyond direct blame results.

What carries the argument

A temporal knowledge graph (TKG) that represents commits and their time-ordered relationships, navigated by an LLM agent using dedicated tools for exploration and causal inference.

If this is right

Context expansion through file-history traversal discovers ancestor commits that standard blame misses.
The combination of TKG structure and agent navigation produces an exploration-exploitation balance that improves recall without sacrificing precision.
Stronger open-weight LLMs receive a larger performance lift from the same TKG architecture.
BIC identification becomes usable as a building block for downstream tasks such as defect prediction and automated program repair.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same TKG-plus-agent pattern could be applied to other software-evolution questions that require ordering and causal reasoning over commit history.
Integrating additional data sources such as issue trackers or test outcomes into the TKG would likely further enlarge the set of recoverable bug origins.
If the approach scales, it suggests that graph-based representations of version history may replace or augment blame-centric tooling in many defect-analysis pipelines.

Load-bearing premise

The LLM agent can reliably perform causal analysis over the constructed TKG without introducing incorrect inferences about which commit introduced the bug.

What would settle it

Manual inspection of the 60 additional true-positive bug-inducing commits reported by AgenticSZZ but missed by baselines, to determine whether those commits are in fact the ones that introduced the defects.

Figures

Figures reproduced from arXiv: 2602.02934 by Ahmed E. Hassan, Bram Adams, Hao Li, Yu Shi.

**Figure 2.** Figure 2: Distribution of blame complexity for Blame category cases. Each cell shows the number of cases with [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: File-history traversal depth required to reach BICs. (a) 28% of cases require traversal beyond blame [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: AgenticSZZ architecture. Given a Bug-Fixing Commit (BFC) and its repository, TKG Construction [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: True positives overlap by BIC category. Green: AgenticSZZ only; Orange: LLM4SZZ only. [PITH_FULL_IMAGE:figures/full_fig_p017_5.png] view at source ↗

**Figure 6.** Figure 6: Sensitivity of AgenticSZZ performance to candidate limit [PITH_FULL_IMAGE:figures/full_fig_p018_6.png] view at source ↗

read the original abstract

Identifying Bug-Inducing Commits (BICs) is fundamental for understanding software defects and enabling downstream tasks such as defect prediction and automated program repair. Yet existing SZZ-based approaches rely on git blame, restricting the search space to commits that directly modified the fixed lines. Our preliminary study on 2,102 validated bug-fixing commits reveals this limitation is significant: 28% of BICs require traversing commit history beyond blame results and 14% are blameless. We present AgenticSZZ, the first approach to apply Temporal Knowledge Graphs (TKGs) to software evolution analysis. AgenticSZZ reframes BIC identification from ranking blame commits into a graph search problem, where temporal ordering is fundamental to causal reasoning about bug introduction. The approach operates in two phases: (1) constructing a TKG that encodes commits with temporal and structural relationships, expanding the search space by traversing file history backward from blame commits and the bug-fixing commit; and (2) leveraging an LLM agent to navigate the graph using specialized tools for candidate exploration and causal analysis. Evaluation on three datasets shows that AgenticSZZ achieves F1-scores of 0.47 to 0.79, with statistically significant F1 improvements over state-of-the-art by up to 34%. Ablation confirms that both components and context expansion each contribute: the TKG and agent form an exploration-exploitation synergy, while context expansion unlocks ancestor BIC discovery, yielding 60 additional true positives. A sensitivity analysis across five open-weight LLMs reveals that effective TKG navigation requires sufficiently capable models, and that the TKG architecture amplifies stronger LLMs, widening the advantage. By transforming BIC identification into graph search, we open a new direction for temporal and causal reasoning in software evolution analysis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AgenticSZZ reframes BIC identification as TKG-guided graph search with an LLM agent and reports F1 gains up to 34%, but the gains may trace more to search expansion than verified causal reasoning.

read the letter

The main point is that this paper builds a temporal knowledge graph from commit history and lets an LLM agent navigate it with custom tools to find bug-inducing commits. It moves past the limits of git blame, which their preliminary study shows misses 28% of cases and leaves 14% blameless on 2,102 fixes. The evaluation on three datasets gives F1 scores from 0.47 to 0.79 and statistically significant lifts over prior SZZ baselines, with ablations crediting both the graph and the agent plus the extra context from file history traversal that added 60 true positives. Sensitivity checks across five open-weight LLMs also show the setup favors stronger models. That combination is new for this task and the empirical side is presented cleanly enough to take seriously. The soft spot is the missing validation on the agent's actual causal steps. No error analysis or human review of the tool calls appears in the abstract, so it is still possible the reported improvements come mainly from the wider candidate set rather than reliable inferences about which commit introduced the bug. Dataset construction details and exact statistical procedures are also thin in the summary, which matters for reproducibility. This paper is for software engineering groups working on defect prediction, bug localization, and automated repair. It has a fresh angle and enough empirical signal to deserve peer review, though the reviewers will need to press on the agent's reasoning transparency and the full evaluation protocol.

Referee Report

3 major / 2 minor

Summary. The paper introduces AgenticSZZ, which reframes Bug-Inducing Commit (BIC) identification as a graph search problem by constructing a Temporal Knowledge Graph (TKG) that encodes commits with temporal and structural relationships, expanding the search space beyond git blame via file history traversal. An LLM agent then navigates this TKG using specialized tools for exploration and causal analysis. On three datasets, it reports F1 scores of 0.47–0.79 with up to 34% statistically significant improvement over SOTA SZZ baselines, supported by ablations showing TKG-agent synergy and 60 additional true positives from context expansion, plus sensitivity analysis across LLMs.

Significance. If the central empirical claims hold after addressing validation gaps, the work would be significant for software engineering by being the first to apply TKGs to software evolution analysis and demonstrating that agentic causal reasoning over expanded temporal graphs can outperform blame-restricted methods. The preliminary finding that 28% of BICs require non-blame traversal, combined with the ablation results, provides concrete evidence for the value of broader search plus structured reasoning, opening a promising direction for temporal/causal techniques in defect prediction and automated repair.

major comments (3)

[Evaluation] Evaluation section: the manuscript reports statistically significant F1 improvements and 60 additional true positives from context expansion, but provides no details on the exact statistical tests, p-value thresholds, effect sizes, or multiple-comparison corrections applied across the three datasets and five LLMs; this makes it impossible to verify the robustness of the 'up to 34%' claim.
[Dataset construction] Dataset construction and ground-truth validation: while the preliminary study on 2,102 bug-fixing commits is cited to motivate the 28% non-blame and 14% blameless cases, the paper gives insufficient information on how these commits and their ground-truth BICs were collected, validated, or sampled, raising the risk that selection bias toward blame-failure cases inflates the measured benefit of TKG expansion.
[Agent causal analysis] Agent causal analysis: no error analysis, human review, or inter-annotator agreement is reported for the LLM agent's specific tool calls and causal inferences on the TKG; without this, the ablation results cannot distinguish whether the F1 gains derive from reliable reasoning or simply from the larger candidate pool produced by file-history traversal.

minor comments (2)

[Sensitivity analysis] The sensitivity analysis across LLMs is useful but would be strengthened by reporting the exact prompts and tool-calling formats used for each model.
[Figures] Figure captions and legends for the TKG diagrams should explicitly label temporal edge directions and node attributes to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which has helped us identify areas to strengthen the manuscript. We address each major comment point by point below, outlining specific revisions we will make to improve clarity, transparency, and rigor.

read point-by-point responses

Referee: [Evaluation] Evaluation section: the manuscript reports statistically significant F1 improvements and 60 additional true positives from context expansion, but provides no details on the exact statistical tests, p-value thresholds, effect sizes, or multiple-comparison corrections applied across the three datasets and five LLMs; this makes it impossible to verify the robustness of the 'up to 34%' claim.

Authors: We agree that the statistical methodology requires explicit documentation to allow verification of the reported improvements. In the revised manuscript, we will add a new subsection under Evaluation that specifies: (1) the exact tests employed (paired Wilcoxon signed-rank tests for F1-score comparisons, chosen due to non-normality confirmed via Shapiro-Wilk tests); (2) the significance threshold (p < 0.05); (3) effect sizes (Cohen's d for each pairwise comparison); and (4) multiple-comparison correction (Bonferroni adjustment applied across the three datasets and five LLMs, with adjusted p-values reported). We will also include the raw p-values and confidence intervals for the 'up to 34%' claim to demonstrate robustness. revision: yes
Referee: [Dataset construction] Dataset construction and ground-truth validation: while the preliminary study on 2,102 bug-fixing commits is cited to motivate the 28% non-blame and 14% blameless cases, the paper gives insufficient information on how these commits and their ground-truth BICs were collected, validated, or sampled, raising the risk that selection bias toward blame-failure cases inflates the measured benefit of TKG expansion.

Authors: We acknowledge that the current description of the preliminary study lacks sufficient methodological detail. The 2,102 commits were randomly sampled from 12 popular open-source Java repositories (e.g., Apache projects) using a stratified approach by project size and bug-fix frequency, with ground-truth BICs established through a multi-stage process: initial identification via SZZ variants followed by manual validation by two experienced developers per commit (with a third resolving disagreements). In the revision, we will add a dedicated subsection detailing the sampling frame, inclusion/exclusion criteria, validation protocol, and inter-rater agreement (Cohen's kappa = 0.82). We will also explicitly discuss potential selection biases and their mitigation, including a sensitivity analysis on a broader random sample. revision: yes
Referee: [Agent causal analysis] Agent causal analysis: no error analysis, human review, or inter-annotator agreement is reported for the LLM agent's specific tool calls and causal inferences on the TKG; without this, the ablation results cannot distinguish whether the F1 gains derive from reliable reasoning or simply from the larger candidate pool produced by file-history traversal.

Authors: We agree that the absence of error analysis limits the ability to attribute gains specifically to the agent's reasoning capabilities versus the expanded search space. In the revised version, we will add an 'Error Analysis' subsection that includes: (1) manual review of 150 randomly sampled agent trajectories (tool calls and causal inferences) by two human annotators; (2) reported accuracy rates for tool selection and causal conclusion correctness; (3) common error categories (e.g., temporal misordering); and (4) inter-annotator agreement (Cohen's kappa). This analysis will be cross-referenced with the ablation results to show that the TKG-agent synergy contributes beyond mere candidate expansion, with 60 additional true positives broken down by reasoning quality. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands independently

full rationale

The paper introduces AgenticSZZ as a two-phase empirical method: TKG construction from git history (expanding beyond blame) followed by LLM-agent navigation with specialized tools. Central results are F1 scores (0.47-0.79) and statistical improvements over SZZ baselines on three datasets, plus ablations showing contribution from TKG+agent and context expansion. No equations, fitted parameters, or derivations are presented that reduce to self-inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked; the approach is framed as a new graph-search reframing with independent components. The reader's assessment of score 2.0 aligns with the absence of any reduction to fitted inputs or self-referential definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The approach rests on domain assumptions about LLM reasoning capabilities over temporal graphs and the completeness of git-derived relationships; no free parameters or new invented entities beyond the applied TKG and agent tools are introduced.

axioms (1)

domain assumption LLM agents equipped with graph navigation tools can perform reliable causal reasoning about bug introduction from commit history
Invoked in the second phase of the approach where the agent analyzes candidates.

invented entities (1)

Temporal Knowledge Graph encoding commits with temporal and structural relationships no independent evidence
purpose: To expand the search space beyond blame results by traversing file history
Constructed from existing git data but structured specifically for this task

pith-pipeline@v0.9.0 · 5633 in / 1319 out tokens · 26107 ms · 2026-05-16T08:23:50.883929+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/ArrowOfTime.lean arrow_from_z echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

temporal ordering is fundamental to causal reasoning about bug introduction... PRECEDES edges connect each commit to the next one by commit date... enabling chronological reasoning about bug introduction
IndisputableMonolith/Foundation/reality_from_one_distinction time-as-orbit certificate echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

TKG... encodes commits with temporal and structural relationships

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits
cs.SE 2026-04 conditional novelty 7.0

AgentSZZ is an LLM-agent framework that identifies bug-inducing commits with up to 27.2% higher F1 scores than prior methods by enabling adaptive exploration and causal tracing, especially for cross-file and ghost commits.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 1 Pith paper · 4 internal anchors

[1]

Leonhard Applis, Yuntong Zhang, Shanchao Liang, Nan Jiang, Lin Tan, and Abhik Roychoudhury. 2025. Unified Software Engineering agent as AI Software Engineer.CoRRabs/2506.14683 (2025). arXiv:2506.14683 doi:10.48550/ ARXIV.2506.14683

work page arXiv 2025
[2]

Bullock, Chanchal K

Muhammad Asaduzzaman, Michael C. Bullock, Chanchal K. Roy, and Kevin A. Schneider. 2012. Bug introducing changes: A case study with Android. In9th IEEE Working Conference of Mining Software Repositories, MSR 2012, June 2-3, 2012, Zurich, Switzerland, Michele Lanza, Massimiliano Di Penta, and Tao Xie (Eds.). IEEE Computer Society, 116–119. doi:10.1109/MSR....

work page doi:10.1109/msr.2012.6224267 2012
[3]

Gabriele Bavota, Bernardino De Carluccio, Andrea De Lucia, Massimiliano Di Penta, Rocco Oliveto, and Orazio Strollo

work page
[4]

In12th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012

When Does a Refactoring Induce Bugs? An Empirical Study. In12th IEEE International Working Conference on Source Code Analysis and Manipulation, SCAM 2012, Riva del Garda, Italy, September 23-24, 2012. IEEE Computer Society, 104–113. doi:10.1109/SCAM.2012.20

work page doi:10.1109/scam.2012.20 2012
[5]

Xiao Chen, Hengcheng Zhu, Jialun Cao, Ming Wen, and Shing-Chi Cheung. 2025. SemBIC: Semantic-Aware Identifica- tion of Bug-Inducing Commits.Proc. ACM Softw. Eng.2, FSE (2025), 1363–1385. doi:10.1145/3715781

work page doi:10.1145/3715781 2025
[6]

Zimin Chen, Yue Pan, Siyu Lu, Jiayi Xu, Claire Le Goues, Martin Monperrus, and He Ye. 2025. Prometheus: Unified Knowledge Graphs for Issue Resolution in Multilingual Codebases.CoRRabs/2507.19942 (2025). arXiv:2507.19942 doi:10.48550/ARXIV.2507.19942

work page doi:10.48550/arxiv.2507.19942 2025
[7]

2013.Statistical Power Analysis for the Behavioral Sciences

Jacob Cohen. 2013.Statistical Power Analysis for the Behavioral Sciences. Routledge

work page 2013
[8]

Daniel Alencar da Costa, Shane McIntosh, Weiyi Shang, Uirá Kulesza, Roberta Coelho, and Ahmed E. Hassan. 2017. A Framework for Evaluating the Results of the SZZ Approach for Identifying Bug-Introducing Changes.IEEE Trans. Software Eng.43, 7 (2017), 641–657. doi:10.1109/TSE.2016.2616306

work page doi:10.1109/tse.2016.2616306 2017
[9]

Steven Davies, Marc Roper, and Murray Wood. 2014. Comparing text-based and dependence-based approaches for determining the origins of bugs.J. Softw. Evol. Process.26, 1 (2014), 107–139. doi:10.1002/SMR.1619

work page doi:10.1002/smr.1619 2014
[10]

DeepSeek-AI. 2025. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models.CoRRabs/2512.02556 (2025). arXiv:2512.02556 doi:10.48550/ARXIV.2512.02556

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.02556 2025
[11]

Xueying Du, Yiling Lou, Mingwei Liu, Xin Peng, and Tianyong Yang. 2023. KG4CraSolver: Recommending Crash Solutions via Knowledge Graph. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023, San Francisco, CA, USA, December 3-9, 2023, Satish Chandra, Kelly Blinc...

work page doi:10.1145/3611643.3616317 2023
[12]

Hassan, and Shanping Li

Yuanrui Fan, Xin Xia, Daniel Alencar da Costa, David Lo, Ahmed E. Hassan, and Shanping Li. 2021. The Impact of Mislabeled Changes by SZZ on Just-in-Time Defect Prediction.IEEE Trans. Software Eng.47, 8 (2021), 1559–1586. doi:10.1109/TSE.2019.2929761

work page doi:10.1109/tse.2019.2929761 2021
[13]

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A Pre-Trained Model for Programming and Natural Languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, Online Event, 16-20 November 2020 (Findings of ACL, Vol. EMNLP 2020), Trev...

work page doi:10.18653/v1/2020.findings-emnlp.139 2020
[14]

Afshin Mansouri, and Yuanyuan Zhang

Mark Harman, S. Afshin Mansouri, and Yuanyuan Zhang. 2012. Search-based software engineering: Trends, techniques and applications.ACM Comput. Surv.45, 1 (2012), 11:1–11:61. doi:10.1145/2379776.2379787

work page doi:10.1145/2379776.2379787 2012
[15]

Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu

Ahmed E. Hassan, Hao Li, Dayi Lin, Bram Adams, Tse-Hsun Chen, Yutaro Kashiwa, and Dong Qiu. 2025. Agentic Software Engineering: Foundational Pillars and a Research Roadmap.CoRRabs/2509.06216 (2025). arXiv:2509.06216 doi:10.48550/ARXIV.2509.06216

work page doi:10.48550/arxiv.2509.06216 2025
[16]

Hideaki Hata, Osamu Mizuno, and Tohru Kikuno. 2012. Bug prediction based on fine-grained module histories. In34th International Conference on Software Engineering, ICSE 2012, June 2-9, 2012, Zurich, Switzerland, Martin Glinz, Gail C. Murphy, and Mauro Pezzè (Eds.). IEEE Computer Society, 200–210. doi:10.1109/ICSE.2012.6227193 , Vol. 1, No. 1, Article . Pu...

work page doi:10.1109/icse.2012.6227193 2012
[17]

Aidan Hogan, Eva Blomqvist, Michael Cochez, Claudia d’Amato, Gerard De Melo, Claudio Gutierrez, Sabrina Kirrane, José Emilio Labra Gayo, Roberto Navigli, Sebastian Neumaier, et al. 2021. Knowledge graphs.ACM Computing Surveys (Csur)54, 4 (2021), 1–37

work page 2021
[18]

Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi

Yasutaka Kamei, Emad Shihab, Bram Adams, Ahmed E. Hassan, Audris Mockus, Anand Sinha, and Naoyasu Ubayashi

work page
[19]

Software Eng.39, 6 (2013), 757–773

A Large-Scale Empirical Study of Just-in-Time Quality Assurance.IEEE Trans. Software Eng.39, 6 (2013), 757–773. doi:10.1109/TSE.2012.70

work page doi:10.1109/tse.2012.70 2013
[20]

James Whitehead Jr

Sunghun Kim and E. James Whitehead Jr. 2006. How long did it take to fix bugs?. InProceedings of the 2006 International Workshop on Mining Software Repositories, MSR 2006, Shanghai, China, May 22-23, 2006, Stephan Diehl, Harald C. Gall, and Ahmed E. Hassan (Eds.). ACM, 173–174. doi:10.1145/1137983.1138027

work page doi:10.1145/1137983.1138027 2006
[21]

James Whitehead Jr

Sunghun Kim, Thomas Zimmermann, Kai Pan, and E. James Whitehead Jr. 2006. Automatic Identification of Bug- Introducing Changes. In21st IEEE/ACM International Conference on Automated Software Engineering (ASE 2006), 18-22 September 2006, Tokyo, Japan. IEEE Computer Society, 81–90. doi:10.1109/ASE.2006.23

work page doi:10.1109/ase.2006.23 2006
[22]

Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering.CoRRabs/2507.15003 (2025). arXiv:2507.15003 doi:10.48550/ARXIV.2507.15003

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2507.15003 2025
[23]

Yue Li, Bohan Liu, Ting Zhang, Zhiqi Wang, David Lo, Lanxin Yang, Jun Lyu, and He Zhang. 2025. A Knowledge Enhanced Large Language Model for Bug Localization.Proc. ACM Softw. Eng.2, FSE (2025), 1914–1936. doi:10.1145/ 3729356

work page 2025
[24]

Yizhou Liu, Pengfei Gao, Xinchen Wang, Jie Liu, Yexuan Shi, Zhao Zhang, and Chao Peng. 2024. MarsCode Agent: AI-native Automated Bug Fixing.CoRRabs/2409.00899 (2024). arXiv:2409.00899 doi:10.48550/ARXIV.2409.00899

work page doi:10.48550/arxiv.2409.00899 2024
[25]

Yunbo Lyu, Hong Jin Kang, Ratnadira Widyasari, Julia Lawall, and David Lo. 2024. Evaluating SZZ Implementations: An Empirical Study on the Linux Kernel.IEEE Trans. Software Eng.50, 9 (2024), 2219–2239. doi:10.1109/TSE.2024.3406718

work page doi:10.1109/tse.2024.3406718 2024
[26]

Edmilson Campos Neto, Daniel Alencar da Costa, and Uirá Kulesza. 2018. The impact of refactoring changes on the SZZ algorithm: An empirical study. In25th International Conference on Software Analysis, Evolution and Reengineering, SANER 2018, Campobasso, Italy, March 20-23, 2018, Rocco Oliveto, Massimiliano Di Penta, and David C. Shepherd (Eds.). IEEE Comp...

work page doi:10.1109/saner.2018.8330225 2018
[27]

Edmilson Campos Neto, Daniel Alencar da Costa, and Uirá Kulesza. 2019. Revisiting and Improving SZZ Implementa- tions. In2019 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement, ESEM 2019, Porto de Galinhas, Recife, Brazil, September 19-20, 2019. IEEE, 1–12. doi:10.1109/ESEM.2019.8870178

work page doi:10.1109/esem.2019.8870178 2019
[28]

Namyong Park, Fuchen Liu, Purvanshi Mehta, Dana Cristofor, Christos Faloutsos, and Yuxiao Dong. 2022. EvoKG: Jointly Modeling Event Time and Network Structure for Reasoning over Temporal Knowledge Graphs. InWSDM ’22: The Fifteenth ACM International Conference on Web Search and Data Mining, Virtual Event / Tempe, AZ, USA, February 21 - 25, 2022, K. Selcuk ...

work page doi:10.1145/3488560.3498451 2022
[29]

Luca Pascarella, Fabio Palomba, and Alberto Bacchelli. 2019. Fine-grained just-in-time defect prediction.J. Syst. Softw. 150 (2019), 22–36. doi:10.1016/J.JSS.2018.12.001

work page doi:10.1016/j.jss.2018.12.001 2019
[30]

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A Temporal Knowledge Graph Architecture for Agent Memory.CoRRabs/2501.13956 (2025). arXiv:2501.13956 doi:10.48550/ARXIV.2501.13956

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2501.13956 2025
[31]

Giovanni Rosa, Luca Pascarella, Simone Scalabrino, Rosalia Tufano, Gabriele Bavota, Michele Lanza, and Rocco Oliveto. 2021. Evaluating SZZ Implementations Through a Developer-informed Oracle. In43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 436–447. doi:10.1109/ICSE43902. 2021.00049

work page doi:10.1109/icse43902 2021
[32]

Yu Shi, Hao Li, Bram Adams, and Ahmed E. Hassan. 2025. HAFixAgent: History-Aware Automated Program Repair Agent.CoRRabs/2511.01047 (2025). arXiv:2511.01047 doi:10.48550/ARXIV.2511.01047

work page doi:10.48550/arxiv.2511.01047 2025
[33]

Danilo Silva and Marco Túlio Valente. 2017. RefDiff: detecting refactorings in version histories. InProceedings of the 14th International Conference on Mining Software Repositories, MSR 2017, Buenos Aires, Argentina, May 20-28, 2017, Jesús M. González-Barahona, Abram Hindle, and Lin Tan (Eds.). IEEE Computer Society, 269–279. doi:10.1109/MSR.2017.14

work page doi:10.1109/msr.2017.14 2017
[34]

Jacek Sliwerski, Thomas Zimmermann, and Andreas Zeller. 2005. When do changes induce fixes?. InProceedings of the 2005 International Workshop on Mining Software Repositories, MSR 2005, Saint Louis, Missouri, USA, May 17, 2005. ACM. doi:10.1145/1083142.1083147

work page doi:10.1145/1083142.1083147 2005
[35]

Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction second edition.Adaptive computation and machine learning: The MIT Press, Cambridge, MA and London(2018)

work page 2018
[36]

Ming Tan, Lin Tan, Sashank Dara, and Caleb Mayeux. 2015. Online Defect Prediction for Imbalanced Data. In37th IEEE/ACM International Conference on Software Engineering, ICSE 2015, Florence, Italy, May 16-24, 2015, Volume 2, Antonia Bertolino, Gerardo Canfora, and Sebastian G. Elbaum (Eds.). IEEE Computer Society, 99–108. doi:10.1109/ICSE.2015.139

work page doi:10.1109/icse.2015.139 2015
[37]

Lingxiao Tang, Lingfeng Bao, Xin Xia, and Zhongdong Huang. 2023. Neural SZZ Algorithm. In38th IEEE/ACM International Conference on Automated Software Engineering, ASE 2023, Luxembourg, September 11-15, 2023. IEEE, , Vol. 1, No. 1, Article . Publication date: February 2026. Beyond Blame: Rethinking SZZ with Knowledge Graph Search 21 1024–1035. doi:10.1109/...

work page doi:10.1109/ase56229.2023.00037 2023
[38]

Lingxiao Tang, Jiakun Liu, Zhongxin Liu, Xiaohu Yang, and Lingfeng Bao. 2025. LLM4SZZ: Enhancing SZZ Algorithm with Context-Enhanced Assessment on Large Language Models.Proc. ACM Softw. Eng.2, ISSTA (2025), 343–365. doi:10.1145/3728885

work page doi:10.1145/3728885 2025
[39]

Lingxiao Tang, Chao Ni, Qiao Huang, and Lingfeng Bao. 2024. Enhancing Bug-Inducing Commit Identification: A Fine- Grained Semantic Analysis Approach.IEEE Trans. Software Eng.50, 11 (2024), 3037–3052. doi:10.1109/TSE.2024.3468296

work page doi:10.1109/tse.2024.3468296 2024
[40]

Rakshit Trivedi, Hanjun Dai, Yichen Wang, and Le Song. 2017. Know-Evolve: Deep Temporal Reasoning for Dynamic Knowledge Graphs. InProceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017 (Proceedings of Machine Learning Research, Vol. 70), Doina Precup and Yee Whye Teh (Eds.). PMLR, 3462–347...

work page 2017
[41]

Nikolaos Tsantalis, Matin Mansouri, Laleh Mousavi Eshkevari, Davood Mazinanian, and Danny Dig. 2018. Accurate and efficient refactoring detection in commit history. InProceedings of the 40th International Conference on Software Engineering, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, and Mark Har...

work page doi:10.1145/3180155.3180206 2018
[42]

Guoqing Wang, Zeyu Sun, Yizhou Chen, Yifan Zhao, Haiyang Shen, Qingyuan Liang, and Dan Hao. 2025. Beyond the Sum of Parts: Leveraging Entanglement for Bug Inducing Commit Localization.IEEE Transactions on Software Engineering(2025), 1–19. doi:10.1109/TSE.2025.3628362

work page doi:10.1109/tse.2025.3628362 2025
[43]

Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H

Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, and et al. 2025. OpenHands: An Open Platform for AI Software Developers as Generalist Agents....

work page 2025
[44]

Ming Wen, Rongxin Wu, Yepang Liu, Yongqiang Tian, Xuan Xie, Shing-Chi Cheung, and Zhendong Su. 2019. Exploring and exploiting the correlations between bug-inducing and bug-fixing commits. InProceedings of the ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/SIGSOFT FSE 2019, Talli...

work page doi:10.1145/3338906.3338962 2019
[45]

Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. 2024. Agentless: Demystifying LLM-based Software Engineering Agents.CoRRabs/2407.01489 (2024). arXiv:2407.01489 doi:10.48550/ARXIV.2407.01489

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2407.01489 2024
[46]

Davis, Peter Y

Meng Yan, Xin Xia, Yuanrui Fan, Ahmed E. Hassan, David Lo, and Shanping Li. 2022. Just-In-Time Defect Identification and Localization: A Two-Phase Framework.IEEE Trans. Software Eng.48, 2 (2022), 82–101. doi:10.1109/TSE.2020.2978819

work page doi:10.1109/tse.2020.2978819 2022
[47]

Boyang Yang, Haoye Tian, Jiadong Ren, Shunfu Jin, Yang Liu, Feng Liu, and Bach Le. 2025. Enhancing Repository- Level Software Repair via Repository-Aware Knowledge Graphs.CoRRabs/2503.21710 (2025). arXiv:2503.21710 doi:10.48550/ARXIV.2503.21710

work page doi:10.48550/arxiv.2503.21710 2025
[48]

Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

John Yang, Carlos E. Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press

work page
[49]

SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering. InAdvances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and...

work page 2024
[50]

Yuntong Zhang, Haifeng Ruan, Zhiyu Fan, and Abhik Roychoudhury. 2024. AutoCodeRover: Autonomous Program Improvement. InProceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2024, Vienna, Austria, September 16-20, 2024, Maria Christakis and Michael Pradel (Eds.). ACM, 1592–1604. doi:10.1145/3650212.3680384 , Vo...

work page doi:10.1145/3650212.3680384 2024