pith. machine review for the scientific record.

arxiv: 2604.02665 · v1 · submitted 2026-04-03 · 💻 cs.SE

Recognition: no theorem link

AgentSZZ: Teaching the LLM Agent to Play Detective with Bug-Inducing Commits

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 20:31 UTC · model grok-4.3

classification 💻 cs.SE
keywords SZZ algorithm · bug-inducing commits · LLM agents · software repository mining · defect prediction · causal tracing · ReAct reasoning · git blame limitations

The pith

An LLM agent with task-specific tools and a reasoning loop identifies bug-inducing commits more accurately than prior SZZ methods by succeeding where git blame fails.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgentSZZ, an agent-based system that uses large language models to explore code repositories and locate the commits that introduced bugs. Traditional SZZ approaches depend on git blame to track line changes within single files, which leaves many cases untraceable when changes span files or lines disappear entirely. AgentSZZ instead gives the model custom tools for repository navigation, domain knowledge about bug patterns, and a step-by-step reasoning cycle that mimics how developers investigate issues. This yields higher accuracy overall and especially large gains on the hardest commit types. The result matters for any downstream task that relies on knowing exactly which change caused a defect.

Core claim

AgentSZZ is an agent-based framework that leverages LLM-driven agents to explore repositories and identify bug-inducing commits. It integrates task-specific tools, domain knowledge, and a ReAct-style loop to enable adaptive and causal tracing of bugs, supported by a structured compression module that reduces redundant context while preserving key evidence.

What carries the argument

AgentSZZ, an LLM agent that runs a ReAct-style reasoning loop equipped with repository exploration tools and bug-domain knowledge to perform iterative causal tracing of commit origins.
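The figure text specifies the toolset concretely: five tools wrapping git blame, git show, git log -S, git log -L, and git grep, exposed to the model as function-calling schemas. A minimal sketch of how such a loop could be wired together, assuming a generic function-calling LLM client; the llm callable, REPO path, and tool signatures are illustrative assumptions, not the paper's implementation.

    import subprocess

    REPO = "/path/to/checkout"  # hypothetical local clone of the project under study

    def git(*args: str) -> str:
        """Run one Git command in the repository and return its stdout."""
        return subprocess.run(["git", "-C", REPO, *args],
                              capture_output=True, text=True).stdout

    # The five tools named in the paper, each wrapping one standard Git command.
    TOOLS = {
        "blame": lambda rev, path: git("blame", rev, "--", path),
        "show": lambda rev: git("show", rev),
        "log_s": lambda snippet: git("log", "-S", snippet, "--oneline"),
        "log_l": lambda span, path: git("log", "-L", f"{span}:{path}", "--oneline"),
        "grep": lambda pattern, rev: git("grep", pattern, rev),
    }

    def react_loop(llm, bug_fixing_commit: str, max_steps: int = 20) -> str:
        """Alternate reasoning and tool calls until the agent commits to a
        bug-inducing commit (BIC) or exhausts its step budget."""
        context = [f"Trace the origin of bug-fixing commit {bug_fixing_commit}"]
        for _ in range(max_steps):
            # `llm` stands in for a function-calling chat completion that
            # returns either a tool invocation or a final answer.
            action = llm(context, tools=list(TOOLS))
            if action["name"] == "final_answer":
                return action["arguments"]["bic"]
            observation = TOOLS[action["name"]](**action["arguments"])
            context.append(f"{action['name']} -> {observation[:2000]}")
        return "inconclusive"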

If this is right

  • SZZ-dependent tasks such as defect prediction and vulnerability analysis receive more complete and accurate input data because recall rises sharply on cross-file and ghost commits.
  • The compression module cuts token consumption by more than 30 percent with negligible accuracy loss, making repeated agent runs practical on large repositories (one plausible shape for such a module is sketched after this list).
  • Ablation results establish that removing either the task-specific tools or the domain knowledge sharply reduces performance, confirming both components are required.
  • The ReAct loop replaces fixed pipelines, allowing the agent to adapt its search strategy to the structure of each bug report.
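The paper describes the compression step only as structured and evidence-preserving, so the following is a guess at one simple shape it could take; the keyword heuristic, head size, and line budget are invented knobs, not the authors' design.

    def compress_observation(text: str, keywords: list[str],
                             head: int = 5, budget: int = 40) -> str:
        """Keep the first few lines of a tool output plus any line mentioning
        an investigation keyword (file, function, or commit hash), and collapse
        everything else into a single elision marker."""
        kept, dropped = [], 0
        for i, line in enumerate(text.splitlines()):
            relevant = i < head or any(k in line for k in keywords)
            if relevant and len(kept) < budget:
                kept.append(line)
            else:
                dropped += 1
        if dropped:
            kept.append(f"... [{dropped} lines compressed] ...")
        return "\n".join(kept)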

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same agent structure could be reused for related software-history tasks such as tracing the introduction of security vulnerabilities or performance regressions.
  • Interactive tool-augmented agents may outperform static analysis pipelines in other code-investigation settings that require multi-step causal reasoning.
  • The approach suggests a general pattern: supplying LLMs with narrow, high-precision tools plus domain rules can close performance gaps that pure text-based prompting leaves open.

Load-bearing premise

LLM agents supplied with the described tools and domain knowledge can reliably perform adaptive causal tracing of bug origins inside real code repositories, even in cases where line-based git blame cannot succeed.

What would settle it

A direct head-to-head evaluation on a fresh set of developer-annotated bug-inducing commits that contains a high proportion of cross-file and ghost cases; if AgentSZZ shows no substantial F1 improvement over the strongest prior LLM-based SZZ baseline, the central claim is false.
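For concreteness, such a test would turn on set-level precision, recall, and F1 of predicted against annotated bug-inducing commits; a minimal sketch of one common per-bug scoring convention (SZZ studies differ on aggregation, so this is not necessarily the paper's exact protocol):

    def prf1(predicted: set[str], annotated: set[str]) -> tuple[float, float, float]:
        """Precision, recall, and F1 for one bug-fixing commit, comparing the
        predicted bug-inducing commits against developer annotations."""
        tp = len(predicted & annotated)
        precision = tp / len(predicted) if predicted else 0.0
        recall = tp / len(annotated) if annotated else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1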

Figures

Figures reproduced from arXiv: 2604.02665 by Chengran Yang, David Lo, Hong Jin Kang, Jieke Shi, Julia Lawall, Junda He, Junkai Chen, Ratnadira Widyasari, Yunbo Lyu, Yuqing Niu, Zhou Yang.

Figure 1. Overview of AgentSZZ, illustrating the end-to-end workflow from preprocessing (1-3), through the ReAct investigation loop (4-7), to output generation (8-9). Given a bug-fixing commit (BFC), the agent investigates the repository through tool interactions guided by domain knowledge, with context compression, and outputs the identified bug-inducing commit (BIC).
Figure 2. Prompt for AgentSZZ, encoding domain knowledge for Bug-Inducing Commit (BIC) investigation. (The accompanying text notes that all five tools wrap standard Git commands, git blame, git show, git log -S, git log -L, and git grep, exposed as function-calling schemas via the API tools parameter.)
Figure 3. A successful case of AgentSZZ identifying a cross-file bug-inducing commit: a bug in the Linux kernel's SCSI subsystem whose fix modifies mvsas/mv_sas.c while the root cause lies in the core library libsas/sas_event.c, where the event-handling mechanism was rewritten four years earlier.
Figure 4. Distribution of tool usage across investigation steps.
Original abstract

The SZZ algorithm is the dominant technique for identifying bug-inducing commits and underpins many software engineering tasks, such as defect prediction and vulnerability analysis. Despite numerous variants, including recent LLM-based approaches, performance remains limited on developer-annotated datasets (e.g., recall of 0.552 on the Linux kernel). A key limitation is the reliance on git blame, which traces line-level changes within the same file, failing in common scenarios such as ghost and cross-file cases, making nearly one-quarter of bug-inducing commits inherently untraceable. Moreover, current approaches follow fixed pipelines that restrict iterative reasoning and exploration, unlike developers who investigate bugs through an interactive, multi-tool process. To address these challenges, we propose AgentSZZ, an agent-based framework that leverages LLM-driven agents to explore repositories and identify bug-inducing commits. Unlike prior methods, AgentSZZ integrates task-specific tools, domain knowledge, and a ReAct-style loop to enable adaptive and causal tracing of bugs. A structured compression module further improves efficiency by reducing redundant context while preserving key evidence. Extensive experiments on three widely used datasets show that AgentSZZ consistently outperforms state-of-the-art SZZ algorithms across all settings, achieving F1-score gains of up to 27.2% over prior LLM-based approaches. The improvements are especially pronounced in challenging scenarios such as cross-file and ghost commits, with recall gains of up to 300% and 60%, respectively. Ablation studies show that task-specific tools and domain knowledge are critical, while compressing tool outputs reduces token consumption by over 30% with negligible impact. The replication package is available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AgentSZZ, an LLM-driven agent framework that employs a ReAct-style iterative loop, task-specific tools, domain knowledge, and a structured compression module to identify bug-inducing commits. Unlike prior SZZ variants that rely on fixed git-blame pipelines, AgentSZZ performs adaptive causal tracing; experiments on three developer-annotated datasets report consistent F1 gains of up to 27.2% over prior LLM-based SZZ methods, with especially large recall improvements (up to 300% and 60%) on cross-file and ghost commits.

Significance. If the empirical claims hold under proper statistical controls, the work would demonstrate that tool-augmented LLM agents can overcome well-known limitations of line-level blame in real repositories, offering a practical advance for downstream tasks such as defect prediction and vulnerability analysis. The availability of a replication package and the ablation results on tools and compression are positive indicators of reproducibility.

major comments (2)
  1. [Experimental results section] Experimental results section (and associated tables): all headline metrics (the 27.2% F1 gain, recall gains of 300% / 60% on the cross-file / ghost subsets) are reported as single point estimates with no run-to-run variance, standard deviations, or statistical significance tests against the SOTA baselines. Because the ReAct agent performs stochastic tool use and iterative reasoning, the absence of these controls directly undermines the central claim of “consistent outperformance across all settings.”
  2. [Experimental results section] Baseline and data-preparation description: the manuscript does not provide sufficient detail on how the prior SZZ and LLM-based baselines were re-implemented, which data splits were used, or whether any post-hoc selection of runs occurred. Without this information it is impossible to verify that the reported deltas are not artifacts of implementation differences or cherry-picking.
minor comments (2)
  1. [Abstract] The abstract states that experiments were run on “three widely used datasets” but does not name them; adding the dataset names would improve immediate readability.
  2. [Method section] Notation for the compression module and tool interfaces could be clarified with a small diagram or pseudocode snippet to make the ReAct loop easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that will strengthen the empirical claims and reproducibility of the work.

Point-by-point responses
  1. Referee: [Experimental results section] Experimental results section (and associated tables): all headline metrics (the 27.2% F1 gain, recall gains of 300% / 60% on the cross-file / ghost subsets) are reported as single point estimates with no run-to-run variance, standard deviations, or statistical significance tests against the SOTA baselines. Because the ReAct agent performs stochastic tool use and iterative reasoning, the absence of these controls directly undermines the central claim of “consistent outperformance across all settings.”

    Authors: We agree that single-point estimates are insufficient given the stochastic nature of the ReAct loop. In the revised manuscript we will rerun all experiments at least five times with different random seeds, report mean F1/recall scores together with standard deviations, and add statistical significance tests (Wilcoxon signed-rank test with Bonferroni correction; a sketch of this test follows these responses) against each baseline. Updated tables and a new paragraph in the experimental results section will document these controls. revision: yes

  2. Referee: [Experimental results section] Baseline and data-preparation description: the manuscript does not provide sufficient detail on how the prior SZZ and LLM-based baselines were re-implemented, which data splits were used, or whether any post-hoc selection of runs occurred. Without this information it is impossible to verify that the reported deltas are not artifacts of implementation differences or cherry-picking.

    Authors: We acknowledge the gap in methodological transparency. The replication package already contains the exact baseline re-implementation scripts and dataset files. In the revised version we will expand the Experimental Setup subsection with: (1) a step-by-step description of how each baseline (including the prior LLM-based SZZ methods) was re-implemented, (2) explicit confirmation that the three developer-annotated datasets were used in their entirety with no custom train/test splits, and (3) a statement that no post-hoc run selection occurred. We will also add a direct pointer to the replication package in the main text. revision: yes
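The test named in the first response is standard; a minimal sketch of the paired comparison, assuming per-seed F1 scores paired between AgentSZZ and one baseline (the pairing unit and the scope of the Bonferroni correction are assumptions):

    from scipy.stats import wilcoxon

    def significant(agent_f1: list[float], baseline_f1: list[float],
                    n_comparisons: int, alpha: float = 0.05) -> bool:
        """Paired Wilcoxon signed-rank test with a Bonferroni-corrected
        threshold shared across n_comparisons baseline comparisons."""
        _, p = wilcoxon(agent_f1, baseline_f1)
        return p < alpha / n_comparisons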

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

Full rationale

The paper presents an empirical framework (AgentSZZ) evaluated on three public datasets against prior SZZ baselines. Performance metrics (F1, recall) are computed directly from agent outputs on held-out bug-inducing commits; no equations, fitted parameters, or derivations reduce the reported gains to inputs defined by the same data. Ablation results and tool descriptions are likewise measured outcomes rather than self-referential. Self-citations to prior SZZ work are external benchmarks, not load-bearing justifications for the central claim. The evaluation chain is therefore independent of the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the empirical claim that LLM agents can outperform fixed pipelines when given appropriate tools and knowledge; no free parameters are explicitly fitted in the abstract, and no new physical or mathematical entities are postulated.

axioms (1)
  • domain assumption LLM agents can effectively use provided tools and domain knowledge to explore repositories and perform causal tracing of bugs
    This assumption underpins the ReAct-style loop and the claimed superiority in ghost and cross-file cases.

pith-pipeline@v0.9.0 · 5630 in / 1340 out tokens · 38719 ms · 2026-05-13T20:31:27.455487+00:00 · methodology

