arxiv: 2403.17134 · v2 · pith:UL3V7M5G · submitted 2024-03-25 · 💻 cs.SE · cs.AI

RepairAgent: An Autonomous, LLM-Based Agent for Program Repair

Pith reviewed 2026-05-19 10:16 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords automated program repair · large language models · autonomous agents · software bug fixing · Defects4J · tool-based repair

The pith

A large language model can autonomously repair software bugs by deciding on tools and actions via a state machine and dynamic prompt.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper shows how to turn a large language model into an agent that fixes bugs on its own rather than following a fixed prompt or loop. The agent gathers information about the bug, collects possible fixes, and checks results by calling tools it chooses based on what it has learned so far. A reader would care because bugs harm software reliability and this setup makes automated repair more adaptable than earlier methods. On the Defects4J dataset the agent fixes 164 bugs, 39 of which no previous technique had repaired.

Core claim

RepairAgent is the first autonomous LLM-based agent for program repair. Unlike prior deep learning approaches that use a fixed prompt or fixed feedback loop, it treats the LLM as an agent that plans and executes actions by invoking tools. The agent interleaves gathering bug information, collecting repair ingredients, and validating fixes, choosing the next tool based on gathered data and feedback from earlier attempts. This is supported by a set of program repair tools, a dynamically updated prompt, and a finite state machine that guides tool use.

What carries the argument

The finite state machine that guides tool invocations together with a dynamically updated prompt allowing the LLM to interact with repair tools and respond to feedback.
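
The control structure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the state names, tool names, and transition table are hypothetical, chosen only to mirror the three activities named in the core claim (gathering bug information, collecting repair ingredients, validating fixes). The sketch shows the two ideas that carry the argument: the state machine restricts which tools may be invoked at each point, and the prompt is rebuilt every cycle from the current state and the accumulated tool results.

```python
from enum import Enum, auto


class State(Enum):
    GATHER_INFO = auto()
    COLLECT_INGREDIENTS = auto()
    VALIDATE_FIX = auto()
    DONE = auto()


# Hypothetical tool names; the source does not list the agent's actual tools.
TOOLS_BY_STATE = {
    State.GATHER_INFO: {"read_code", "run_tests", "start_collecting"},
    State.COLLECT_INGREDIENTS: {"search_code", "propose_fix", "gather_more_info"},
    State.VALIDATE_FIX: {"run_tests", "declare_done", "gather_more_info"},
    State.DONE: set(),
}

# Tools that move the agent to another state; any other tool keeps the state.
TRANSITIONS = {
    (State.GATHER_INFO, "start_collecting"): State.COLLECT_INGREDIENTS,
    (State.COLLECT_INGREDIENTS, "propose_fix"): State.VALIDATE_FIX,
    (State.COLLECT_INGREDIENTS, "gather_more_info"): State.GATHER_INFO,
    (State.VALIDATE_FIX, "gather_more_info"): State.GATHER_INFO,
    (State.VALIDATE_FIX, "declare_done"): State.DONE,
}


def build_prompt(state, history):
    """Rebuild the prompt from the current state and past tool results."""
    lines = [
        f"Current state: {state.name}",
        "Available tools: " + ", ".join(sorted(TOOLS_BY_STATE[state])),
        "History:",
    ]
    lines += [f"- {tool} -> {result}" for tool, result in history]
    return "\n".join(lines)


def step(state, tool):
    """Reject tools not offered in this state, else return the next state."""
    if tool not in TOOLS_BY_STATE[state]:
        raise ValueError(f"{tool!r} is not available in state {state.name}")
    return TRANSITIONS.get((state, tool), state)


# Scripted tool choices stand in for the LLM's decisions.
state, history = State.GATHER_INFO, []
for tool in ["read_code", "start_collecting", "propose_fix"]:
    state = step(state, tool)
    history.append((tool, "ok"))
prompt = build_prompt(state, history)
print(prompt)
```

In this sketch the model never sees a tool that the current state does not offer, which is one plausible way a state machine can "guide" tool use without dictating a fixed order of actions.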

If this is right

  • Automated repair can interleave information gathering, ingredient collection, and validation in a flexible order chosen by the agent.
  • The approach repairs bugs that fixed-prompt methods miss, adding 39 new fixes on Defects4J.
  • The cost remains modest, averaging 270,000 tokens, or about 14 US cents, per bug under the GPT-3.5 pricing in effect when the paper was written.
  • Agent designs with state guidance and tool sets open the door to similar autonomous methods across software engineering tasks.
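
The cost figure can be sanity-checked with a short calculation. The sketch below uses only the two numbers reported in the abstract (270,000 tokens and 14 cents per bug) and derives the blended price per million tokens they imply. It does not assume any particular split between input and output tokens, which the source does not give.

```python
TOKENS_PER_BUG = 270_000   # average tokens per bug, from the abstract
COST_PER_BUG_USD = 0.14    # 14 cents per bug, from the abstract

# Implied blended price per million tokens (input and output combined).
implied_usd_per_million = COST_PER_BUG_USD / TOKENS_PER_BUG * 1_000_000

print(f"Implied blended price: ${implied_usd_per_million:.2f} per 1M tokens")
```

The two reported figures are therefore consistent with a blended rate of roughly half a dollar per million tokens.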

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Stronger future language models could raise the repair rate by better following state guidance and avoiding loops.
  • Adding more program analysis tools to the agent's set might help with bugs that current tools cannot address.
  • The same pattern of dynamic prompts plus state machine could be tested on related tasks such as test-case generation.

Load-bearing premise

The large language model will reliably select and invoke the right tools, interpret tool feedback, and avoid unproductive loops when guided only by the finite state machine and dynamically updated prompt.

What would settle it

Testing the agent on a fresh collection of bugs and finding that it often picks unsuitable tools or enters repeated unproductive cycles without producing fixes would show the claim does not hold.

Original abstract

Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces RepairAgent, the first autonomous LLM-based agent for program repair. Unlike prior DL approaches that use fixed prompts or loops, RepairAgent equips an LLM with repair-specific tools (for information gathering, ingredient collection, and validation), a dynamically updated prompt, and a finite state machine to guide tool invocation and interleaving of actions. The central claim is an empirical evaluation on Defects4J showing autonomous repair of 164 bugs, including 39 not fixed by prior techniques, at an average cost of 270,000 tokens (~14 cents USD with GPT-3.5) per bug.

Significance. If the autonomy premise holds, the result is significant as the first concrete demonstration of an agentic, tool-using LLM approach to automated program repair on a standard benchmark. It moves beyond static prompting, quantifies practical costs, and explicitly opens a new direction for agent-based techniques in software engineering. The use of Defects4J enables direct comparison with existing APR literature.

major comments (3)
  1. §5 (Evaluation) and abstract: The headline result of 164 autonomously repaired bugs (including 39 novel) provides no breakdown of the number of independent runs performed, statistical significance tests, or the fraction of repairs that completed without external restarts, human overrides, or post-hoc filtering. This detail is load-bearing for the claim that the FSM and dynamic prompt suffice to keep the LLM from unproductive loops or tool misuse.
  2. §3 (Agent Architecture): The finite state machine is presented as guiding tool selection and termination, yet the transition rules for handling ambiguous tool outputs, LLM misparsing of observations, or failed repair paths are not exhaustively specified. Without these, it is unclear how the architecture guarantees termination when driven by a stochastic LLM.
  3. §5.2 (Comparison to baselines): The count of 39 bugs not fixed by prior techniques requires an explicit enumeration of the exact prior tools/techniques, their search budgets, and model configurations used in the comparison. Differences in experimental protocol could inflate the novelty claim.
minor comments (3)
  1. Figure 2 (state machine diagram): The labels on transitions are difficult to read at standard print size; enlarging or adding a textual legend would improve clarity.
  2. Related work section: While the paper positions itself as the first autonomous agent for APR, a brief discussion of contemporaneous LLM-agent work in other SE tasks (e.g., test generation) would better contextualize novelty.
  3. Table 1 (bug repair counts): Include standard deviation or range across runs for the token-cost column to accompany the reported average of 270,000 tokens.
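
The spread statistics requested in the last minor comment are cheap to compute once per-bug token counts are available. The sketch below shows one way to do it with Python's standard `statistics` module. The token counts are invented purely for illustration and are not taken from the paper; the `summarize` helper is likewise hypothetical.

```python
import statistics

# Hypothetical per-bug token counts, invented for illustration only.
tokens_per_bug = [100_000, 200_000, 300_000, 400_000]


def summarize(counts):
    """Return the mean, sample standard deviation, and range of the counts."""
    return {
        "mean": statistics.mean(counts),
        "stdev": statistics.stdev(counts),  # sample (n - 1) standard deviation
        "min": min(counts),
        "max": max(counts),
    }


summary = summarize(tokens_per_bug)
print(f"mean={summary['mean']:,} stdev={summary['stdev']:,.0f} "
      f"range=[{summary['min']:,}, {summary['max']:,}]")
```

Reporting the range alongside the mean would show whether a few expensive bugs dominate the average.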

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and will incorporate revisions to improve clarity and rigor where appropriate.

  1. Referee: §5 (Evaluation) and abstract: The headline result of 164 autonomously repaired bugs (including 39 novel) provides no breakdown of the number of independent runs performed, statistical significance tests, or the fraction of repairs that completed without external restarts, human overrides, or post-hoc filtering. This detail is load-bearing for the claim that the FSM and dynamic prompt suffice to keep the LLM from unproductive loops or tool misuse.

    Authors: We agree that additional experimental details are necessary to fully substantiate the autonomy claims. In the revised manuscript, we will expand §5 to specify that each bug was evaluated in a single autonomous run with no external restarts, human overrides, or post-hoc filtering applied. We will also note the absence of multiple independent runs per bug (as the agent is designed for one-shot autonomous repair) and include any relevant observations on termination behavior. This directly supports the role of the FSM and dynamic prompts in avoiding unproductive loops. revision: yes

  2. Referee: §3 (Agent Architecture): The finite state machine is presented as guiding tool selection and termination, yet the transition rules for handling ambiguous tool outputs, LLM misparsing of observations, or failed repair paths are not exhaustively specified. Without these, it is unclear how the architecture guarantees termination when driven by a stochastic LLM.

    Authors: The core FSM states and transitions are outlined in §3 to enforce progress toward repair or termination after a bounded number of steps. We acknowledge that more exhaustive coverage of edge cases would strengthen the presentation. In the revision, we will add explicit transition rules for ambiguous tool outputs (defaulting to an information-gathering state with updated prompt), LLM misparsing (retry with error feedback), and failed paths (transition to termination after maximum attempts), thereby clarifying the safeguards against non-termination. revision: yes

  3. Referee: §5.2 (Comparison to baselines): The count of 39 bugs not fixed by prior techniques requires an explicit enumeration of the exact prior tools/techniques, their search budgets, and model configurations used in the comparison. Differences in experimental protocol could inflate the novelty claim.

    Authors: We will revise §5.2 to include a comprehensive table enumerating all prior techniques used in the comparison, along with their reported search budgets, model configurations, and whether results were taken from original publications or reproduced under our protocol. This will provide full transparency and allow readers to assess the novelty of the 39 fixes under comparable conditions. revision: yes
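
The three safeguards named in the second response (retry with error feedback on a misparsed reply, fall back to information gathering on an ambiguous tool output, and terminate after a maximum number of attempts) can be sketched as a bounded loop. This is an illustration of the idea only, not the paper's code: the JSON reply format, the two state names, the outcome labels, and the default budget of five steps are all assumptions made for the example.

```python
import json

MAX_STEPS = 5  # hypothetical budget; the source gives no concrete bound


def parse_action(raw):
    """Return (tool, args) from a JSON reply, or None if it cannot be parsed."""
    try:
        data = json.loads(raw)
        return data["tool"], data.get("args", {})
    except (json.JSONDecodeError, KeyError, TypeError):
        return None


def run_agent(llm, run_tool, max_steps=MAX_STEPS):
    """Drive the loop until a fix validates or the step budget runs out."""
    state, feedback = "gather_info", ""
    for _ in range(max_steps):
        action = parse_action(llm(state, feedback))
        if action is None:
            # Misparsed reply: keep the state and retry with error feedback.
            feedback = "Previous reply was not valid JSON with a 'tool' key."
            continue
        tool, args = action
        outcome = run_tool(tool, args)
        if outcome == "fixed":
            return "fixed"
        if outcome == "ambiguous":
            # Inconclusive tool output: fall back to information gathering.
            state = "gather_info"
        else:
            # Clear but unsuccessful result: keep trying within the budget.
            state = "try_fix"
        feedback = f"{tool} returned: {outcome}"
    # Budget exhausted: terminate instead of looping forever.
    return "gave_up"


# Scripted stand-ins for the LLM and the tools.
replies = iter(["not json", '{"tool": "run_tests"}', '{"tool": "write_fix"}'])
outcomes = {"run_tests": "ambiguous", "write_fix": "fixed"}
result = run_agent(lambda state, feedback: next(replies),
                   lambda tool, args: outcomes[tool])
print(result)
```

Because every iteration consumes one unit of the budget, including misparsed replies, the loop terminates regardless of how the model behaves; whether it terminates with a correct fix is a separate, empirical question.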

Circularity Check

0 steps flagged

Empirical system evaluation with no circular derivation chain

The paper presents an agent architecture (tools, dynamic prompt, finite state machine) and reports direct empirical counts of bugs repaired on the external Defects4J benchmark. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations are described that would reduce the 164-bug result to a tautology or construction. The evaluation is a measurement of system behavior rather than a derivation that collapses to its own premises.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard assumptions in the program-repair literature plus the untested premise that the LLM will behave as a reliable agent under the described control structure.

axioms (1)
  • Domain assumption: Defects4J constitutes a representative and sufficiently challenging benchmark for evaluating program repair techniques.
    All performance claims are measured against this single dataset.

pith-pipeline@v0.9.0 · 5798 in / 1183 out tokens · 60579 ms · 2026-05-19T10:16:53.832425+00:00 · methodology

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents

    cs.SE 2026-05 unverdicted novelty 7.0

    PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.

  2. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 accept novelty 7.0

    LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.

  3. HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair

    cs.SE 2026-05 unverdicted novelty 7.0

    HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.

  4. Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis

    cs.SE 2026-04 unverdicted novelty 7.0

    ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.

  5. DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging

    cs.SE 2026-04 unverdicted novelty 7.0

    DebugRepair improves LLM-based automated program repair by adding test semantic purification, simulated instrumentation, and debugging-driven conversational repair, fixing 224 Defects4J bugs with GPT-3.5 (26.2% above ...

  6. ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

    cs.SE 2026-04 unverdicted novelty 7.0

    ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...

  7. RACC: Representation-Aware Coverage Criteria for LLM Safety Testing

    cs.SE 2026-02 unverdicted novelty 7.0

    RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.

  8. Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems

    cs.SE 2025-11 unverdicted novelty 7.0

    Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.

  9. Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software

    cs.SE 2025-10 unverdicted novelty 7.0

    LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.

  10. SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution

    cs.SE 2025-02 unverdicted novelty 7.0

    SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.

  11. EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair

    cs.SE 2026-05 unverdicted novelty 6.0

    EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.

  12. CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction

    cs.SE 2026-04 unverdicted novelty 6.0

    CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.

  13. HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer

    cs.SE 2026-04 unverdicted novelty 6.0

    HELO-APR improves LLM-based automatic program repair in low-resource languages by synthesizing cross-lingual training data and using curriculum learning, raising Pass@1 from 31.32% to 48.65% on DeepSeek-Coder for Ruby...

  14. Program Analysis Guided LLM Agent for Proof-of-Concept Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    PAGENT integrates static and dynamic program analysis guidance with an LLM agent to improve automated proof-of-concept generation success by 132% over prior agentic methods.

  15. Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints

    cs.SE 2026-04 unverdicted novelty 6.0

    Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.

  16. LinkAnchor: An Autonomous LLM-Based Agent for Issue-to-Commit Link Recovery

    cs.SE 2025-08 unverdicted novelty 6.0

    LinkAnchor is an LLM-based autonomous agent with lazy-access architecture for recovering issue-to-commit links in software repositories.

  17. EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair

    cs.SE 2025-06 conditional novelty 6.0

    ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.

  18. Agentless: Demystifying LLM-based Software Engineering Agents

    cs.SE 2024-07 conditional novelty 6.0

    Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.

  19. ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction

    cs.CR 2025-06 unverdicted novelty 5.0

    ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.

  20. LLM-Based Automated Diagnosis Of Integration Test Failures At Google

    cs.SE 2026-04 unverdicted novelty 4.0

    Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.

  21. Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++

    cs.SE 2025-08 unverdicted novelty 4.0

    LLMs perform well on basic syntactic and semantic bugs in small code but struggle with complex security vulnerabilities and large production codebases.

Reference graph

Works this paper leans on

74 extracted references · 74 canonical work pages · cited by 20 Pith papers · 6 internal anchors

  1. [1]

    Automated program repair,

    C. Le Goues, M. Pradel, and A. Roychoudhury, “Automated program repair,” Commun. ACM , vol. 62, no. 12, pp. 56–65,

  2. [2]

    Available: https://doi.org/10.1145/3318162

    [Online]. Available: https://doi.org/10.1145/3318162

  3. [3]

    Genprog: A generic method for automatic software repair,

    C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “Genprog: A generic method for automatic software repair,” IEEE Trans. Software Eng., vol. 38, no. 1, pp. 54–72, 2012

  4. [4]

    Tbar: revisiting template-based automated program repair,

    K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyand ´e, “Tbar: revisiting template-based automated program repair,” in ISSTA. ACM, 2019, pp. 31–42. [Online]. Available: https://doi.org/10.1145/3293882.3330577

  5. [5]

    Automatic patch gen- eration learned from human-written patches

    D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch gen- eration learned from human-written patches.” in International Conference on Software Engineering (ICSE), 2013, pp. 802–811

  6. [6]

    Getafix: Learning to fix bugs automatically,

    J. Bader, A. Scott, M. Pradel, and S. Chandra, “Getafix: Learning to fix bugs automatically,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 159:1–159:27, 2019. [Online]. Available: https://doi.org/10.1145/3360585

  7. [7]

    Phoenix: automated data-driven synthesis of repairs for static analysis violations,

    R. Bavishi, H. Yoshida, and M. R. Prasad, “Phoenix: automated data-driven synthesis of repairs for static analysis violations,” in ESEC/FSE, 2019, pp. 613–624. [Online]. Available: https://doi.org/10.1145/3338906.3338952

  8. [8]

    Semfix: program repair via semantic analysis,

    H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra, “Semfix: program repair via semantic analysis,” in 35th Inter- national Conference on Software Engineering, ICSE ’13, San Francisco, CA, USA, May 18-26, 2013 , 2013, pp. 772–781

  9. [9]

    Nopol: Automatic repair of conditional statement bugs in java programs,

    J. Xuan, M. Martinez, F. Demarco, M. Clement, S. L. Marcote, T. Durieux, D. Le Berre, and M. Monperrus, “Nopol: Automatic repair of conditional statement bugs in java programs,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 34–55, 2016

  10. [10]

    Angelix: Scalable multiline program patch synthesis via symbolic analysis,

    S. Mechtaev, J. Yi, and A. Roychoudhury, “Angelix: Scalable multiline program patch synthesis via symbolic analysis,” in Proceedings of the 38th international conference on software engineering, 2016, pp. 691–701

  11. [11]

    Automatic patch generation by learning correct code,

    F. Long and M. Rinard, “Automatic patch generation by learning correct code,” in POPL, 2016, pp. 298–312

  12. [12]

    Deepfix: Fixing common C language errors by deep learning,

    R. Gupta, S. Pal, A. Kanade, and S. K. Shevade, “Deepfix: Fixing common C language errors by deep learning,” in AAAI, 2017, pp. 1345–1351. [Online]. Available: http://aaai.org/ocs/ index.php/AAAI/AAAI17/paper/view/14603

  13. [13]

    On learning meaningful code changes via neural machine translation,

    M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk, “On learning meaningful code changes via neural machine translation,” in ICSE, 2019, pp. 25–36. [Online]. Available: https://dl.acm.org/citation.cfm?id=3339509

  14. [14]

    Differential regression testing for REST APIs

    T. Lutellier, H. V . Pham, L. Pang, Y . Li, M. Wei, and L. Tan, “Coconut: combining context-aware neural translation models using ensemble for program repair,” in ISSTA. ACM, 2020, pp. 101–114. [Online]. Available: https://doi.org/10.1145/3395363. 3397369

  15. [15]

    SequenceR: Sequence-to-sequence learning for end-to-end program repair,

    Z. Chen, S. Kommrusch, M. Tufano, L. Pouchet, D. Poshyvanyk, and M. Monperrus, “SequenceR: Sequence-to-sequence learning for end-to-end program repair,” IEEE Trans. Software Eng. , vol. 47, no. 9, pp. 1943–1959, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2940179

  16. [16]

    Dlfix: Context-based code transformation learning for automated program repair,

    Y . Li, S. Wang, and T. N. Nguyen, “Dlfix: Context-based code transformation learning for automated program repair,” in ICSE, 2020

  17. [17]

    A syntax-guided edit decoder for neural program repair,

    Q. Zhu, Z. Sun, Y . Xiao, W. Zhang, K. Yuan, Y . Xiong, and L. Zhang, “A syntax-guided edit decoder for neural program repair,” in ESEC/FSE. ACM, 2021, pp. 341–353. [Online]. Available: https://doi.org/10.1145/3468264.3468544

  18. [18]

    Automated program repair in the era of large pre-trained language models,

    C. S. Xia, Y . Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , 2023, pp. 1482–1494

  19. [19]

    Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan

    N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language models on automated program repair,” in ICSE, 2023, pp. 1430–1442. [Online]. Available: https://doi.org/10. 1109/ICSE48619.2023.00125

  20. [20]

    Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,

    C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,” 2023

  21. [21]

    Explainable automated debugging via large language model-driven scientific debugging,

    S. Kang, B. Chen, S. Yoo, and J. Lou, “Explainable automated debugging via large language model-driven scientific debugging,” CoRR, vol. abs/2304.02195, 2023. [Online]. Available: https: //doi.org/10.48550/arXiv.2304.02195

  22. [22]

    Iter: Iterative neural repair for multi- location patches,

    H. Ye and M. Monperrus, “Iter: Iterative neural repair for multi- location patches,” in ICSE, 2024

  23. [23]

    An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks,

    A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung, “An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks,” IEEE Transactions on software engineering , vol. 32, no. 12, pp. 971– 987, 2006

  24. [24]

    Where is the bug and how is it fixed? an experiment with practitioners,

    M. B ¨ohme, E. O. Soremekun, S. Chattopadhyay, E. Ugherughe, and A. Zeller, “Where is the bug and how is it fixed? an experiment with practitioners,” inESEC/FSE, 2017, pp. 117–128

  25. [25]

    Defects4j: a database of existing faults to enable controlled testing studies for java programs,

    R. Just, D. Jalali, and M. D. Ernst, “Defects4j: a database of existing faults to enable controlled testing studies for java programs,” in ISSTA, 2014, pp. 437–440

  26. [26]

    Gitbug-java: A re- producible benchmark of recent java bugs,

    A. Silva, N. Saavedra, and M. Monperrus, “Gitbug-java: A re- producible benchmark of recent java bugs,” in Proceedings of the 21st International Conference on Mining Software Repositories

  27. [27]

    Sparks of Artificial General Intelligence: Early experiments with GPT-4

    S. Bubeck, V . Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y . T. Lee, Y . Li, S. M. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y . Zhang, “Sparks of artificial general intelligence: Early experiments with GPT- 4,” CoRR, vol. abs/2303.12712, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.12712

  28. [28]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” CoRR, vol. abs/2302.04761, 2023. [Online]. Available: https://doi.org/ 10.48550/arXiv.2302.04761

  29. [29]

    Gorilla: Large Language Model Connected with Massive APIs

    S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” CoRR, vol. abs/2305.15334, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.15334

  30. [30]

    A survey on large language model based autonomous agents,

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,” 2023

  31. [31]

    Augmented Language Models: a Survey

    G. Mialon, R. Dess `ı, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozi `ere, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y . LeCun, and T. Scialom, “Augmented language models: a survey,” CoRR, vol. abs/2302.07842, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.07842

  32. [32]

    How often do single- statement bugs occur?

    R.-M. Karampatsis and C. Sutton, “How often do single- statement bugs occur?” Jun. 2020. [Online]. Available: http: //dx.doi.org/10.1145/3379597.3387491

  33. [33]

    Chain-of-thought prompting elicits reason- ing in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reason- ing in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  34. [34]

    Code search: A survey of techniques for finding code,

    L. D. Grazia and M. Pradel, “Code search: A survey of techniques for finding code,” ACM Comput. Surv. , vol. 55, no. 11, pp. 220:1–220:31, 2023. [Online]. Available: https: //doi.org/10.1145/3565971

  35. [35]

    De-hallucinator: Mitigating llm hallucinations in code generation tasks via iterative grounding.arXiv preprint arXiv:2401.01701,

    A. Eghbali and M. Pradel, “De-hallucinator: Iterative grounding for llm-based code completion,” CoRR, vol. abs/2401.01701,

  36. [36]

    Available: https://doi.org/10.48550/arXiv.2401

    [Online]. Available: https://doi.org/10.48550/arXiv.2401. 01701

  37. [37]

    Evaluating Large Language Models Trained on Code

    M. Chen et al. , “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374

  38. [38]

    Gzoltar: an eclipse plug-in for testing and debugging,

    J. Campos, A. Riboira, A. Perez, and R. Abreu, “Gzoltar: an eclipse plug-in for testing and debugging,” in ASE, 2012, pp. 378–381

  39. [39]

    Zeller, Why programs fail: a guide to systematic debugging

    A. Zeller, Why programs fail: a guide to systematic debugging . Elsevier, 2009

  40. [40]

    Selfapr: Self-supervised program repair with test execution diagnostics,

    H. Ye, M. Martinez, X. Luo, T. Zhang, and M. Monperrus, “Selfapr: Self-supervised program repair with test execution diagnostics,” in ASE, 2022, pp. 92:1–92:13. [Online]. Available: https://doi.org/10.1145/3551349.3556926

  41. [41]

    History driven program repair,

    X. D. Le, D. Lo, and C. Le Goues, “History driven program repair,” in SANER, 2016, pp. 213–224. [Online]. Available: https://doi.org/10.1109/SANER.2016.76

  42. [42]

    Repairing programs with semantic code search (t),

    Y . Ke, K. T. Stolee, C. Le Goues, and Y . Brun, “Repairing programs with semantic code search (t),” in ASE. IEEE, 2015, pp. 295–306

  43. [43]

    Static automated program repair for heap properties,

    R. van Tonder and C. L. Goues, “Static automated program repair for heap properties,” in ICSE, 2018, pp. 151–162. [Online]. Available: https://doi.org/10.1145/3180155.3180250

  44. [44]

    Program repair guided by datalog-defined static analysis,

    Y . Liu, S. Mechtaev, P. Suboti´c, and A. Roychoudhury, “Program repair guided by datalog-defined static analysis,” in ESEC/FSE, 2023, pp. 1216–1228

  45. [45]

    Staticfixer: From static analysis to static repair,

    N. Jain, S. Gandhi, A. Sonwane, A. Kanade, N. Natarajan, S. Parthasarathy, S. Rajamani, and R. Sharma, “Staticfixer: From static analysis to static repair,” 2023

  46. [46]

    Transplantfix: Graph differencing-based code transplantation for automated program repair,

    D. Yang, X. Mao, L. Chen, X. Xu, Y. Lei, D. Lo, and J. He, “Transplantfix: Graph differencing-based code transplantation for automated program repair,” in ASE, 2022, pp. 107:1–107:13. [Online]. Available: https://doi.org/10.1145/3551349.3556893

  47. [47]

    Sapfix: Automated end-to-end repair at scale,

    A. Marginean, J. Bader, S. Chandra, M. Harman, Y. Jia, K. Mao, A. Mols, and A. Scott, “Sapfix: Automated end-to-end repair at scale,” in ICSE-SEIP, 2019

  48. [48]

    Search, align, and repair: data-driven feedback generation for introductory programming exercises,

    K. Wang, R. Singh, and Z. Su, “Search, align, and repair: data-driven feedback generation for introductory programming exercises,” in PLDI, 2018, pp. 481–495

  49. [49]

    Deep reinforcement learning for syntactic error repair in student programs,

    R. Gupta, A. Kanade, and S. K. Shevade, “Deep reinforcement learning for syntactic error repair in student programs,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligen...

  50. [50]

    AAAI Press, 2019, pp. 930–937. [Online]. Available: https://doi.org/10.1609/aaai.v33i01.3301930

  51. [51]

    Seq2parse: neurosymbolic parse error repair,

    G. Sakkas, M. Endres, P. J. Guo, W. Weimer, and R. Jhala, “Seq2parse: neurosymbolic parse error repair,” Proc. ACM Program. Lang., vol. 6, no. OOPSLA2, pp. 1180–1206, 2022. [Online]. Available: https://doi.org/10.1145/3563330

  52. [52]

    Pinpointing and repairing performance bottlenecks in concurrent programs,

    T. Yu and M. Pradel, “Pinpointing and repairing performance bottlenecks in concurrent programs,” Empirical Software Engineering (EMSE), pp. 1–38, 2017

  53. [53]

    Learning to repair software vulnerabilities with generative adversarial networks,

    J. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y. Kim, and S. P. Chin, “Learning to repair software vulnerabilities with generative adversarial networks,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, 2018,...

  54. [54]

    Pyty: Repairing static type errors in python,

    Y. W. Chow, L. D. Grazia, and M. Pradel, “Pyty: Repairing static type errors in python,” in International Conference on Software Engineering (ICSE), 2024

  55. [55]

    AUTOTRAINER: an automatic DNN training problem detection and repair system,

    X. Zhang, J. Zhai, S. Ma, and C. Shen, “AUTOTRAINER: an automatic DNN training problem detection and repair system,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021. IEEE, 2021, pp. 359–371. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00043

  56. [56]

    Learning to fix build errors with graph2diff neural networks,

    D. Tarlow, S. Moitra, A. Rice, Z. Chen, P. Manzagol, C. Sutton, and E. Aftandilian, “Learning to fix build errors with graph2diff neural networks,” in ICSE ’20: 42nd International Conference on Software Engineering, Workshops, Seoul, Republic of Korea, 27 June - 19 July, 2020. ACM, 2020, pp. 19–20. [Online]. Available: https://doi.org/10.1145/3387940.3392181

  57. [57]

    Neural program repair by jointly learning to localize and repair,

    M. Vasic, A. Kanade, P. Maniatis, D. Bieber, and R. Singh, “Neural program repair by jointly learning to localize and repair,” in ICLR, 2019

  58. [58]

    Neural program repair with execution-based backpropagation,

    H. Ye, M. Martinez, and M. Monperrus, “Neural program repair with execution-based backpropagation,” in ICSE, 2022

  59. [59]

    A survey of learning-based automated program repair,

    Q. Zhang, C. Fang, Y. Ma, W. Sun, and Z. Chen, “A survey of learning-based automated program repair,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 2, pp. 1–69, 2023

  60. [60]

    Repair is nearly generation: Multilingual program repair with llms,

    H. Joshi, J. P. C. Sánchez, S. Gulwani, V. Le, G. Verbruggen, and I. Radicek, “Repair is nearly generation: Multilingual program repair with llms,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advance...

  61. [61]

    Cigar: Cost-efficient program repair with llms,

    D. Hidvégi, K. Etemadi, S. Bobadilla, and M. Monperrus, “Cigar: Cost-efficient program repair with llms,” arXiv preprint arXiv:2402.06598, 2024

  62. [62]

    Repository-level prompt generation for large language models of code,

    D. Shrivastava, H. Larochelle, and D. Tarlow, “Repository-level prompt generation for large language models of code,” in International Conference on Machine Learning. PMLR, 2023, pp. 31693–31715

  63. [63]

    Fuzz4all: Universal fuzzing with large language models,

    C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang, “Fuzz4all: Universal fuzzing with large language models,” in ICSE, 2024

  64. [64]

    Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,

    C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in 45th International Conference on Software Engineering, ser. ICSE, 2023

  65. [65]

    An empirical evaluation of using large language models for automated unit test generation,

    M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024. [Online]. Available: https://doi.org/10.1109/TSE.2023.3334955

  66. [66]

    Code-aware prompting: A study of coverage guided test generation in regression setting using llm,

    G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, “Code-aware prompting: A study of coverage guided test generation in regression setting using llm,” in FSE, 2024

  67. [67]

    Automated unit test improvement using large language models at meta

    N. Alshahwan, J. Chheda, A. Finegenova, B. Gokkaya, M. Harman, I. Harper, A. Marginean, S. Sengupta, and E. Wang, “Automated unit test improvement using large language models at meta,” in FSE, vol. abs/2402.09171, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.09171

  68. [68]

    Large language models are few-shot testers: Exploring llm-based general bug reproduction,

    S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring llm-based general bug reproduction,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE, 2023, pp. 2312–2323. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00194

  69. [69]

    Prompting is all your need: Automated android bug replay with large language models,

    S. Feng and C. Chen, “Prompting is all your need: Automated android bug replay with large language models,” in ICSE, 2024

  70. [70]

    Codeplan: Repository-level coding using llms and planning,

    R. Bairi, A. Sonwane, A. Kanade, V. D. C, A. Iyer, S. Parthasarathy, S. Rajamani, B. Ashok, and S. Shet, “Codeplan: Repository-level coding using llms and planning,” 2023

  71. [71]

    An in-context learning agent for formal theorem-proving,

    A. Thakur, G. Tsoukalas, Y. Wen, J. Xin, and S. Chaudhuri, “An in-context learning agent for formal theorem-proving,” 2024. [Online]. Available: https://arxiv.org/abs/2310.04353

  72. [72]

    Autocoderover: Autonomous program improvement,

    Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “Autocoderover: Autonomous program improvement,” 2024

  73. [73]

    Swe-agent: Agent-computer interfaces enable automated software engineering,

    J. Yang, C. E. Jimenez, K. Lieret, S. Yao, A. Wettig, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” 2024

  74. [74]

    PAL: Program-aided Language Models

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” CoRR, vol. abs/2211.10435, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.10435