RepairAgent: An Autonomous, LLM-Based Agent for Program Repair
Pith reviewed 2026-05-19 10:16 UTC · model grok-4.3
pith:UL3V7M5G
The pith
A large language model can autonomously repair software bugs by deciding on tools and actions via a state machine and dynamic prompt.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
RepairAgent is the first autonomous LLM-based agent for program repair. Unlike prior deep learning approaches that use a fixed prompt or fixed feedback loop, it treats the LLM as an agent that plans and executes actions by invoking tools. The agent interleaves gathering bug information, collecting repair ingredients, and validating fixes, choosing the next tool based on gathered data and feedback from earlier attempts. This is supported by a set of program repair tools, a dynamically updated prompt, and a finite state machine that guides tool use.
What carries the argument
The finite state machine that guides tool invocations together with a dynamically updated prompt allowing the LLM to interact with repair tools and respond to feedback.
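As a rough illustration of how a finite state machine can constrain tool choice while the prompt is rebuilt each cycle, the sketch below maps each state to the tools the agent may call and the state each tool leads to. The state names, tool names, transitions, and prompt layout are hypothetical placeholders chosen to mirror the three activities named in the core claim (gathering bug information, collecting repair ingredients, validating fixes); they are not taken from the paper.

```python
from enum import Enum, auto


class State(Enum):
    # Hypothetical states mirroring the three activities in the core claim.
    UNDERSTAND_BUG = auto()
    COLLECT_INGREDIENTS = auto()
    TRY_FIXES = auto()
    DONE = auto()


# Hypothetical tool names: for each state, the tools allowed there and the
# state the agent moves to after calling each one.
TRANSITIONS = {
    State.UNDERSTAND_BUG: {
        "read_failing_test": State.UNDERSTAND_BUG,
        "state_hypothesis": State.COLLECT_INGREDIENTS,
    },
    State.COLLECT_INGREDIENTS: {
        "search_code": State.COLLECT_INGREDIENTS,
        "propose_fix": State.TRY_FIXES,
    },
    State.TRY_FIXES: {
        "run_tests": State.TRY_FIXES,
        "discard_hypothesis": State.UNDERSTAND_BUG,
        "report_success": State.DONE,
    },
    State.DONE: {},
}


def allowed_tools(state):
    """Tools the prompt would offer the LLM in the given state."""
    return sorted(TRANSITIONS[state])


def step(state, tool):
    """Apply one tool call; reject tools the current state does not allow."""
    if tool not in TRANSITIONS[state]:
        raise ValueError(f"{tool!r} is not available in state {state.name}")
    return TRANSITIONS[state][tool]


def build_prompt(state, history):
    """Rebuild the prompt from the current state and earlier tool results."""
    lines = [
        f"Current state: {state.name}",
        "Available tools: " + ", ".join(allowed_tools(state)),
        "Previous tool calls:",
    ]
    lines += [f"- {tool}: {result}" for tool, result in history]
    return "\n".join(lines)
```

In this sketch the state machine does not pick the tool; it only narrows the menu the LLM sees, which is one plausible reading of "guides tool use".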
If this is right
- Automated repair can interleave information gathering, ingredient collection, and validation in a flexible order chosen by the agent.
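- The approach repairs bugs that prior techniques, including fixed-prompt methods, miss: 39 of its 164 Defects4J fixes were not produced by earlier work.
- The cost remains modest, averaging 270,000 tokens, or 14 cents (USD), per bug under the GPT-3.5 pricing in effect when the paper was written.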
- Agent designs with state guidance and tool sets open the door to similar autonomous methods across software engineering tasks.
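The cost figure implies a blended price per token. The arithmetic below derives it from the two reported numbers (270,000 tokens and 14 cents per bug). The resulting rate is an implied average over input and output tokens, not an official price, and the helper names are illustrative.

```python
TOKENS_PER_BUG = 270_000   # average tokens per bug, as reported
COST_PER_BUG_USD = 0.14    # 14 cents per bug, as reported


def implied_usd_per_million_tokens(tokens, cost_usd):
    """Blended price implied by a token count and its total cost."""
    return cost_usd / tokens * 1_000_000


def estimated_cost_usd(num_bugs, cost_per_bug_usd=COST_PER_BUG_USD):
    """Scale the per-bug average to a number of attempted bugs."""
    return num_bugs * cost_per_bug_usd


rate = implied_usd_per_million_tokens(TOKENS_PER_BUG, COST_PER_BUG_USD)
print(f"{rate:.2f} USD per million tokens")  # prints "0.52 USD per million tokens"
```

Because the average is taken per attempted bug, any total should be scaled by the number of bugs attempted rather than the number fixed.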
Where Pith is reading between the lines
- Stronger future language models could raise the repair rate by better following state guidance and avoiding loops.
- Adding more program analysis tools to the agent's set might help with bugs that current tools cannot address.
- The same pattern of dynamic prompts plus state machine could be tested on related tasks such as test-case generation.
Load-bearing premise
The large language model will reliably select and invoke the right tools, interpret tool feedback, and avoid unproductive loops when guided only by the finite state machine and dynamically updated prompt.
What would settle it
Testing the agent on a fresh collection of bugs and finding that it often picks unsuitable tools or enters repeated unproductive cycles without producing fixes would show the claim does not hold.
Original abstract
Automated program repair has emerged as a powerful technique to mitigate the impact of software bugs on system reliability and user experience. This paper introduces RepairAgent, the first work to address the program repair challenge through an autonomous agent based on a large language model (LLM). Unlike existing deep learning-based approaches, which prompt a model with a fixed prompt or in a fixed feedback loop, our work treats the LLM as an agent capable of autonomously planning and executing actions to fix bugs by invoking suitable tools. RepairAgent freely interleaves gathering information about the bug, gathering repair ingredients, and validating fixes, while deciding which tools to invoke based on the gathered information and feedback from previous fix attempts. Key contributions that enable RepairAgent include a set of tools that are useful for program repair, a dynamically updated prompt format that allows the LLM to interact with these tools, and a finite state machine that guides the agent in invoking the tools. Our evaluation on the popular Defects4J dataset demonstrates RepairAgent's effectiveness in autonomously repairing 164 bugs, including 39 bugs not fixed by prior techniques. Interacting with the LLM imposes an average cost of 270,000 tokens per bug, which, under the current pricing of OpenAI's GPT-3.5 model, translates to 14 cents of USD per bug. To the best of our knowledge, this work is the first to present an autonomous, LLM-based agent for program repair, paving the way for future agent-based techniques in software engineering.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces RepairAgent, the first autonomous LLM-based agent for program repair. Unlike prior DL approaches that use fixed prompts or loops, RepairAgent equips an LLM with repair-specific tools (for information gathering, ingredient collection, and validation), a dynamically updated prompt, and a finite state machine to guide tool invocation and interleaving of actions. The central empirical claim is that RepairAgent autonomously repairs 164 Defects4J bugs, including 39 not fixed by prior techniques, at an average cost of 270,000 tokens (about 14 cents USD with GPT-3.5) per bug.
Significance. If the autonomy premise holds, the result is significant as the first concrete demonstration of an agentic, tool-using LLM approach to automated program repair on a standard benchmark. It moves beyond static prompting, quantifies practical costs, and explicitly opens a new direction for agent-based techniques in software engineering. The use of Defects4J enables direct comparison with existing APR literature.
major comments (3)
- [§5] §5 (Evaluation) and abstract: The headline result of 164 autonomously repaired bugs (including 39 novel) provides no breakdown of the number of independent runs performed, statistical significance tests, or the fraction of repairs that completed without external restarts, human overrides, or post-hoc filtering. This detail is load-bearing for the claim that the FSM and dynamic prompt suffice to keep the LLM from unproductive loops or tool misuse.
- [§3] §3 (Agent Architecture): The finite state machine is presented as guiding tool selection and termination, yet the transition rules for handling ambiguous tool outputs, LLM misparsing of observations, or failed repair paths are not exhaustively specified. Without these, it is unclear how the architecture guarantees termination rather than non-terminating behavior on the stochastic LLM.
- [§5.2] §5.2 (Comparison to baselines): The count of 39 bugs not fixed by prior techniques requires an explicit enumeration of the exact prior tools/techniques, their search budgets, and model configurations used in the comparison. Differences in experimental protocol could inflate the novelty claim.
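The third major comment turns on how "not fixed by prior techniques" is counted. One way to make such a count reproducible is a set difference between the agent's fixed bugs and the union of the bugs fixed by each enumerated baseline. The sketch below shows that computation; the baseline names and bug identifiers are made up for illustration and are not results from the paper.

```python
def novel_fixes(agent_fixed, baselines):
    """Return the bugs fixed by the agent that no listed baseline fixed."""
    fixed_by_any_baseline = set().union(*baselines.values())
    return set(agent_fixed) - fixed_by_any_baseline


# Hypothetical data: bug identifiers and baseline names are placeholders.
agent_fixed = {"bug-1", "bug-2", "bug-3", "bug-4"}
baselines = {
    "baseline_a": {"bug-1", "bug-3"},
    "baseline_b": {"bug-3", "bug-9"},
}

print(sorted(novel_fixes(agent_fixed, baselines)))  # prints ['bug-2', 'bug-4']
```

The count depends directly on which baselines appear in the dictionary, which is why the comment asks for an explicit enumeration.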
minor comments (3)
- [Figure 2] Figure 2 (state machine diagram): The labels on transitions are difficult to read at standard print size; enlarging or adding a textual legend would improve clarity.
- [Related Work] Related work section: While the paper positions itself as the first autonomous agent for APR, a brief discussion of contemporaneous LLM-agent work in other SE tasks (e.g., test generation) would better contextualize novelty.
- [Table 1] Table 1 (bug repair counts): Include standard deviation or range across runs for the token-cost column to accompany the reported average of 270,000 tokens.
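The last minor comment asks for a measure of spread alongside the average token cost. A minimal sketch of such a summary, using Python's standard `statistics` module, follows; the per-bug token counts are hypothetical and only show the shape of the computation.

```python
import statistics


def summarize_token_costs(tokens_per_bug):
    """Mean, sample standard deviation, and range of per-bug token counts."""
    return {
        "mean": statistics.mean(tokens_per_bug),
        "stdev": statistics.stdev(tokens_per_bug),
        "min": min(tokens_per_bug),
        "max": max(tokens_per_bug),
    }


# Hypothetical per-bug token counts, not data from the paper.
summary = summarize_token_costs([100_000, 200_000, 300_000, 400_000])
print(summary)
```

`statistics.stdev` computes the sample standard deviation and needs at least two data points.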
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We have addressed each major comment below and will incorporate revisions to improve clarity and rigor where appropriate.
- Referee: [§5] §5 (Evaluation) and abstract: The headline result of 164 autonomously repaired bugs (including 39 novel) provides no breakdown of the number of independent runs performed, statistical significance tests, or the fraction of repairs that completed without external restarts, human overrides, or post-hoc filtering. This detail is load-bearing for the claim that the FSM and dynamic prompt suffice to keep the LLM from unproductive loops or tool misuse.
  Authors: We agree that additional experimental details are necessary to fully substantiate the autonomy claims. In the revised manuscript, we will expand §5 to specify that each bug was evaluated in a single autonomous run with no external restarts, human overrides, or post-hoc filtering applied. We will also note the absence of multiple independent runs per bug (as the agent is designed for one-shot autonomous repair) and include any relevant observations on termination behavior. This directly supports the role of the FSM and dynamic prompts in avoiding unproductive loops.
  Revision: yes
- Referee: [§3] §3 (Agent Architecture): The finite state machine is presented as guiding tool selection and termination, yet the transition rules for handling ambiguous tool outputs, LLM misparsing of observations, or failed repair paths are not exhaustively specified. Without these, it is unclear how the architecture guarantees termination rather than non-terminating behavior on the stochastic LLM.
  Authors: The core FSM states and transitions are outlined in §3 to enforce progress toward repair or termination after a bounded number of steps. We acknowledge that more exhaustive coverage of edge cases would strengthen the presentation. In the revision, we will add explicit transition rules for ambiguous tool outputs (defaulting to an information-gathering state with updated prompt), LLM misparsing (retry with error feedback), and failed paths (transition to termination after maximum attempts), thereby clarifying the safeguards against non-termination.
  Revision: yes
- Referee: [§5.2] §5.2 (Comparison to baselines): The count of 39 bugs not fixed by prior techniques requires an explicit enumeration of the exact prior tools/techniques, their search budgets, and model configurations used in the comparison. Differences in experimental protocol could inflate the novelty claim.
  Authors: We will revise §5.2 to include a comprehensive table enumerating all prior techniques used in the comparison, along with their reported search budgets, model configurations, and whether results were taken from original publications or reproduced under our protocol. This will provide full transparency and allow readers to assess the novelty of the 39 fixes under comparable conditions.
  Revision: yes
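The safeguards named in the response on §3 (retry with error feedback after a misparse, fall back to information gathering after an ambiguous tool output, terminate after a maximum number of attempts) can be sketched as a bounded loop. Everything below is an assumption made for illustration: the JSON reply format, the state names, the outcome strings, and the `MAX_CYCLES` value are hypothetical and do not come from the paper.

```python
import json

MAX_CYCLES = 5  # hypothetical budget; the paper's actual bound is not stated here


def parse_action(raw_reply):
    """Parse the LLM reply as a JSON object with a 'tool' field, else None."""
    try:
        action = json.loads(raw_reply)
    except json.JSONDecodeError:
        return None
    if not isinstance(action, dict) or "tool" not in action:
        return None
    return action


def run_agent(llm, run_tool, max_cycles=MAX_CYCLES):
    """Drive the agent for at most max_cycles LLM calls, then stop."""
    state = "gather_info"
    feedback = ""
    for cycle in range(max_cycles):
        action = parse_action(llm(state, feedback))
        if action is None:
            # Misparsed reply: keep the state and retry with error feedback.
            feedback = "Your last reply was not valid JSON with a 'tool' field."
            continue
        outcome = run_tool(action["tool"])
        if outcome == "fixed":
            return "repaired", cycle + 1
        if outcome == "ambiguous":
            # Ambiguous tool output: fall back to information gathering.
            state, feedback = "gather_info", "Tool output was ambiguous."
        else:
            state, feedback = "try_fixes", f"Tool reported: {outcome}"
    # Budget exhausted: terminate instead of looping indefinitely.
    return "gave_up", max_cycles
```

Termination here follows from the `for` loop alone: every branch, including the retry, consumes one cycle, so a misbehaving LLM cannot extend the run.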
Circularity Check
Empirical system evaluation with no circular derivation chain
The paper presents an agent architecture (tools, dynamic prompt, finite state machine) and reports direct empirical counts of bugs repaired on the external Defects4J benchmark. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations are described that would reduce the 164-bug result to a tautology or construction. The evaluation is a measurement of system behavior rather than a derivation that collapses to its own premises.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Defects4J constitutes a representative and sufficiently challenging benchmark for evaluating program repair techniques.
Forward citations
Cited by 22 Pith papers
- Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
  PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
- HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
  HEJ-Robust benchmark shows LLM-based program repair models drop over 50% in accuracy when buggy code is rewritten with equivalent syntax.
- HEJ-Robust: A Robustness Benchmark for LLM-Based Automated Program Repair
  LLM-based Java program repair models lose over 50% of their bug-fixing success rate when presented with equivalent but syntactically varied buggy code.
- Empowering Autonomous Debugging Agents with Efficient Dynamic Analysis
  ADI equips AI debugging agents with function-level interaction via a new execution trace structure, raising SWE-bench Verified resolution to 63.8% at $1.28 per task and delivering 6-18% gains when added to existing agents.
- DebugRepair: Enhancing LLM-Based Automated Program Repair via Self-Directed Debugging
  DebugRepair improves LLM-based automated program repair by adding test semantic purification, simulated instrumentation, and debugging-driven conversational repair, fixing 224 Defects4J bugs with GPT-3.5 (26.2% above ...
- ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
  ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 11...
- RACC: Representation-Aware Coverage Criteria for LLM Safety Testing
  RACC defines six representation-aware coverage criteria that score jailbreak test suites by measuring activation of safety concepts extracted from LLM hidden states on a calibration set.
- Can Language Models Go Beyond Coding? Assessing the Capability of Language Models to Build Real-World Systems
  Build-bench is the first architecture-aware benchmark that evaluates LLMs on repairing cross-ISA build failures via iterative tool-augmented reasoning, with the best model reaching 63.19% success.
- Do AI Models Dream of Faster Code? An Empirical Study on LLM-Proposed Performance Improvements in Real-World Software
  LLMs propose volatile performance improvements on real-world Java tasks that lag human developers on average, showing algorithmic benchmarks overestimate capabilities.
- SWE-RL: Advancing LLM Reasoning via Reinforcement Learning on Open Software Evolution
  SWE-RL uses RL on software evolution data to train LLMs achieving 41% on SWE-bench Verified with generalization to other reasoning tasks.
- MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair
  MemRepair is a hierarchical memory-augmented agent framework that raises repository-level vulnerability repair rates to 58.0-58.2% on Python/Go/JS benchmarks and 30.58% on C++ by combining history, pattern, and refine...
- EvidenT: An Evidence-Preserving Framework for Iterative System-Level Package Repair
  EvidenT repairs 53.88% of real-world RISC-V system-level package build failures by preserving repair history and build artifacts in a closed-loop validation system, outperforming baselines by a wide margin.
- CoRE: A Fine-Grained Code Reasoning Benchmark Beyond Output Prediction
  CoRE benchmark shows frontier LLMs have large robustness gaps across equivalent code versions and often reach correct outputs via superficial execution without tracking intermediate states.
- HELO-APR: Enhancing Low-Resource Program Repair through Cross-Lingual Knowledge Transfer
  HELO-APR improves LLM-based automatic program repair in low-resource languages by synthesizing cross-lingual training data and using curriculum learning, raising Pass@1 from 31.32% to 48.65% on DeepSeek-Coder for Ruby...
- Program Analysis Guided LLM Agent for Proof-of-Concept Generation
  PAGENT integrates static and dynamic program analysis guidance with an LLM agent to improve automated proof-of-concept generation success by 132% over prior agentic methods.
- Beyond Fixed Tests: Repository-Level Issue Resolution as Coevolution of Code and Behavioral Constraints
  Agent-CoEvo is a multi-agent LLM framework that coevolves code patches and test patches to resolve repository-level issues, outperforming fixed-test baselines on SWE-bench Lite and SWT-bench Lite.
- LinkAnchor: An Autonomous LLM-Based Agent for Issue-to-Commit Link Recovery
  LinkAnchor is an LLM-based autonomous agent with lazy-access architecture for recovering issue-to-commit links in software repositories.
- EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair
  ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
- Agentless: Demystifying LLM-based Software Engineering Agents
  Agentless, a basic three-phase LLM pipeline for bug localization, repair, and validation, outperforms complex open-source agents on SWE-bench Lite with 32% success rate at $0.70 cost.
- ReGA: Model-Based Safeguard for LLMs via Representation-Guided Abstraction
  ReGA uses safety-critical representations to guide abstraction in model-based analysis, enabling scalable detection of harmful LLM inputs with reported AUROC of 0.975 at prompt level.
- LLM-Based Automated Diagnosis Of Integration Test Failures At Google
  Auto-Diagnose applies LLMs to summarize and diagnose root causes of integration test failures, reporting 90.14% accuracy on 71 manual cases and positive adoption after Google-wide rollout.
- Can LLMs Find Bugs in Code? An Evaluation from Beginner Errors to Security Vulnerabilities in Python and C++
  LLMs perform well on basic syntactic and semantic bugs in small code but struggle with complex security vulnerabilities and large production codebases.
Reference graph
Works this paper leans on
-
[1]
C. Le Goues, M. Pradel, and A. Roychoudhury, “Automated program repair,” Commun. ACM , vol. 62, no. 12, pp. 56–65,
-
[2]
Available: https://doi.org/10.1145/3318162
[Online]. Available: https://doi.org/10.1145/3318162
-
[3]
Genprog: A generic method for automatic software repair,
C. Le Goues, T. Nguyen, S. Forrest, and W. Weimer, “Genprog: A generic method for automatic software repair,” IEEE Trans. Software Eng., vol. 38, no. 1, pp. 54–72, 2012
work page 2012
-
[4]
Tbar: revisiting template-based automated program repair,
K. Liu, A. Koyuncu, D. Kim, and T. F. Bissyand ´e, “Tbar: revisiting template-based automated program repair,” in ISSTA. ACM, 2019, pp. 31–42. [Online]. Available: https://doi.org/10.1145/3293882.3330577
-
[5]
Automatic patch gen- eration learned from human-written patches
D. Kim, J. Nam, J. Song, and S. Kim, “Automatic patch gen- eration learned from human-written patches.” in International Conference on Software Engineering (ICSE), 2013, pp. 802–811
work page 2013
-
[6]
Getafix: Learning to fix bugs automatically,
J. Bader, A. Scott, M. Pradel, and S. Chandra, “Getafix: Learning to fix bugs automatically,” Proc. ACM Program. Lang., vol. 3, no. OOPSLA, pp. 159:1–159:27, 2019. [Online]. Available: https://doi.org/10.1145/3360585
-
[7]
Phoenix: automated data-driven synthesis of repairs for static analysis violations,
R. Bavishi, H. Yoshida, and M. R. Prasad, “Phoenix: automated data-driven synthesis of repairs for static analysis violations,” in ESEC/FSE, 2019, pp. 613–624. [Online]. Available: https://doi.org/10.1145/3338906.3338952
-
[8]
Semfix: program repair via semantic analysis,
H. D. T. Nguyen, D. Qi, A. Roychoudhury, and S. Chandra, “Semfix: program repair via semantic analysis,” in 35th Inter- national Conference on Software Engineering, ICSE ’13, San Francisco, CA, USA, May 18-26, 2013 , 2013, pp. 772–781
work page 2013
-
[9]
Nopol: Automatic repair of conditional statement bugs in java programs,
J. Xuan, M. Martinez, F. Demarco, M. Clement, S. L. Marcote, T. Durieux, D. Le Berre, and M. Monperrus, “Nopol: Automatic repair of conditional statement bugs in java programs,” IEEE Transactions on Software Engineering, vol. 43, no. 1, pp. 34–55, 2016
work page 2016
-
[10]
Angelix: Scalable multiline program patch synthesis via symbolic analysis,
S. Mechtaev, J. Yi, and A. Roychoudhury, “Angelix: Scalable multiline program patch synthesis via symbolic analysis,” in Proceedings of the 38th international conference on software engineering, 2016, pp. 691–701
work page 2016
-
[11]
Automatic patch generation by learning correct code,
F. Long and M. Rinard, “Automatic patch generation by learning correct code,” in POPL, 2016, pp. 298–312
work page 2016
-
[12]
Deepfix: Fixing common C language errors by deep learning,
R. Gupta, S. Pal, A. Kanade, and S. K. Shevade, “Deepfix: Fixing common C language errors by deep learning,” in AAAI, 2017, pp. 1345–1351. [Online]. Available: http://aaai.org/ocs/ index.php/AAAI/AAAI17/paper/view/14603
work page 2017
-
[13]
On learning meaningful code changes via neural machine translation,
M. Tufano, J. Pantiuchina, C. Watson, G. Bavota, and D. Poshyvanyk, “On learning meaningful code changes via neural machine translation,” in ICSE, 2019, pp. 25–36. [Online]. Available: https://dl.acm.org/citation.cfm?id=3339509
work page 2019
-
[14]
Differential regression testing for REST APIs
T. Lutellier, H. V . Pham, L. Pang, Y . Li, M. Wei, and L. Tan, “Coconut: combining context-aware neural translation models using ensemble for program repair,” in ISSTA. ACM, 2020, pp. 101–114. [Online]. Available: https://doi.org/10.1145/3395363. 3397369
-
[15]
SequenceR: Sequence-to-sequence learning for end-to-end program repair,
Z. Chen, S. Kommrusch, M. Tufano, L. Pouchet, D. Poshyvanyk, and M. Monperrus, “SequenceR: Sequence-to-sequence learning for end-to-end program repair,” IEEE Trans. Software Eng. , vol. 47, no. 9, pp. 1943–1959, 2021. [Online]. Available: https://doi.org/10.1109/TSE.2019.2940179
-
[16]
Dlfix: Context-based code transformation learning for automated program repair,
Y . Li, S. Wang, and T. N. Nguyen, “Dlfix: Context-based code transformation learning for automated program repair,” in ICSE, 2020
work page 2020
-
[17]
A syntax-guided edit decoder for neural program repair,
Q. Zhu, Z. Sun, Y . Xiao, W. Zhang, K. Yuan, Y . Xiong, and L. Zhang, “A syntax-guided edit decoder for neural program repair,” in ESEC/FSE. ACM, 2021, pp. 341–353. [Online]. Available: https://doi.org/10.1145/3468264.3468544
-
[18]
Automated program repair in the era of large pre-trained language models,
C. S. Xia, Y . Wei, and L. Zhang, “Automated program repair in the era of large pre-trained language models,” in2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE) , 2023, pp. 1482–1494
work page 2023
-
[19]
Zhiyu Fan, Xiang Gao, Martin Mirchev, Abhik Roychoudhury, and Shin Hwei Tan
N. Jiang, K. Liu, T. Lutellier, and L. Tan, “Impact of code language models on automated program repair,” in ICSE, 2023, pp. 1430–1442. [Online]. Available: https://doi.org/10. 1109/ICSE48619.2023.00125
-
[20]
Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,
C. S. Xia and L. Zhang, “Keep the conversation going: Fixing 162 out of 337 bugs for $0.42 each using ChatGPT,” 2023
work page 2023
-
[21]
Explainable automated debugging via large language model-driven scientific debugging,
S. Kang, B. Chen, S. Yoo, and J. Lou, “Explainable automated debugging via large language model-driven scientific debugging,” CoRR, vol. abs/2304.02195, 2023. [Online]. Available: https: //doi.org/10.48550/arXiv.2304.02195
-
[22]
Iter: Iterative neural repair for multi- location patches,
H. Ye and M. Monperrus, “Iter: Iterative neural repair for multi- location patches,” in ICSE, 2024
work page 2024
-
[23]
A. J. Ko, B. A. Myers, M. J. Coblenz, and H. H. Aung, “An exploratory study of how developers seek, relate, and collect relevant information during software maintenance tasks,” IEEE Transactions on software engineering , vol. 32, no. 12, pp. 971– 987, 2006
work page 2006
-
[24]
Where is the bug and how is it fixed? an experiment with practitioners,
M. B ¨ohme, E. O. Soremekun, S. Chattopadhyay, E. Ugherughe, and A. Zeller, “Where is the bug and how is it fixed? an experiment with practitioners,” inESEC/FSE, 2017, pp. 117–128
work page 2017
-
[25]
Defects4j: a database of existing faults to enable controlled testing studies for java programs,
R. Just, D. Jalali, and M. D. Ernst, “Defects4j: a database of existing faults to enable controlled testing studies for java programs,” in ISSTA, 2014, pp. 437–440
work page 2014
-
[26]
Gitbug-java: A re- producible benchmark of recent java bugs,
A. Silva, N. Saavedra, and M. Monperrus, “Gitbug-java: A re- producible benchmark of recent java bugs,” in Proceedings of the 21st International Conference on Mining Software Repositories
-
[27]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
S. Bubeck, V . Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y . T. Lee, Y . Li, S. M. Lundberg, H. Nori, H. Palangi, M. T. Ribeiro, and Y . Zhang, “Sparks of artificial general intelligence: Early experiments with GPT- 4,” CoRR, vol. abs/2303.12712, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2303.12712
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.12712 2023
-
[28]
Toolformer: Language Models Can Teach Themselves to Use Tools
T. Schick, J. Dwivedi-Yu, R. Dess `ı, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” CoRR, vol. abs/2302.04761, 2023. [Online]. Available: https://doi.org/ 10.48550/arXiv.2302.04761
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.04761 2023
-
[29]
Gorilla: Large Language Model Connected with Massive APIs
S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” CoRR, vol. abs/2305.15334, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2305.15334
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2305.15334 2023
-
[30]
A survey on large language model based autonomous agents,
L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y . Lin, W. X. Zhao, Z. Wei, and J.-R. Wen, “A survey on large language model based autonomous agents,” 2023
work page 2023
-
[31]
Augmented Language Models: a Survey
G. Mialon, R. Dess `ı, M. Lomeli, C. Nalmpantis, R. Pasunuru, R. Raileanu, B. Rozi `ere, T. Schick, J. Dwivedi-Yu, A. Celikyilmaz, E. Grave, Y . LeCun, and T. Scialom, “Augmented language models: a survey,” CoRR, vol. abs/2302.07842, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2302.07842
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2302.07842 2023
-
[32]
How often do single- statement bugs occur?
R.-M. Karampatsis and C. Sutton, “How often do single- statement bugs occur?” Jun. 2020. [Online]. Available: http: //dx.doi.org/10.1145/3379597.3387491
-
[33]
Chain-of-thought prompting elicits reason- ing in large language models,
J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhou et al. , “Chain-of-thought prompting elicits reason- ing in large language models,” Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022
work page 2022
-
[34]
Code search: A survey of techniques for finding code,
L. D. Grazia and M. Pradel, “Code search: A survey of techniques for finding code,” ACM Comput. Surv. , vol. 55, no. 11, pp. 220:1–220:31, 2023. [Online]. Available: https: //doi.org/10.1145/3565971
-
[35]
A. Eghbali and M. Pradel, “De-hallucinator: Iterative grounding for llm-based code completion,” CoRR, vol. abs/2401.01701,
-
[36]
Available: https://doi.org/10.48550/arXiv.2401
[Online]. Available: https://doi.org/10.48550/arXiv.2401. 01701
-
[37]
Evaluating Large Language Models Trained on Code
M. Chen et al. , “Evaluating large language models trained on code,” CoRR, vol. abs/2107.03374, 2021. [Online]. Available: https://arxiv.org/abs/2107.03374
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[38]
Gzoltar: an eclipse plug-in for testing and debugging,
J. Campos, A. Riboira, A. Perez, and R. Abreu, “Gzoltar: an eclipse plug-in for testing and debugging,” in ASE, 2012, pp. 378–381
work page 2012
-
[39]
Zeller, Why programs fail: a guide to systematic debugging
A. Zeller, Why programs fail: a guide to systematic debugging . Elsevier, 2009
work page 2009
-
[40]
Selfapr: Self-supervised program repair with test execution diagnostics,
H. Ye, M. Martinez, X. Luo, T. Zhang, and M. Monperrus, “Selfapr: Self-supervised program repair with test execution diagnostics,” in ASE, 2022, pp. 92:1–92:13. [Online]. Available: https://doi.org/10.1145/3551349.3556926
-
[41]
History driven program repair,
X. D. Le, D. Lo, and C. Le Goues, “History driven program repair,” in SANER, 2016, pp. 213–224. [Online]. Available: https://doi.org/10.1109/SANER.2016.76
-
[42]
Repairing programs with semantic code search (t),
Y . Ke, K. T. Stolee, C. Le Goues, and Y . Brun, “Repairing programs with semantic code search (t),” in ASE. IEEE, 2015, pp. 295–306
work page 2015
-
[43]
Static automated program repair for heap properties,
R. van Tonder and C. L. Goues, “Static automated program repair for heap properties,” in ICSE, 2018, pp. 151–162. [Online]. Available: https://doi.org/10.1145/3180155.3180250
-
[44]
Program repair guided by datalog-defined static analysis,
Y . Liu, S. Mechtaev, P. Suboti´c, and A. Roychoudhury, “Program repair guided by datalog-defined static analysis,” in ESEC/FSE, 2023, pp. 1216–1228
work page 2023
-
[45]
Staticfixer: From static analysis to static repair,
N. Jain, S. Gandhi, A. Sonwane, A. Kanade, N. Natarajan, S. Parthasarathy, S. Rajamani, and R. Sharma, “Staticfixer: From static analysis to static repair,” 2023
work page 2023
-
[46]
Transplantfix: Graph differencing-based code transplantation for automated program repair,
D. Yang, X. Mao, L. Chen, X. Xu, Y . Lei, D. Lo, and J. He, “Transplantfix: Graph differencing-based code transplantation for automated program repair,” in ASE, 2022, pp. 107:1–107:13. [Online]. Available: https://doi.org/10.1145/3551349.3556893
-
[47]
Sapfix: Automated end-to-end repair at scale,
A. Marginean, J. Bader, S. Chandra, M. Harman, Y . Jia, K. Mao, A. Mols, and A. Scott, “Sapfix: Automated end-to-end repair at scale,” in ICSE-SEIP, 2019
work page 2019
-
[48]
Search, align, and repair: data-driven feedback generation for introductory programming exercises,
K. Wang, R. Singh, and Z. Su, “Search, align, and repair: data-driven feedback generation for introductory programming exercises,” in PLDI, 2018, pp. 481–495
work page 2018
-
[49]
Deep reinforcement learning for syntactic error repair in student programs,
R. Gupta, A. Kanade, and S. K. Shevade, “Deep reinforcement learning for syntactic error repair in student programs,” in The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligen...
work page 2019
-
[50]
AAAI Press, 2019, pp. 930–937. [Online]. Available: https://doi.org/10.1609/aaai.v33i01.3301930
-
[51]
Seq2parse: neurosymbolic parse error repair,
G. Sakkas, M. Endres, P. J. Guo, W. Weimer, and R. Jhala, “Seq2parse: neurosymbolic parse error repair,” Proc. ACM Program. Lang., vol. 6, no. OOPSLA2, pp. 1180–1206, 2022. [Online]. Available: https://doi.org/10.1145/3563330
-
[52]
Pinpointing and repairing performance bottlenecks in concurrent programs,
T. Yu and M. Pradel, “Pinpointing and repairing performance bottlenecks in concurrent programs,” Empirical Software Engi- neering (EMSE), pp. 1–38, 2017
work page 2017
-
[53]
Learning to repair software vulnerabilities with generative adversarial networks,
J. Harer, O. Ozdemir, T. Lazovich, C. P. Reale, R. L. Russell, L. Y . Kim, and S. P. Chin, “Learning to repair software vulnerabilities with generative adversarial networks,” in Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montr ´eal, Canada. , 2018,...
work page 2018
-
[54]
Pyty: Repairing static type errors in python,
Y . W. Chow, L. D. Grazia, and M. Pradel, “Pyty: Repairing static type errors in python,” in International Conference on Software Engineering (ICSE), 2024
work page 2024
-
[55]
AUTOTRAINER: an automatic DNN training problem detection and repair system,
X. Zhang, J. Zhai, S. Ma, and C. Shen, “AUTOTRAINER: an automatic DNN training problem detection and repair system,” in 43rd IEEE/ACM International Conference on Software Engineering, ICSE 2021, Madrid, Spain, 22-30 May 2021 . IEEE, 2021, pp. 359–371. [Online]. Available: https://doi.org/10.1109/ICSE43902.2021.00043
-
[56]
Learning to fix build errors with graph2diff neural networks,
D. Tarlow, S. Moitra, A. Rice, Z. Chen, P. Manzagol, C. Sutton, and E. Aftandilian, “Learning to fix build errors with graph2diff neural networks,” in ICSE ’20: 42nd International Conference on Software Engineering, Workshops, Seoul, Republic of Korea, 27 June - 19 July, 2020. ACM, 2020, pp. 19–20. [Online]. Available: https://doi.org/10.1145/3387940.3392181
-
[57]
Neural program repair by jointly learning to localize and repair,
M. Vasic, A. Kanade, P. Maniatis, D. Bieber, and R. Singh, “Neural program repair by jointly learning to localize and repair,” in ICLR, 2019
-
[58]
Neural program repair with execution-based backpropagation,
H. Ye, M. Martinez, and M. Monperrus, “Neural program repair with execution-based backpropagation,” in ICSE, 2022
-
[59]
A survey of learning-based automated program repair,
Q. Zhang, C. Fang, Y. Ma, W. Sun, and Z. Chen, “A survey of learning-based automated program repair,” ACM Transactions on Software Engineering and Methodology, vol. 33, no. 2, pp. 1–69, 2023
-
[60]
Repair is nearly generation: Multilingual program repair with llms,
H. Joshi, J. P. C. Sánchez, S. Gulwani, V. Le, G. Verbruggen, and I. Radicek, “Repair is nearly generation: Multilingual program repair with llms,” in Thirty-Seventh AAAI Conference on Artificial Intelligence, AAAI 2023, Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence, IAAI 2023, Thirteenth Symposium on Educational Advance...
-
[61]
Cigar: Cost-efficient program repair with llms,
D. Hidvégi, K. Etemadi, S. Bobadilla, and M. Monperrus, “Cigar: Cost-efficient program repair with llms,” arXiv preprint arXiv:2402.06598, 2024
-
[62]
Repository-level prompt generation for large language models of code,
D. Shrivastava, H. Larochelle, and D. Tarlow, “Repository-level prompt generation for large language models of code,” in International Conference on Machine Learning. PMLR, 2023, pp. 31693–31715
-
[63]
Fuzz4all: Universal fuzzing with large language models,
C. S. Xia, M. Paltenghi, J. L. Tian, M. Pradel, and L. Zhang, “Fuzz4all: Universal fuzzing with large language models,” in ICSE, 2024
-
[64]
Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,
C. Lemieux, J. P. Inala, S. K. Lahiri, and S. Sen, “Codamosa: Escaping coverage plateaus in test generation with pre-trained large language models,” in 45th International Conference on Software Engineering, ser. ICSE, 2023
-
[65]
An empirical evaluation of using large language models for automated unit test generation,
M. Schäfer, S. Nadi, A. Eghbali, and F. Tip, “An empirical evaluation of using large language models for automated unit test generation,” IEEE Trans. Software Eng., vol. 50, no. 1, pp. 85–105, 2024. [Online]. Available: https://doi.org/10.1109/TSE.2023.3334955
-
[66]
Code-aware prompting: A study of coverage guided test generation in regression setting using llm,
G. Ryan, S. Jain, M. Shang, S. Wang, X. Ma, M. K. Ramanathan, and B. Ray, “Code-aware prompting: A study of coverage guided test generation in regression setting using llm,” in FSE, 2024
-
[67]
Automated unit test improvement using large language models at meta,
N. Alshahwan, J. Chheda, A. Finegenova, B. Gokkaya, M. Harman, I. Harper, A. Marginean, S. Sengupta, and E. Wang, “Automated unit test improvement using large language models at meta,” in FSE, vol. abs/2402.09171, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.09171
-
[68]
Large language models are few-shot testers: Exploring llm-based general bug reproduction,
S. Kang, J. Yoon, and S. Yoo, “Large language models are few-shot testers: Exploring llm-based general bug reproduction,” in 45th IEEE/ACM International Conference on Software Engineering, ICSE, 2023, pp. 2312–2323. [Online]. Available: https://doi.org/10.1109/ICSE48619.2023.00194
-
[69]
Prompting is all your need: Automated android bug replay with large language models,
S. Feng and C. Chen, “Prompting is all your need: Automated android bug replay with large language models,” in ICSE, 2024
-
[70]
Codeplan: Repository-level coding using llms and planning,
R. Bairi, A. Sonwane, A. Kanade, V. D. C, A. Iyer, S. Parthasarathy, S. Rajamani, B. Ashok, and S. Shet, “Codeplan: Repository-level coding using llms and planning,” 2023
-
[71]
An in-context learning agent for formal theorem-proving,
A. Thakur, G. Tsoukalas, Y. Wen, J. Xin, and S. Chaudhuri, “An in-context learning agent for formal theorem-proving,” 2024. [Online]. Available: https://arxiv.org/abs/2310.04353
-
[72]
Autocoderover: Autonomous program improvement,
Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury, “Autocoderover: Autonomous program improvement,” 2024
-
[73]
Swe-agent: Agent-computer interfaces enable automated software engineering,
J. Yang, C. E. Jimenez, K. Lieret, S. Yao, A. Wettig, K. Narasimhan, and O. Press, “Swe-agent: Agent-computer interfaces enable automated software engineering,” 2024
-
[74]
PAL: program-aided language models,
L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “PAL: program-aided language models,” CoRR, vol. abs/2211.10435, 2022. [Online]. Available: https://doi.org/10.48550/arXiv.2211.10435