
arxiv: 2508.00083 · v2 · pith:T3CR6JFH · submitted 2025-07-31 · cs.SE · cs.AI · cs.CL · cs.LG

A Survey on Code Generation with LLM-based Agents

Pith reviewed 2026-05-19 22:58 UTC · model grok-4.3

classification: cs.SE, cs.AI, cs.CL, cs.LG
keywords: LLM agents, code generation, software development lifecycle, single-agent architecture, multi-agent systems, evaluation benchmarks, tool integration, reliability

The pith

LLM-based code generation agents manage entire software projects autonomously from task breakdown through debugging and deployment.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how large language model agents are shifting software creation away from isolated code snippets toward full-lifecycle systems. It organizes the growing literature around single-agent and multi-agent designs while mapping their use across planning, implementation, testing, and maintenance. A sympathetic reader would care because the work frames a move from algorithmic novelty to practical concerns such as reliability, workflow control, and external tool use. The survey also collects benchmarks, metrics, and representative tools and closes by naming open challenges and long-term research directions.

Core claim

LLM-based code generation agents are defined by three distinguishing traits: autonomy that lets them oversee complete workflows without constant human direction, an expanded scope that reaches the full software development lifecycle rather than single functions or modules, and a practical engineering focus that stresses system reliability, process coordination, and integration with development tools over pure algorithmic advances.
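The autonomy trait can be made concrete with a small sketch. The Python below is an illustrative assumption, not code from the survey: `single_agent`, `stub_plan`, `stub_generate`, and `check_add` are hypothetical names, and the stubs stand in for calls to a real LLM. It shows one agent decomposing a task, generating code, running tests, and feeding the failure message back into the next attempt without human intervention.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentRun:
    """Record of what the agent did, kept for inspection."""
    subtasks: List[str] = field(default_factory=list)
    attempts: int = 0
    code: str = ""
    passed: bool = False


def run_tests(code: str, tests: Callable[[Dict], None]) -> str:
    """Run candidate code plus tests; return "" on success, else an error message."""
    namespace: Dict = {}
    try:
        exec(code, namespace)  # illustration only; a real system would sandbox this
        tests(namespace)
        return ""
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"


def single_agent(task: str,
                 plan: Callable[[str], List[str]],
                 generate: Callable[[List[str], str], str],
                 tests: Callable[[Dict], None],
                 max_attempts: int = 3) -> AgentRun:
    """Plan once, then loop: generate code, test it, reuse the error as feedback."""
    run = AgentRun(subtasks=plan(task))
    feedback = ""
    for _ in range(max_attempts):
        run.attempts += 1
        run.code = generate(run.subtasks, feedback)
        feedback = run_tests(run.code, tests)
        if not feedback:
            run.passed = True
            break
    return run


# Hypothetical stand-ins for LLM calls (assumptions for this sketch).
def stub_plan(task: str) -> List[str]:
    return ["define signature", "implement body", "verify result"]


def stub_generate(subtasks: List[str], feedback: str) -> str:
    # The first draft contains a bug; once feedback arrives the stub "repairs" it.
    if not feedback:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def check_add(namespace: Dict) -> None:
    assert namespace["add"](2, 3) == 5, "add(2, 3) should be 5"


result = single_agent("write add(a, b)", stub_plan, stub_generate, check_add)
print(result.attempts, result.passed)  # prints: 2 True
```

A multi-agent variant would, under the same assumptions, assign `plan`, `generate`, and `run_tests` to separate role-specific agents that exchange messages rather than sharing one loop.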

What carries the argument

The three core features of autonomy, expanded task scope across the software development lifecycle, and enhancement of engineering practicality, used to classify single-agent versus multi-agent architectures and to structure the review of applications, benchmarks, and tools.

If this is right

  • Agents are applied across every phase of the software development lifecycle rather than only code writing.
  • Research attention shifts from new generation algorithms toward reliability, process management, and tool integration.
  • Evaluation moves beyond isolated code correctness to end-to-end project success measured by new benchmarks and metrics.
  • Multi-agent systems allow specialized roles and collaboration to tackle larger, more complex development tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Successful maturation of these agents would likely change how human developers spend their time, moving emphasis from routine coding to specification, oversight, and integration decisions.
  • The single-versus-multi-agent split may become less sharp as hybrid designs that combine both styles appear in real systems.
  • If the proposed research directions are pursued, non-experts could gain practical ways to build and maintain software with minimal manual coding.

Load-bearing premise

The survey assumes that the rapidly expanding literature can be cleanly and comprehensively sorted into single-agent and multi-agent categories with no major omissions or alternative groupings that would change the overall picture.

What would settle it

Discovery of several widely cited, high-impact papers on LLM code generation whose architectures or workflows resist placement in either the single-agent or multi-agent category would show that the chosen organizational frame leaves out significant work.

Original abstract

Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys LLM-based code generation agents, claiming they are distinguished from prior techniques by three core features: autonomy to independently manage the full workflow from task decomposition to debugging; expanded scope across the entire software development lifecycle rather than isolated code snippets; and a shift toward engineering practicality including reliability, process management, and tool integration. It traces the developmental trajectory, categorizes core techniques into single-agent and multi-agent architectures, details applications across the SDLC, summarizes benchmarks/metrics and representative tools, and proposes long-term research directions based on identified challenges.

Significance. If the taxonomy and coverage hold, the survey would provide a useful organizing framework for a rapidly expanding subfield at the intersection of LLMs and software engineering, helping researchers identify patterns in agent architectures and gaps in practical deployment. The explicit focus on engineering challenges rather than pure algorithmic novelty is a constructive framing that aligns with industry needs.

major comments (2)
  1. [Abstract and §1] Abstract and §1 (Introduction): The central claim that the three features (autonomy, expanded SDLC scope, and engineering practicality) distinctly characterize LLM-based agents is load-bearing for the entire survey structure, yet the text provides no explicit contrast with prior non-agent code generation methods (e.g., direct LLM prompting or fine-tuned models) to demonstrate that these features are not already present or emergent in earlier work; without this grounding, the subsequent single/multi-agent categorization risks being an arbitrary overlay rather than a natural developmental trajectory.
  2. [§3] §3 (Architectures) and the literature selection description: The single-agent versus multi-agent taxonomy is presented as systematic, but the manuscript does not report search protocol, inclusion/exclusion criteria, database sources, or date range for the surveyed papers; this omission directly undermines the claim that the selected works represent core developments without major omissions, as hybrid or tool-centric systems that do not fit cleanly into the binary split could be under-represented.
minor comments (2)
  1. [Applications section] The abstract lists applications across the full SDLC but the corresponding section would benefit from a table summarizing which agent architectures are applied to which SDLC phases to improve readability.
  2. [Benchmarks section] Ensure that all cited benchmarks (e.g., HumanEval, MBPP extensions) include the exact metrics reported in the original papers rather than paraphrased summaries.
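For reference on the metrics point, HumanEval-style benchmarks are commonly reported with pass@k. The sketch below is not taken from the survey; it assumes the widely used unbiased estimator, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass the unit tests. The function name `pass_at_k` is chosen here for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples generated for the problem
    c: number of those samples that pass all tests
    k: evaluation budget (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1. comb returns 0 when k > n - c, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With 10 samples of which 3 pass, pass@1 equals the raw pass rate of 0.3.
print(round(pass_at_k(10, 3, 1), 4))
print(round(pass_at_k(10, 3, 5), 4))
```

A benchmark-level score is then the mean of this value over all problems, which is one reason paraphrased metric descriptions can be ambiguous: the same name can hide different n and k.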

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments on our survey. These observations help clarify the presentation of our core claims and improve the methodological transparency of the work. We address each major comment below and indicate the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract and §1] Abstract and §1 (Introduction): The central claim that the three features (autonomy, expanded SDLC scope, and engineering practicality) distinctly characterize LLM-based agents is load-bearing for the entire survey structure, yet the text provides no explicit contrast with prior non-agent code generation methods (e.g., direct LLM prompting or fine-tuned models) to demonstrate that these features are not already present or emergent in earlier work; without this grounding, the subsequent single/multi-agent categorization risks being an arbitrary overlay rather than a natural developmental trajectory.

    Authors: We agree that the distinction would benefit from more explicit grounding. The manuscript states that agents are 'distinct from previous code generation techniques' and enumerates the three features, but does not include a direct comparison. In the revision we will insert a short subsection (or expanded paragraph) in §1 that contrasts LLM-based agents with direct prompting and fine-tuned models, using concrete examples to show how autonomy over the full workflow, SDLC breadth, and engineering focus become central only in the agent setting. This addition will better motivate the subsequent taxonomy without altering the survey's scope. revision: yes

  2. Referee: [§3] §3 (Architectures) and the literature selection description: The single-agent versus multi-agent taxonomy is presented as systematic, but the manuscript does not report search protocol, inclusion/exclusion criteria, database sources, or date range for the surveyed papers; this omission directly undermines the claim that the selected works represent core developments without major omissions, as hybrid or tool-centric systems that do not fit cleanly into the binary split could be under-represented.

    Authors: We accept that the current draft lacks a transparent literature-selection description. Although the taxonomy reflects the dominant architectural patterns we observed, we will add a dedicated 'Literature Review Methodology' subsection at the start of §3. It will specify the databases searched (arXiv, Google Scholar, IEEE Xplore, ACM DL), the keyword combinations and date range (primarily 2022–2024), inclusion criteria (papers that explicitly describe LLM-powered agents for code generation), and exclusion criteria, together with a brief note on how hybrid or tool-centric systems are classified within the single- versus multi-agent framework. This revision directly addresses the concern about potential under-representation. revision: yes

Circularity Check

0 steps flagged

No circularity: survey organizes external literature without self-referential derivations

Full rationale

This paper is a literature review that references external prior work to categorize LLM-based code generation agents into single-agent and multi-agent architectures and to trace developmental trajectories. It contains no equations, no fitted parameters, no predictions derived from its own inputs, and no self-citation chains that bear the central claims. The three core features (autonomy, expanded SDLC scope, engineering practicality) are presented as characterizations drawn from the surveyed body of work rather than results forced by the paper's own definitions or citations. Completeness of coverage is an assumption of any survey but does not constitute circularity under the defined criteria, as no reduction of a claimed result to the paper's own inputs is exhibited.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper that reviews and organizes existing research on LLM-based code generation agents. It introduces no new free parameters, mathematical axioms, or invented entities; all content draws from cited prior literature.

pith-pipeline@v0.9.0 · 5753 in / 1189 out tokens · 42990 ms · 2026-05-19T22:58:34.479588+00:00 · methodology


Forward citations

Cited by 24 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

    cs.SE 2026-05 unverdicted novelty 7.0

    SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

  2. From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

    cs.SE 2026-05 unverdicted novelty 7.0

    TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

  3. BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

    cs.CE 2026-05 unverdicted novelty 7.0

    BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

  4. PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

    cs.SE 2026-05 unverdicted novelty 7.0

    PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

  5. AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

    cs.SE 2026-04 unverdicted novelty 7.0

    AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

  6. Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

    cs.SE 2026-04 unverdicted novelty 7.0

    Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

  7. Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

    cs.CL 2026-04 unverdicted novelty 7.0

    LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

  8. Think Anywhere in Code Generation

    cs.SE 2026-03 unverdicted novelty 7.0

    Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

  9. Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

    cs.SE 2026-02 conditional novelty 7.0

    SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

  10. CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

    cs.SE 2025-10 conditional novelty 7.0

    CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reas...

  11. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 6.0

    Adding active search tools to LLM context optimization works only when combined with a multi-candidate search-based training procedure that prunes contexts, delivering gains across low-resource translation, health, an...

  12. AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

    cs.SE 2026-05 unverdicted novelty 6.0

    More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.

  13. QuantClaw: Precision Where It Matters for OpenClaw

    cs.AI 2026-04 unverdicted novelty 6.0

    QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

  14. Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

    cs.SE 2026-04 unverdicted novelty 6.0

    LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

  15. ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

    cs.DC 2026-02 unverdicted novelty 6.0

    ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.

  16. Context Training with Active Information Seeking

    cs.CL 2026-05 unverdicted novelty 5.0

    Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-too...

  17. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  18. TDD Governance for Multi-Agent Code Generation via Prompt Engineering

    cs.SE 2026-04 unverdicted novelty 5.0

    An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code g...

  19. Agentic Insight Generation in VSM Simulations

    cs.CL 2026-04 unverdicted novelty 5.0

    A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.

  20. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 3.0

    The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

  21. Sustainable Code Generation Using Large Language Models: A Systematic Literature Review

    cs.SE 2026-03 unverdicted novelty 3.0

    A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.

  22. LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review

    cs.SE 2026-02 unverdicted novelty 3.0

    A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future res...

  23. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    The paper surveys the conceptual foundations, methodological innovations, challenges, and future directions of agentic reinforcement learning frameworks that embed cognitive capabilities like meta-reasoning and self-r...

  24. Rethinking Agentic Reinforcement Learning In Large Language Models

    cs.AI 2026-04 unverdicted novelty 2.0

    This review synthesizes conceptual foundations, methods, challenges, and future directions for agentic reinforcement learning in large language models.

Reference graph

Works this paper leans on

167 extracted references · 167 canonical work pages · cited by 21 Pith papers · 1 internal anchor

  1. [1]

    Inductive programming: A survey of program synthesis techniques,

    E. Kitzelmann, “Inductive programming: A survey of program synthesis techniques,” inInternational Work- shop on Approaches and Applications of Inductive Pro- gramming (AAIP), 2009, pp. 50–73

  2. [2]

    Latent predictor networks for code generation,

    W. Ling, E. Grefenstette, K. M. Hermann, T. Ko ˇcisk`y, A. Senior, F. Wang, and P . Blunsom, “Latent predictor networks for code generation,” inMeeting of the As- sociation for Computational Linguistics (ACL), 2016, pp. 599–609

  3. [3]

    A syntactic neural model for general-purpose code generation,

    P . Yin and G. Neubig, “A syntactic neural model for general-purpose code generation,” inMeeting of the Association for Computational Linguistics (ACL), 2017, pp. 440–450

  4. [4]

    LLaMA: Open and efficient founda- tion language models,

    H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozi `ere, N. Goyal, E. Hambro, F. Azharet al., “LLaMA: Open and efficient founda- tion language models,” 2023

  5. [5]

    LLaMA 2: Open foundation and fine- tuned chat models,

    H. Touvron, L. Martin, K. Stone, P . Albert, A. Alma- hairi, Y. Babaei, N. Bashlykov, S. Batra, P . Bhargava, S. Bhosaleet al., “LLaMA 2: Open foundation and fine- tuned chat models,” 2023. 19

  6. [6]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wain- wright, P . Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,” inConference on Neural Information Processing Systems (NeurIPS), 2022, pp. 27 730–27 744

  7. [7]

    Code LLaMA: Open foundation models for code,

    B. Roziere, J. Gehring, F. Gloeckle, S. Sootla, I. Gat, X. E. Tan, Y. Adi, J. Liu, R. Sauvestre, T. Remezet al., “Code LLaMA: Open foundation models for code,” 2023

  8. [8]

    CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,

    Y. Wang, W. Wang, S. Joty, and S. C. Hoi, “CodeT5: Identifier-aware unified pre-trained encoder-decoder models for code understanding and generation,” in Conference on Empirical Methods in Natural Language Processing (EMNLP), 2021, pp. 8696–8708

  9. [9]

    Competition-level code generation with Alphacode,

    Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrit- twieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. Dal Lagoet al., “Competition-level code generation with Alphacode,”Science, vol. 378, no. 6624, pp. 1092– 1097, 2022

  10. [10]

    Self-planning code generation with large language models,

    X. Jiang, Y. Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–30, 2024

  11. [11]

    CodeChain: Towards modular code genera- tion through chain of self-revisions with representa- tive sub-modules,

    H. Le, H. Chen, A. Saha, A. Gokul, D. Sahoo, and S. Joty, “CodeChain: Towards modular code genera- tion through chain of self-revisions with representa- tive sub-modules,” inInternational Conference on Learn- ing Representations (ICLR), 2023

  12. [12]

    CodeCoR: An llm- based self-reflective multi-agent framework for code generation,

    R. Pan, H. Zhang, and C. Liu, “CodeCoR: An llm- based self-reflective multi-agent framework for code generation,” 2025

  13. [13]

    Codepori: Large scale model for autonomous software development by using multi- agents,

    Z. Rasheed, M. Waseem, M. Saari, K. Syst ¨a, and P . Abrahamsson, “Codepori: Large scale model for autonomous software development by using multi- agents,” 2024

  14. [14]

    An autonomous multi-agent llm frame- work for agile software development,

    S. Manish, “An autonomous multi-agent llm frame- work for agile software development,”International Journal of Trend in Scientific Research and Development, vol. 8, no. 5, pp. 892–898, 2024

  15. [15]

    Self-collaboration code generation via ChatGPT,

    Y. Dong, X. Jiang, Z. Jin, and G. Li, “Self-collaboration code generation via ChatGPT,”ACM Transactions on Software Engineering and Methodology, vol. 33, no. 7, pp. 1–38, 2024

  16. [16]

    Chatdev: Communicative agents for software development,

    C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Conget al., “Chatdev: Communicative agents for software development,” in Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 15 174–15 186

  17. [17]

    Metagpt: Meta programming for multi-agent collaborative framework,

    S. Hong, X. Zheng, J. Chen, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou et al., “Metagpt: Meta programming for multi-agent collaborative framework,” inInternational Conference on Learning Representations (ICLR), 2023

  18. [18]

    CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,

    S. Lu, D. Guo, S. Ren, J. Huang, A. Svyatkovskiy, A. Blanco, C. Clement, D. Drain, D. Jiang, D. Tang et al., “CodeXGLUE: A machine learning benchmark dataset for code understanding and generation,” in Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021

  19. [19]

    ClarifyGPT: Empowering LLM- based code generation with intention clarification,

    F. Mu, L. Shi, S. Wang, Z. Yu, B. Zhang, C. Wang, S. Liu, and Q. Wang, “ClarifyGPT: Empowering LLM- based code generation with intention clarification,” 2023

  20. [20]

    Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model,

    Z. Wang, W. Wang, Z. Li, L. Wang, C. Yi, X. Xu, L. Cao, H. Su, S. Chen, and J. Zhou, “Xuat-copilot: Multi-agent collaborative system for automated user acceptance testing with large language model,” 2024

  21. [21]

    Logiagent: Auto- mated logical testing for rest systems with llm-based multi-agents,

    K. Zhang, C. Zhang, C. Wang, C. Zhang, Y. Wu, Z. Xing, Y. Liu, Q. Li, and X. Peng, “Logiagent: Auto- mated logical testing for rest systems with llm-based multi-agents,” 2025

  22. [22]

    Ai-driven refactoring: A pipeline for identifying and correcting data clumps in git reposi- tories,

    N. Baumgartner, P . Iyenghar, T. Schoemaker, and E. Pulverm¨uller, “Ai-driven refactoring: A pipeline for identifying and correcting data clumps in git reposi- tories,”Electronics, vol. 13, no. 9, p. 1644, 2024

  23. [23]

    Abstract syntax networks for code generation and semantic parsing,

    M. Rabinovich, M. Stern, and D. Klein, “Abstract syntax networks for code generation and semantic parsing,” inMeeting of the Association for Computational Linguistics (ACL), 2017, pp. 1139–1149

  24. [24]

    GraphCodeBERT: Pre-training code representations with data flow,

    D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fuet al., “GraphCodeBERT: Pre-training code representations with data flow,” inInternational Conference on Learning Representations, 2021

  25. [25]

    UniXcoder: Unified cross-modal pre-training for code representation,

    D. Guo, S. Lu, N. Duan, Y. Wang, M. Zhou, and J. Yin, “UniXcoder: Unified cross-modal pre-training for code representation,” pp. 7212–7225, 2022

  26. [26]

    Qualityflow: An agentic workflow for program synthesis controlled by llm quality checks,

    Y. Hu, Q. Zhou, Q. Chen, X. Li, L. Liu, D. Zhang, A. Kachroo, T. Oz, and O. Tripp, “Qualityflow: An agentic workflow for program synthesis controlled by llm quality checks,” 2025

  27. [27]

    Self-organized agents: A LLM multi-agent framework toward ultra large- scale code generation and optimization,

    Y. Ishibashi and Y. Nishimura, “Self-organized agents: A LLM multi-agent framework toward ultra large- scale code generation and optimization,” 2024

  28. [28]

    CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges,

    K. Zhang, J. Li, G. Li, X. Shi, and Z. Jin, “CodeAgent: Enhancing code generation with tool-integrated agent systems for real-world repo-level coding challenges,” inMeeting of the Association for Computational Linguis- tics (ACL), 2024

  29. [29]

    Toolgen: Unified tool retrieval and calling via generation,

    R. Wang, X. Han, L. Ji, S. Wang, T. Baldwin, and H. Li, “Toolgen: Unified tool retrieval and calling via generation,” inInternational Conference on Learning Representations (ICLR), 2025

  30. [30]

    Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI,

    R. Sapkota, K. I. Roumeliotis, and M. Karkee, “Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic AI,” 2025

  31. [31]

    Adaptive test generation using a large language model,

    M. Sch ¨afer, S. Nadi, A. Eghbali, and F. Tip, “Adaptive test generation using a large language model,” 2023

  32. [32]

    A multi-agent llm-based juit test generation with strong oracles,

    Q. Xu, G. Wang, L. Briand, and K. Liu, “A multi-agent llm-based juit test generation with strong oracles,” 2025

  33. [33]

    Leveraging llms to automate energy-aware refactor- ing of parallel scientific codes,

    M. T. Dearing, Y. Tao, X. Wu, Z. Lan, and V . Taylor, “Leveraging llms to automate energy-aware refactor- ing of parallel scientific codes,” 2025

  34. [34]

    Sysllmatic: Large language models are software system optimizers,

    H. Peng, A. Gupte, R. Hasler, N. J. Eliopoulos, C.- C. Ho, R. Mantri, L. Deng, K. L ¨aufer, G. K. Thiru- vathukal, and J. C. Davis, “Sysllmatic: Large language models are software system optimizers,” 2025

  35. [35]

    Harnessing large language models for seed generation in greybox 20 fuzzing,

    W. Shi, Y. Zhang, X. Xing, and J. Xu, “Harnessing large language models for seed generation in greybox 20 fuzzing,” 2024

  36. [36]

    Mutation- guided llm-based test generation at meta,

    C. Foster, A. Gulati, M. Harman, I. Harper, K. Mao, J. Ritchey, H. Robert, and S. Sengupta, “Mutation- guided llm-based test generation at meta,” 2025

  37. [37]

    From LLMs to LLM-based agents for software engi- neering: A survey of current, challenges and future,

    H. Jin, L. Huang, H. Cai, J. Yan, B. Li, and H. Chen, “From LLMs to LLM-based agents for software engi- neering: A survey of current, challenges and future,” 2024

  38. [38]

    Large language model-based agents for software engineering: A survey,

    J. Liu, K. Wang, Y. Chen, X. Peng, Z. Chen, L. Zhang, and Y. Lou, “Large language model-based agents for software engineering: A survey,” 2024

  39. [39]

    Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead,

    J. He, C. Treude, and D. Lo, “Llm-based multi-agent systems for software engineering: Literature review, vision, and the road ahead,”ACM Transactions on Software Engineering and Methodology, vol. 34, no. 5, pp. 1–30, 2025

  40. [40]

    Agents in software engineering: Survey, landscape, and vision,

    Y. Wang, W. Zhong, Y. Huang, E. Shi, M. Yang, J. Chen, H. Li, Y. Ma, Q. Wang, and Z. Zheng, “Agents in software engineering: Survey, landscape, and vision,” 2024

  41. [41]

    Intellicode compose: Code generation using transformer,

    A. Svyatkovskiy, S. K. Deng, S. Fu, and N. Sundaresan, “Intellicode compose: Code generation using transformer,” in ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1433–1443

  42. [42]

    Code generation in action

    J. Herrington, Code generation in action. Manning Publications Co., 2003

  43. [43]

    Programming is hard-or at least it used to be: Educational opportunities and challenges of AI code generation,

    B. A. Becker, P. Denny, J. Finnie-Ansley, A. Luxton-Reilly, J. Prather, and E. A. Santos, “Programming is hard-or at least it used to be: Educational opportunities and challenges of AI code generation,” in ACM Technical Symposium on Computer Science Education V. 1 (SIGCSE), 2023, pp. 500–506

  44. [44]

    In-IDE code generation from natural language: Promise and challenges,

    F. F. Xu, B. Vasilescu, and G. Neubig, “In-IDE code generation from natural language: Promise and challenges,” ACM Transactions on Software Engineering and Methodology (TOSEM), vol. 31, no. 2, pp. 1–47, 2022

  45. [45]

    Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,

    N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,” 2025

  46. [46]

    ChatGPT for good? on opportunities and challenges of large language models for education,

    E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “ChatGPT for good? on opportunities and challenges of large language models for education,” Learning and Individual Differences, vol. 103, p. 102274, 2023

  47. [47]

    A survey on evaluation of large language models,

    Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang et al., “A survey on evaluation of large language models,” ACM Transactions on Intelligent Systems and Technology, vol. 15, no. 3, pp. 1–45, 2024

  48. [48]

    A survey of large language models,

    W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong et al., “A survey of large language models,” 2023

  49. [49]

    A comprehensive overview of large language models,

    H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Akhtar, N. Barnes, and A. Mian, “A comprehensive overview of large language models,” ACM Transactions on Intelligent Systems and Technology, 2023

  50. [50]

    Attention is all you need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Conference on Neural Information Processing Systems (NeurIPS), 2017

  51. [51]

    Evaluating large language models trained on code,

    M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. D. O. Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman et al., “Evaluating large language models trained on code,” 2021

  52. [52]

    Deepseek-coder: When the large language model meets programming – the rise of code intelligence,

    D. Guo, Q. Zhu, D. Yang, Z. Xie, K. Dong, W. Zhang, G. Chen, X. Bi, Y. Wu, Y. Li et al., “Deepseek-coder: When the large language model meets programming – the rise of code intelligence,” 2024

  53. [53]

    Qwen2.5-coder technical report,

    B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, L. Zhang, T. Liu, J. Zhang, B. Yu, K. Lu et al., “Qwen2.5-coder technical report,” 2024

  54. [54]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” in Conference on Neural Information Processing Systems (NeurIPS), 2022, pp. 24 824–24 837

  55. [55]

    Planning in natural language improves LLM search for code generation,

    E. Wang, F. Cassano, C. Wu, Y. Bai, W. Song, V. Nath, Z. Han, S. Hendryx, S. Yue, and H. Zhang, “Planning in natural language improves LLM search for code generation,” in International Conference on Learning Representations (ICLR), 2025

  56. [56]

    WebGPT: Browser-assisted question-answering with human feedback,

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. Schulman, “WebGPT: Browser-assisted question-answering with human feedback,” 2022

  57. [57]

    Toolformer: Language models can teach themselves to use tools,

    T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” in Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 68 539–68 551

  58. [58]

    Pal: Program-aided language models,

    L. Gao, A. Madaan, S. Zhou, U. Alon, P. Liu, Y. Yang, J. Callan, and G. Neubig, “Pal: Program-aided language models,” in International Conference on Machine Learning (ICML), 2023

  59. [59]

    Generative agents: Interactive simulacra of human behavior,

    J. S. Park, J. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein, “Generative agents: Interactive simulacra of human behavior,” in Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology, 2023, pp. 1–22

  60. [60]

    Improving grounded language understanding in a collaborative environment by interacting with agents through help feedback,

    N. Mehta, M. Teruel, P. F. Sanz, X. Deng, A. H. Awadallah, and J. Kiseleva, “Improving grounded language understanding in a collaborative environment by interacting with agents through help feedback,” arXiv preprint arXiv:2304.10750, 2023

  61. [61]

    The rise and potential of large language model based agents: A survey,

    Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou et al., “The rise and potential of large language model based agents: A survey,” Science China Information Sciences, vol. 68, no. 2, p. 121101, 2025

  62. [62]

    A survey on large language model based autonomous agents,

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin et al., “A survey on large language model based autonomous agents,” Frontiers of Computer Science, vol. 18, no. 6, p. 186345, 2024

  63. [63]

    Large language model based multi-agents: A survey of progress and challenges,

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang, “Large language model based multi-agents: A survey of progress and challenges,” 2024

  64. [64]

    Exploring large language model based intelligent agents: Definitions, methods, and prospects,

    Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao et al., “Exploring large language model based intelligent agents: Definitions, methods, and prospects,” 2024

  65. [65]

    Prompt engineering with ChatGPT: a guide for academic writers,

    L. Giray, “Prompt engineering with ChatGPT: a guide for academic writers,” Annals of Biomedical Engineering, vol. 51, no. 12, pp. 2629–2633, 2023

  66. [66]

    A prompt pattern catalog to enhance prompt engineering with chatgpt,

    J. White, Q. Fu, S. Hays, M. Sandborn, C. Olea, H. Gilbert, A. Elnashar, J. Spencer-Smith, and D. C. Schmidt, “A prompt pattern catalog to enhance prompt engineering with chatgpt,” 2023

  67. [67]

    Retrieval-augmented generation for large language models: A survey,

    Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, H. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” 2023

  68. [68]

    Retrieval-augmented generation for knowledge-intensive nlp tasks,

    P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih, T. Rocktäschel et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” in Conference on Neural Information Processing Systems (NeurIPS), 2020, pp. 9459–9474

  69. [69]

    Large language model-aware in-context learning for code generation,

    J. Li, C. Tao, J. Li, G. Li, Z. Jin, H. Zhang, Z. Fang, and F. Liu, “Large language model-aware in-context learning for code generation,” ACM Transactions on Software Engineering and Methodology, 2023

  70. [70]

    Larger language models do in-context learning differently,

    J. Wei, J. Wei, Y. Tay, D. Tran, A. Webson, Y. Lu, X. Chen, H. Liu, D. Huang, D. Zhou et al., “Larger language models do in-context learning differently,” 2023

  71. [71]

    AgentCoder: Multi-agent-based code generation with iterative testing and optimisation,

    D. Huang, J. M. Zhang, M. Luck, Q. Bu, Y. Qing, and H. Cui, “AgentCoder: Multi-agent-based code generation with iterative testing and optimisation,” 2023

  72. [72]

    HyperAgent: Generalist software engineering agents to solve coding tasks at scale,

    H. N. Phan, T. N. Nguyen, P. X. Nguyen, and N. D. Bui, “HyperAgent: Generalist software engineering agents to solve coding tasks at scale,” 2024

  73. [73]

    ToolCoder: Teach code generation models to use API search tools,

    K. Zhang, H. Zhang, G. Li, J. Li, Z. Li, and Z. Jin, “ToolCoder: Teach code generation models to use API search tools,” 2023

  74. [74]

    Repohyper: Better context retrieval is all you need for repository-level code completion,

    H. N. Phan, H. N. Phan, T. N. Nguyen, and N. D. Bui, “Repohyper: Better context retrieval is all you need for repository-level code completion,” 2024

  75. [75]

    Self-refine: Iterative refinement with self-feedback,

    A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang et al., “Self-refine: Iterative refinement with self-feedback,” in Conference on Neural Information Processing Systems (NeurIPS), 2023, pp. 46 534–46 594

  76. [76]

    Self-Edit: Fault-aware code editor for code generation,

    K. Zhang, Z. Li, J. Li, G. Li, and Z. Jin, “Self-Edit: Fault-aware code editor for code generation,” in Meeting of the Association for Computational Linguistics (ACL), 2023, pp. 769–787

  77. [77]

    Executable code actions elicit better LLM agents,

    X. Wang, Y. Chen, L. Yuan, Y. Zhang, Y. Li, H. Peng, and H. Ji, “Executable code actions elicit better LLM agents,” in International Conference on Machine Learning (ICML), 2024

  78. [78]

    Knowledge-aware code generation with large language models,

    T. Huang, Z. Sun, Z. Jin, G. Li, and C. Lyu, “Knowledge-aware code generation with large language models,” in IEEE/ACM International Conference on Program Comprehension (ICPC), 2024, pp. 52–63

  79. [79]

    A real-world webagent with planning, long context understanding, and program synthesis,

    I. Gur, H. Furuta, A. Huang, M. Safdari, Y. Matsuo, D. Eck, and A. Faust, “A real-world webagent with planning, long context understanding, and program synthesis,” in International Conference on Learning Representations (ICLR), 2024

  80. [80]

    Codeplan: Repository-level coding using LLMs and planning,

    R. Bairi, A. Sonwane, A. Kanade, V. D. C, A. Iyer, S. Parthasarathy, S. Rajamani, B. Ashok, and S. Shet, “Codeplan: Repository-level coding using LLMs and planning,” ACM on Software Engineering, vol. 1, no. FSE, pp. 675–698, 2024
