A Survey on Code Generation with LLM-based Agents

Yihong Dong, Xue Jiang, Jiaru Qian, Tian Wang, Kechi Zhang, Zhi Jin · 2025 · cs.SE · arXiv 2508.00083

Canonical reference. 89% of classified citations cite this work as background.

27 Pith papers cite it.

abstract

Code generation agents powered by large language models (LLMs) are revolutionizing the software development paradigm. Distinct from previous code generation techniques, code generation agents are characterized by three core features. 1) Autonomy: the ability to independently manage the entire workflow, from task decomposition to coding and debugging. 2) Expanded task scope: capabilities that extend beyond generating code snippets to encompass the full software development lifecycle (SDLC). 3) Enhancement of engineering practicality: a shift in research emphasis from algorithmic innovation toward practical engineering challenges, such as system reliability, process management, and tool integration. This domain has recently witnessed rapid development and an explosion in research, demonstrating significant application potential. This paper presents a systematic survey of the field of LLM-based code generation agents. We trace the technology's developmental trajectory from its inception and systematically categorize its core techniques, including both single-agent and multi-agent architectures. Furthermore, this survey details the applications of LLM-based agents across the full SDLC, summarizes mainstream evaluation benchmarks and metrics, and catalogs representative tools. Finally, by analyzing the primary challenges, we identify and propose several foundational, long-term research directions for the future work of the field.

citation-role summary

background 9

citation-polarity summary

background 8 · unclear 1

representative citing papers

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements

cs.SE · 2026-05-17 · unverdicted · novelty 7.0

TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks

cs.CE · 2026-05-15 · unverdicted · novelty 7.0

BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization

cs.SE · 2026-05-13 · unverdicted · novelty 7.0

PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search

cs.SE · 2026-04-12 · unverdicted · novelty 7.0

AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation

cs.SE · 2026-04-03 · unverdicted · novelty 7.0

Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy

cs.CL · 2026-04-03 · unverdicted · novelty 7.0

LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

Think Anywhere in Code Generation

cs.SE · 2026-03-31 · unverdicted · novelty 7.0

Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

cs.SE · 2026-02-06 · conditional · novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding

cs.SE · 2025-11-30 · unverdicted · novelty 7.0 · 2 refs

Human-AI collaboration on CentaurEval's collaboration-necessary tasks reaches 31.11% success, far above standalone humans at 18.89% or LLMs at 0.67%.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment

cs.SE · 2025-10-21 · conditional · novelty 7.0

CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development

cs.SE · 2026-05-04 · unverdicted · novelty 6.0

More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.

QuantClaw: Precision Where It Matters for OpenClaw

cs.AI · 2026-04-24 · unverdicted · novelty 6.0

QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution

cs.SE · 2026-04-07 · unverdicted · novelty 6.0

LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness

cs.DC · 2026-02-14 · unverdicted · novelty 6.0

ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.

Large Language Model Agent for User-friendly Chemical Process Simulations

physics.chem-ph · 2026-01-15 · unverdicted · novelty 6.0

An LLM agent integrated with AVEVA Process Simulation via MCP enables natural language driven flowsheet analysis, optimization, and construction for chemical separation processes.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum

cs.RO · 2026-05-20 · unverdicted · novelty 5.0

A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · novelty 5.0

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

Context Training with Active Information Seeking

cs.CL · 2026-05-13 · unverdicted · novelty 5.0 · 2 refs

Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-tool baselines.

Nautilus: From One Prompt to Plug-and-Play Robot Learning

cs.RO · 2026-05-12 · unverdicted · novelty 5.0

NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

TDD Governance for Multi-Agent Code Generation via Prompt Engineering

cs.SE · 2026-04-29 · unverdicted · novelty 5.0

An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code generation.

Agentic Insight Generation in VSM Simulations

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.

Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development

cs.SE · 2026-05-19 · unverdicted · novelty 4.0

Agentic Agile-V uses Agile-V as backbone and a Specify-Constrain-Orchestrate-Prove-Evolve-Verify loop to convert AI agent conversations into traceable engineering artifacts with acceptance evidence.

citing papers explorer

Showing 27 of 27 citing papers.

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning
cs.AI · 2026-05-19 · unverdicted · none · ref 15 · internal anchor
PRISM benchmark of over 10k pairs shows LLMs have a 41% average drop from code execution success to spatial correctness in programmatic video generation.

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering
cs.SE · 2026-05-17 · unverdicted · none · ref 12 · internal anchor
SaaSBench introduces a heterogeneous benchmark for enterprise SaaS engineering and shows that state-of-the-art coding agents fail over 95% of the time before reaching deep business logic due to setup and integration problems.

From Runnable to Shippable: Multi-Agent Test-Driven Development for Generating Full-Stack Web Applications from Requirements
cs.SE · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
TDDev automates the full TDD loop for web app generation from requirements, delivering 34-48 percentage point quality gains and zero manual intervention in user studies.

BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
cs.CE · 2026-05-15 · unverdicted · none · ref 46 · internal anchor
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.

PerfCodeBench: Benchmarking LLMs for System-Level High-Performance Code Optimization
cs.SE · 2026-05-13 · unverdicted · none · ref 8 · internal anchor
PerfCodeBench reveals that state-of-the-art LLMs produce functionally correct but significantly slower code than expert-optimized versions on system-level tasks, especially those involving parallelism and GPUs.

AdverMCTS: Combating Pseudo-Correctness in Code Generation via Adversarial Monte Carlo Tree Search
cs.SE · 2026-04-12 · unverdicted · none · ref 12 · internal anchor
AdverMCTS frames code generation as a minimax game where an attacker evolves tests to expose flaws in solver-generated code, yielding more robust outputs than static-test baselines.

Evaluating the Environmental Impact of using SLMs and Prompt Engineering for Code Generation
cs.SE · 2026-04-03 · unverdicted · none · ref 10 · internal anchor
Chain-of-Thought prompting balances high accuracy with low energy use in small language models for code generation, while multi-sampling strategies add high energy costs for small accuracy gains.

Evaluating the Formal Reasoning Capabilities of Large Language Models through Chomsky Hierarchy
cs.CL · 2026-04-03 · unverdicted · none · ref 16 · internal anchor
LLMs display clear performance stratification on formal language tasks aligned with Chomsky hierarchy complexity levels, limited by severe efficiency barriers rather than absolute capability.

Think Anywhere in Code Generation
cs.SE · 2026-03-31 · unverdicted · none · ref 6 · internal anchor
Think-Anywhere lets LLMs invoke on-demand reasoning at any token during code generation via cold-start imitation followed by outcome-based RL, reaching state-of-the-art results on LeetCode, LiveCodeBench, HumanEval, and MBPP.

Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
cs.SE · 2026-02-06 · conditional · none · ref 10 · internal anchor
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

CentaurEval: Benchmarking Human-in-the-Loop Value in Agentic Coding
cs.SE · 2025-11-30 · unverdicted · none · ref 1 · 2 links · internal anchor
Human-AI collaboration on CentaurEval's collaboration-necessary tasks reaches 31.11% success, far above standalone humans at 18.89% or LLMs at 0.67%.

CodeRL+: Improving Code Generation via Reinforcement with Execution Semantics Alignment
cs.SE · 2025-10-21 · conditional · none · ref 6 · internal anchor
CodeRL+ integrates variable-level execution trajectory inference into RLVR training to align textual code representations with execution semantics, delivering 4.6% relative pass@1 gains and generalization to code-reasoning and test-output tasks.

AI-Generated Smells: An Analysis of Code and Architecture in LLM and Agent-Driven Development
cs.SE · 2026-05-04 · unverdicted · none · ref 9 · internal anchor
More capable LLMs and agents generate code with greater volume and architectural decay, following a Volume-Quality Inverse Law that neither functional correctness nor prompting mitigates.

QuantClaw: Precision Where It Matters for OpenClaw
cs.AI · 2026-04-24 · unverdicted · none · ref 9 · internal anchor
QuantClaw dynamically routes precision in agent workflows to cut cost by up to 21.4% and latency by 15.7% while keeping or improving task performance.

Does Pass Rate Tell the Whole Story? Evaluating Design Constraint Compliance in LLM-based Issue Resolution
cs.SE · 2026-04-07 · unverdicted · none · ref 13 · internal anchor
LLM agents resolve fewer than half of issues while satisfying design constraints despite passing tests, as shown by a benchmark of 495 issues and 1787 constraints from six repositories.

ACE-Bench: A Lightweight Benchmark for Evaluating Azure SDK Usage Correctness
cs.DC · 2026-02-14 · unverdicted · none · ref 3 · internal anchor
ACE-Bench is an execution-free benchmark that scores LLM coding agents on correct Azure SDK usage via deterministic regex checks and reference-based LLM judges derived from official documentation.

Large Language Model Agent for User-friendly Chemical Process Simulations
physics.chem-ph · 2026-01-15 · unverdicted · none · ref 9 · internal anchor
An LLM agent integrated with AVEVA Process Simulation via MCP enables natural language driven flowsheet analysis, optimization, and construction for chemical separation processes.

Humanoid Whole-Body Manipulation via Active Spatial Brain and Generalizable Action Cerebellum
cs.RO · 2026-05-20 · unverdicted · none · ref 8 · internal anchor
A multi-agent LLM framework for humanoid loco-manipulation that separates active spatial perception and task planning from generalizable action generation without task-specific real-robot data.

Code as Agent Harness
cs.CL · 2026-05-18 · accept · none · ref 177 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

Context Training with Active Information Seeking
cs.CL · 2026-05-13 · unverdicted · none · ref 23 · 2 links · internal anchor
Active information seeking via search tools, when combined with multi-candidate context pruning during training, produces consistent gains on translation, health, and reasoning tasks over naive tool addition or no-tool baselines.

Nautilus: From One Prompt to Plug-and-Play Robot Learning
cs.RO · 2026-05-12 · unverdicted · none · ref 2 · internal anchor
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

TDD Governance for Multi-Agent Code Generation via Prompt Engineering
cs.SE · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
An AI-native TDD framework operationalizes classical TDD principles as prompt-level and workflow-level governance mechanisms in a layered multi-agent architecture to improve stability and reproducibility of LLM code generation.

Agentic Insight Generation in VSM Simulations
cs.CL · 2026-04-14 · unverdicted · none · ref 6 · internal anchor
A two-step agentic system for extracting insights from VSM simulations achieves up to 86% accuracy with top LLMs by using progressive data discovery and slim context.

Agentic Agile-V: From Vibe Coding to Verified Engineering in Software and Hardware Development
cs.SE · 2026-05-19 · unverdicted · none · ref 3 · internal anchor
Agentic Agile-V uses Agile-V as backbone and a Specify-Constrain-Orchestrate-Prove-Evolve-Verify loop to convert AI agent conversations into traceable engineering artifacts with acceptance evidence.

Rethinking Agentic Reinforcement Learning In Large Language Models
cs.AI · 2026-04-30 · unverdicted · none · ref 14 · 3 links · internal anchor
The paper reviews conceptual foundations, methodological innovations, effective designs, critical challenges, and future directions for LLM-based Agentic Reinforcement Learning.

Sustainable Code Generation Using Large Language Models: A Systematic Literature Review
cs.SE · 2026-03-01 · unverdicted · none · ref 49 · internal anchor
A systematic review finds research on the sustainability of LLM-generated code to be limited, fragmented, and without accepted frameworks for measurement or benchmarking.

LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
cs.SE · 2026-02-25 · unverdicted · none · ref 86 · internal anchor
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
