hub Canonical reference

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M.Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, Heming Cui · 2023 · cs.CL · arXiv 2312.13010

Canonical reference. 78% of citing Pith papers cite this work as background.

40 Pith papers citing it

Background 78% of classified citations

open full Pith review browse 40 citing papers arXiv PDF

abstract

The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3\% and 91.8\% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2\% and 78.9\% pass@1 with an overall token overhead of 138.2K and 206.5K.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 8 method 1

citation-polarity summary

background 7 unclear 1 use method 1

representative citing papers

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

cs.LG · 2026-05-14 · conditional · novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

An Iterative Test-and-Repair Framework for Competitive Code Generation

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

cs.NE · 2026-03-30 · unverdicted · novelty 7.0

BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

cs.SE · 2026-02-06 · conditional · novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

cs.CL · 2024-10-09 · unverdicted · novelty 7.0

MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework

cs.MA · 2026-05-22 · unverdicted · novelty 6.0

TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.

AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.

Conformal Agent Error Attribution

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.

Tail-aware N-version Machine Learning Models for Reliable API Recommendation

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

cs.SE · 2026-04-28 · unverdicted · novelty 6.0

SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.

No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows

cs.SE · 2026-04-25 · unverdicted · novelty 6.0

MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.

You Don't Need Public Tests to Generate Correct Code

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

Explicit Trait Inference for Multi-Agent Coordination

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.

ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation

cs.SE · 2026-04-13 · unverdicted · novelty 6.0

ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.

citing papers explorer

Showing 40 of 40 citing papers.

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints cs.AI · 2026-05-18 · unverdicted · none · ref 21 · internal anchor
Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale cs.LG · 2026-05-14 · conditional · none · ref 12 · internal anchor
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation cs.SE · 2026-05-04 · unverdicted · none · ref 27 · internal anchor
ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
BIM Information Extraction Through LLM-based Adaptive Exploration cs.CL · 2026-05-03 · unverdicted · none · ref 39 · internal anchor
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
Social Bias in LLM-Generated Code: Benchmark and Mitigation cs.SE · 2026-05-01 · unverdicted · none · ref 95 · internal anchor
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
Evaluating LLM Agents on Automated Software Analysis Tasks cs.SE · 2026-04-13 · unverdicted · none · ref 30 · internal anchor
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help? cs.AI · 2026-04-10 · unverdicted · none · ref 11 · internal anchor
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories cs.SE · 2026-04-08 · unverdicted · none · ref 25 · internal anchor
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
An Iterative Test-and-Repair Framework for Competitive Code Generation cs.SE · 2026-04-07 · unverdicted · none · ref 20 · internal anchor
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations cs.NE · 2026-03-30 · unverdicted · none · ref 15 · internal anchor
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation cs.SE · 2026-02-06 · conditional · none · ref 20 · internal anchor
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios cs.SE · 2025-12-20 · unverdicted · none · ref 21 · internal anchor
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering cs.CL · 2024-10-09 · unverdicted · none · ref 13 · internal anchor
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
LLM Agents can Autonomously Exploit One-day Vulnerabilities cs.CR · 2024-04-11 · unverdicted · none · ref 6 · internal anchor
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework cs.MA · 2026-05-22 · unverdicted · none · ref 32 · internal anchor
TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.
AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs cs.SE · 2026-05-17 · unverdicted · none · ref 16 · internal anchor
A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution cs.AI · 2026-05-14 · unverdicted · none · ref 14 · internal anchor
Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.
Conformal Agent Error Attribution cs.LG · 2026-05-07 · unverdicted · none · ref 16 · internal anchor
A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.
Tail-aware N-version Machine Learning Models for Reliable API Recommendation cs.SE · 2026-04-30 · unverdicted · none · ref 12 · internal anchor
NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing? cs.SE · 2026-04-28 · unverdicted · none · ref 9 · internal anchor
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows cs.SE · 2026-04-25 · unverdicted · none · ref 7 · internal anchor
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
You Don't Need Public Tests to Generate Correct Code cs.SE · 2026-04-23 · unverdicted · none · ref 6 · internal anchor
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
Explicit Trait Inference for Multi-Agent Coordination cs.AI · 2026-04-21 · unverdicted · none · ref 21 · internal anchor
ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation cs.SE · 2026-04-13 · unverdicted · none · ref 36 · internal anchor
ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding cs.CL · 2025-10-09 · unverdicted · none · ref 10 · internal anchor
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation cs.SE · 2025-10-04 · conditional · none · ref 1 · internal anchor
ACToR improves C-to-Rust translation correctness via an adversarial LLM-agent loop that generates differential fuzz tests to drive iterative refinements, achieving over 90% pass rates on 63 real-world utilities.
Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback cs.SE · 2025-06-23 · unverdicted · none · ref 44 · internal anchor
PGS generates property-oriented, structurally minimal feedback from high-level program properties to refine LLM code, yielding up to 13.4% pass@1 gains and 1.4-1.6x higher bug-fix rates than prior TDD and debugging baselines.
Mutation-Guided Unit Test Generation with a Large Language Model cs.SE · 2025-06-03 · conditional · none · ref 38 · internal anchor
MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
Code as Agent Harness cs.CL · 2026-05-18 · accept · none · ref 50 · internal anchor
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback cs.SE · 2026-05-18 · unverdicted · none · ref 35 · internal anchor
A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation cs.SE · 2026-04-17 · unverdicted · none · ref 20 · internal anchor
REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
Code Semantic Zooming cs.HC · 2025-10-07 · unverdicted · none · ref 18 · internal anchor
CodeZoom is a pseudocode-based multi-layer abstraction tool that improves developer control and comprehension over LLM code generation compared to direct use of agents like Claude Code.
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems cs.AI · 2025-08-10 · unverdicted · none · ref 39 · internal anchor
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation cs.SE · 2024-04-01 · unverdicted · none · ref 1 · internal anchor
LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.
Position: agentic AI orchestration should be Bayes-consistent cs.AI · 2026-05-01 · unverdicted · none · ref 9 · internal anchor
Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence cs.AI · 2025-07-28 · accept · none · ref 163 · internal anchor
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
Specification-Driven Code Translation Powered by Large Language Models: How Far Are We? cs.SE · 2024-12-05 · unverdicted · none · ref 15 · internal anchor
NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.
Large Language Model-Based Agents for Software Engineering: A Survey cs.SE · 2024-09-04 · unverdicted · none · ref 109 · internal anchor
A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review cs.SE · 2026-02-25 · unverdicted · none · ref 6 · internal anchor
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 106 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.

AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer