AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation

Dong Huang, Jie M. Zhang, Michael Luck, Qingwen Bu, Yuhao Qing, Heming Cui · 2023 · cs.CL · arXiv 2312.13010

Canonical reference. 70% of citing Pith papers cite this work as background.

40 Pith papers citing it

abstract

The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3% and 91.8% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2% and 78.9% pass@1 with an overall token overhead of 138.2K and 206.5K.
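The three-agent loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch, not the authors' implementation: the names `Feedback`, `execute_tests`, `agent_loop`, `stub_programmer`, and `stub_test_designer` are hypothetical, the two LLM-backed agents are replaced by plain Python stubs, the tests are assumed to be generated once from the task description, and the round limit `max_rounds` is an assumption. Running generated code with `exec` is also only acceptable for a toy example; a real executor would need sandboxing.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Feedback:
    passed: bool
    message: str


def execute_tests(code: str, tests: List[str]) -> Feedback:
    """Test executor (hypothetical): run the code, then each test, and report the first failure."""
    namespace: dict = {}
    try:
        exec(code, namespace)  # toy example only; not sandboxed
    except Exception as exc:
        return Feedback(False, f"code failed to load: {exc!r}")
    for test in tests:
        try:
            exec(test, namespace)
        except Exception as exc:
            return Feedback(False, f"test {test!r} failed: {exc!r}")
    return Feedback(True, "all tests passed")


def agent_loop(
    task: str,
    programmer: Callable[[str, Optional[Feedback]], str],
    test_designer: Callable[[str], List[str]],
    max_rounds: int = 3,  # assumed limit, not taken from the paper
) -> Tuple[str, Feedback]:
    """Alternate between code generation and test execution until the tests pass."""
    tests = test_designer(task)  # assumption: tests are designed once per task
    feedback: Optional[Feedback] = None
    code = ""
    for _ in range(max_rounds):
        code = programmer(task, feedback)      # generate or refine using feedback
        feedback = execute_tests(code, tests)  # executor writes feedback
        if feedback.passed:
            break
    return code, feedback


def stub_programmer(task: str, feedback: Optional[Feedback]) -> str:
    """Stand-in for the LLM programmer: buggy first draft, fixed after feedback."""
    if feedback is None:
        return "def add(a, b):\n    return a - b\n"
    return "def add(a, b):\n    return a + b\n"


def stub_test_designer(task: str) -> List[str]:
    """Stand-in for the LLM test designer: returns assertion strings."""
    return ["assert add(1, 2) == 3", "assert add(-1, 1) == 0"]


code, feedback = agent_loop("Write add(a, b).", stub_programmer, stub_test_designer)
print(feedback.passed, feedback.message)
```

In this sketch the first draft fails the first assertion, the failure message is passed back to the programmer stub, and the second draft passes. Swapping the stubs for model calls would keep the control flow unchanged.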

citation-role summary

background 8 · baseline 1 · method 1

citation-polarity summary

background 7 · baseline 1 · unclear 1 · use method 1

representative citing papers

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale

cs.LG · 2026-05-14 · conditional · novelty 7.0

FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.

ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation

cs.SE · 2026-05-04 · unverdicted · novelty 7.0

ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.

BIM Information Extraction Through LLM-based Adaptive Exploration

cs.CL · 2026-05-03 · unverdicted · novelty 7.0

LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Evaluating LLM Agents on Automated Software Analysis Tasks

cs.SE · 2026-04-13 · unverdicted · novelty 7.0

A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.

HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

cs.AI · 2026-04-10 · unverdicted · novelty 7.0

HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories

cs.SE · 2026-04-08 · unverdicted · novelty 7.0

ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.

An Iterative Test-and-Repair Framework for Competitive Code Generation

cs.SE · 2026-04-07 · unverdicted · novelty 7.0

FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.

BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations

cs.NE · 2026-03-30 · unverdicted · novelty 7.0

BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.

Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation

cs.SE · 2026-02-06 · conditional · novelty 7.0

SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

cs.SE · 2025-12-20 · unverdicted · novelty 7.0

SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.

MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering

cs.CL · 2024-10-09 · unverdicted · novelty 7.0

MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.

LLM Agents can Autonomously Exploit One-day Vulnerabilities

cs.CR · 2024-04-11 · unverdicted · novelty 7.0

GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.

Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework

cs.MA · 2026-05-22 · unverdicted · novelty 6.0

TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.

AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs

cs.SE · 2026-05-17 · unverdicted · novelty 6.0

A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.

Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution

cs.AI · 2026-05-14 · unverdicted · novelty 6.0

Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.

Conformal Agent Error Attribution

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.

Tail-aware N-version Machine Learning Models for Reliable API Recommendation

cs.SE · 2026-04-30 · unverdicted · novelty 6.0

NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.

SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

cs.SE · 2026-04-28 · unverdicted · novelty 6.0

SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.

No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows

cs.SE · 2026-04-25 · unverdicted · novelty 6.0

MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.

You Don't Need Public Tests to Generate Correct Code

cs.SE · 2026-04-23 · unverdicted · novelty 6.0

DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.

Explicit Trait Inference for Multi-Agent Coordination

cs.AI · 2026-04-21 · unverdicted · novelty 6.0

ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.

ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation

cs.SE · 2026-04-13 · unverdicted · novelty 6.0

ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.

citing papers explorer

Code as Agent Harness

cs.CL · 2026-05-18 · accept · none · ref 50 · internal anchor

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
