Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
hub Canonical reference
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
Canonical reference. 78% of citing Pith papers cite this work as background.
abstract
The advancement of natural language processing (NLP) has been significantly boosted by the development of transformer-based large language models (LLMs). These models have revolutionized NLP tasks, particularly in code generation, aiding developers in creating software with enhanced efficiency. Despite their advancements, challenges in balancing code snippet generation with effective test case generation and execution persist. To address these issues, this paper introduces Multi-Agent Assistant Code Generation (AgentCoder), a novel solution comprising a multi-agent framework with specialized agents: the programmer agent, the test designer agent, and the test executor agent. During the coding procedure, the programmer agent will focus on the code generation and refinement based on the test executor agent's feedback. The test designer agent will generate test cases for the generated code, and the test executor agent will run the code with the test cases and write the feedback to the programmer. This collaborative system ensures robust code generation, surpassing the limitations of single-agent models and traditional methodologies. Our extensive experiments on 9 code generation models and 12 enhancement approaches showcase AgentCoder's superior performance over existing code generation models and prompt engineering techniques across various benchmarks. For example, AgentCoder (GPT-4) achieves 96.3\% and 91.8\% pass@1 in HumanEval and MBPP datasets with an overall token overhead of 56.9K and 66.3K, while state-of-the-art obtains only 90.2\% and 78.9\% pass@1 with an overall token overhead of 138.2K and 206.5K.
hub tools
citation-role summary
citation-polarity summary
representative citing papers
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.
A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.
Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.
A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.
NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
citing papers explorer
-
Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints
Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.
-
FrontierSmith: Synthesizing Open-Ended Coding Problems at Scale
FrontierSmith automates synthesis of open-ended coding problems from closed-ended seeds and shows measurable gains on two open-ended LLM coding benchmarks.
-
ARIADNE: Agentic Reward-Informed Adaptive Decision Exploration via Blackboard-Driven MCTS for Competitive Program Generation
ARIADNE combines blackboard architecture with MCTS to coordinate strategy, code, test, evaluation, and repair stages, yielding higher Pass@1 scores than prior LLM baselines on APPS, CodeContests, and related benchmarks.
-
BIM Information Extraction Through LLM-based Adaptive Exploration
LLM adaptive exploration via runtime code execution outperforms static query generation for information extraction from heterogeneous BIM models on the new ifc-bench v2 benchmark.
-
Social Bias in LLM-Generated Code: Benchmark and Mitigation
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
-
Evaluating LLM Agents on Automated Software Analysis Tasks
A custom LLM agent achieves 94% manually verified success on a new benchmark of 35 software analysis setups, outperforming baselines at 77%, but struggles with stage mixing, error localization, and overestimating its own success.
-
HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?
HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.
-
ReCodeAgent: A Multi-Agent Workflow for Language-agnostic Translation and Validation of Large-scale Repositories
ReCodeAgent uses a multi-agent system to translate and validate large code repositories across multiple programming languages, achieving 60.8% higher test pass rates than prior neuro-symbolic and agentic methods on 118 real-world projects.
-
An Iterative Test-and-Repair Framework for Competitive Code Generation
FixAudit improves LLM code generation on competitive programming benchmarks by training a shared model for iterative code-aware test generation and repair, achieving 35%+ gains in Pass@1 over baselines on the same 7B model.
-
BACE: LLM-based Code Generation through Bayesian Anchored Co-Evolution of Code and Test Populations
BACE reformulates LLM code synthesis as Bayesian co-evolution of code and test populations anchored on minimal public examples, achieving superior performance on LiveCodeBench v6.
-
Software Self-Extension with SelfEvolve: an Agentic Architecture for Runtime Code Generation
SelfEvolve achieves 92.7% Pass@1 success on 11 runtime self-extension tasks and outperforms baselines like AutoGen by 61.8% with statistical significance.
-
SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios
SWE-EVO shows GPT-5.4 with OpenHands reaching only 25% success on complex multi-file evolution tasks versus 72.8% on SWE-Bench Verified, and introduces Fix Rate as a partial-progress metric.
-
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
MLE-bench evaluates frontier language models as ML engineering agents on 75 Kaggle competitions, with the top setup (o1-preview + AIDE) reaching bronze medal level in 16.9% of tasks.
-
LLM Agents can Autonomously Exploit One-day Vulnerabilities
GPT-4 LLM agents autonomously exploit 87% of tested one-day vulnerabilities when given CVE descriptions, far outperforming other models and tools.
-
Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework
TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.
-
AgentModernize: Preserving Business Logic in Legacy Modernization with Multi-Agent LLMs and Behavioral Specification Graphs
A multi-agent LLM framework with Behavioral Specification Graphs preserves business logic in legacy modernization, achieving non-zero mean BER on all tested scenarios where baseline LLM approaches scored zero.
-
Solvita: Enhancing Large Language Models for Competitive Programming via Agentic Evolution
Solvita is an agentic evolution system using Planner, Solver, Oracle, and Hacker agents with trainable graph knowledge networks updated by reinforcement learning on pass/fail and vulnerability signals to achieve SOTA code generation performance.
-
Conformal Agent Error Attribution
A new filtration-based conformal prediction method attributes errors in multi-agent systems by producing contiguous sequence sets with finite-sample coverage guarantees, enabling rollback recovery.
-
Tail-aware N-version Machine Learning Models for Reliable API Recommendation
NvRec profiles multiple API recommendation models on tail-API performance and applies majority voting with reliability filters to raise true accept rates while controlling rejection of uncertain outputs.
-
SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?
SAFEdit reaches 68.6% task success on EditBench code edits by using planner, editor, and verifier agents plus a failure abstraction layer, beating single-model and ReAct baselines.
-
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
-
You Don't Need Public Tests to Generate Correct Code
DryRUN lets LLMs create their own test inputs and run internal simulations for self-correcting code generation, matching the performance of test-dependent methods like CodeSIM on LiveCodeBench without public tests or external signals.
-
Explicit Trait Inference for Multi-Agent Coordination
ETI lets LLM agents infer and track partners' psychological traits (warmth and competence) from histories, cutting payoff loss 45-77% in games and boosting performance 3-29% on MultiAgentBench versus CoT baselines.
-
ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation
ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
-
MOSAIC: Multi-agent Orchestration for Task-Intelligent Scientific Coding
MOSAIC is a training-free multi-agent LLM framework with rationale, coding, reflection, and debugging agents plus a consolidated context window that outperforms prior methods on scientific coding benchmarks.
-
Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation
ACToR improves C-to-Rust translation correctness via an adversarial LLM-agent loop that generates differential fuzz tests to drive iterative refinements, achieving over 90% pass rates on 63 real-world utilities.
-
Effective LLM Code Refinement via Property-Oriented and Structurally Minimal Feedback
PGS generates property-oriented, structurally minimal feedback from high-level program properties to refine LLM code, yielding up to 13.4% pass@1 gains and 1.4-1.6x higher bug-fix rates than prior TDD and debugging baselines.
-
Mutation-Guided Unit Test Generation with a Large Language Model
MUTGEN incorporates mutation feedback into LLM prompts and uses iteration to generate unit tests that achieve higher mutation scores than EvoSuite or vanilla LLM prompting on 204 benchmark subjects.
-
Code as Agent Harness
A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.
-
A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback
A-ProS uses a hybrid multi-model feedback framework with stateful refinement to improve success rates on competitive programming problems, achieving over 2x gains compared to baseline agent loops.
-
Bridging the Gap between User Intent and LLM: A Requirement Alignment Approach for Code Generation
REA-Coder improves LLM code generation by iteratively aligning requirements with model understanding and verifying outputs against the aligned spec.
-
Code Semantic Zooming
CodeZoom is a pseudocode-based multi-layer abstraction tool that improves developer control and comprehension over LLM code generation compared to direct use of agents like Claude Code.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
Assessing, Exploiting, and Mitigating Syntactic Robustness Failures in LLM-Based Code Generation
LLM code generation lacks syntactic robustness on math-formula prompts, but formula-reduction pre-processing raises it from 54.05% to 74.42%.
-
Position: agentic AI orchestration should be Bayes-consistent
Agentic AI orchestration should apply Bayesian principles for belief maintenance, updating from interactions, and utility-based action selection.
-
A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence
The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.
-
Specification-Driven Code Translation Powered by Large Language Models: How Far Are We?
NL specifications alone do not improve LLM code translation performance, but combining them with source code yields gains in select language pairs with no overall consistent benefit.
-
Large Language Model-Based Agents for Software Engineering: A Survey
A literature survey that collects and categorizes 124 papers on LLM-based agents for software engineering from SE and agent perspectives.
-
LLM-Based Multi-Agent Systems for Code Generation: A Multi-Vocal Literature Review
A review of 114 studies classifies motivations into nine categories, analyzes common models and benchmarks, synthesizes challenges into six categories with 26 subcategories and solutions, and identifies six future research directions with 18 subcategories.
-
A Survey on Large Language Models for Code Generation
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.