super hub Canonical reference

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

Ceyao Zhang, Jiaqi Chen, Mingchen Zhuge, Sirui Hong, Xiawu Zheng, Yuheng Cheng · 2023 · cs.AI · arXiv 2308.00352

Canonical reference. 92% of citing Pith papers cite this work as background.

121 Pith papers citing it

Background 92% of classified citations

open full Pith review browse 121 citing papers more from Ceyao Zhang arXiv PDF

abstract

Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for more streamlined workflows, thus allowing agents with human-like domain expertise to verify intermediate results and reduce errors. MetaGPT utilizes an assembly line paradigm to assign diverse roles to various agents, efficiently breaking down complex tasks into subtasks involving many agents working together. On collaborative software engineering benchmarks, MetaGPT generates more coherent solutions than previous chat-based multi-agent systems. Our project can be found at https://github.com/geekan/MetaGPT

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 35 method 2 dataset 1 other 1

citation-polarity summary

background 36 unclear 1 use dataset 1 use method 1

claims ledger

abstract Remarkable progress has been made on automated problem solving through societies of agents based on large language models (LLMs). Existing LLM-based multi-agent systems can already solve simple dialogue tasks. Solutions to more complex tasks, however, are complicated through logic inconsistencies due to cascading hallucinations caused by naively chaining LLMs. Here we introduce MetaGPT, an innovative meta-programming framework incorporating efficient human workflows into LLM-based multi-agent collaborations. MetaGPT encodes Standardized Operating Procedures (SOPs) into prompt sequences for mor

authors

Ceyao Zhang Jiaqi Chen Mingchen Zhuge Sirui Hong Xiawu Zheng Yuheng Cheng

co-cited works

representative citing papers

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

ExCyTIn-Bench: Evaluating LLM agents on Cyber Threat Investigation

cs.CR · 2025-07-14 · unverdicted · novelty 8.0

ExCyTIn-Bench is the first benchmark of 7542 questions from Microsoft Sentinel threat investigation graphs, where the best LLM agent achieves a reward of 0.606.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

cs.CL · 2026-05-21 · unverdicted · novelty 7.0 · 2 refs

Boiling the Frog is a new stateful multi-turn benchmark that finds an aggregate 44.4% strict attack success rate for incremental safety violations across nine AI models, with rates ranging from 20.5% to 92.9%.

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 7.0

Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.

DecisionBench: A Benchmark for Emergent Delegation in Long-Horizon Agentic Workflows

cs.AI · 2026-05-18 · unverdicted · novelty 7.0

DecisionBench supplies a fixed task suite, model pool, delegation interface, and multi-axis metrics to evaluate emergent delegation, showing similar quality across awareness conditions but 15-31 point headroom under perfect delegation.

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.

Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics

cond-mat.stat-mech · 2026-05-11 · unverdicted · novelty 7.0

LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.

TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents

cs.CY · 2026-05-11 · unverdicted · novelty 7.0

TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.

MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.

Social Bias in LLM-Generated Code: Benchmark and Mitigation

cs.SE · 2026-05-01 · unverdicted · novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.

Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.

Symbolic Execution Meets Multi-LLM Orchestration: Detecting Memory Vulnerabilities in Incomplete Rust CVE Snippets

cs.CR · 2026-04-28 · unverdicted · novelty 7.0

A 4-agent LLM orchestration with KLEE symbolic execution generates harnesses for incomplete Rust CVE snippets, achieving 90.3% compilation success and detecting 1206 errors across 26 of 31 files versus far lower rates from Clippy and Miri.

Constraint-Guided Multi-Agent Decompilation for Executable Binary Recovery

cs.SE · 2026-04-27 · unverdicted · novelty 7.0

A constraint-guided multi-agent system turns raw decompiler output into re-executable code at 84-97% success rates, outperforming prior LLM decompilation methods on real binaries.

ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework

cs.CR · 2026-04-25 · unverdicted · novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.

Synthesizing Multi-Agent Harnesses for Vulnerability Discovery

cs.CR · 2026-04-22 · unverdicted · novelty 7.0

AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.

Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

cs.AI · 2026-04-22 · unverdicted · novelty 7.0

MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.

Weak-Link Optimization for Multi-Agent Reasoning and Collaboration

cs.AI · 2026-04-17 · unverdicted · novelty 7.0

WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.

Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities

cs.CR · 2026-04-15 · conditional · novelty 7.0

LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained limited.

Detecting Multi-Agent Collusion Through Multi-Agent Interpretability

cs.AI · 2026-04-01 · conditional · novelty 7.0

NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.

Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?

cs.AI · 2026-03-26 · conditional · novelty 7.0

An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.

What Do AI Agents Talk About? Discourse and Architectural Constraints in the First AI-Only Social Network

cs.CL · 2026-03-09 · unverdicted · novelty 7.0

Discourse among AI agents on Moltbook is largely determined by architectural constraints like context windows and identity files, appearing as social learning but actually short-horizon contextual conditioning.

citing papers explorer

Showing 34 of 34 citing papers after filters.

EVOCHAMBER: Test-Time Co-evolution of Multi-Agent System at Individual, Team, and Population Scales cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
EVOCHAMBER enables test-time co-evolution of multi-agent systems across three scales, producing emergent niche specialists and performance gains of up to 32% relative on math tasks with Qwen3-8B.
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics cond-mat.stat-mech · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents cs.CY · 2026-05-11 · unverdicted · none · ref 32 · internal anchor
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
MOTOR-Bench: A Real-world Dataset and Multi-agent Framework for Zero-shot Human Mental State Understanding cs.CV · 2026-05-10 · unverdicted · none · ref 25 · internal anchor
MOTOR-Bench supplies a real-world video dataset for structured mental state understanding in learning settings, while MOTOR-MAS improves zero-shot prediction of behavior, cognition, and emotion labels over single models and other multi-agent systems.
RepoDoc: A Knowledge Graph-Based Framework to Automatic Documentation Generation and Incremental Updates cs.SE · 2026-04-29 · unverdicted · none · ref 15 · internal anchor
RepoDoc uses a repository knowledge graph with module clustering and semantic impact propagation to generate more complete documentation 3x faster with 85% fewer tokens and handle incremental updates 73% faster than prior LLM-based tools.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework cs.CR · 2026-04-25 · unverdicted · none · ref 107 · internal anchor
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery cs.CR · 2026-04-22 · unverdicted · none · ref 20 · internal anchor
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new zero-days in Chrome including two critical sandbox escapes.
Weak-Link Optimization for Multi-Agent Reasoning and Collaboration cs.AI · 2026-04-17 · unverdicted · none · ref 24 · internal anchor
WORC improves multi-agent LLM reasoning to 82.2% average accuracy by predicting and compensating for the weakest agent via targeted extra sampling rather than uniform reinforcement.
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems cs.MA · 2025-06-05 · accept · none · ref 57 · internal anchor
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
WebGen-R1: Incentivizing Large Language Models to Generate Functional and Aesthetic Websites with Reinforcement Learning cs.CL · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
WebGen-R1 uses end-to-end RL with scaffold-driven generation and cascaded rewards for structure, function, and aesthetics to transform a 7B model into a generator of deployable multi-page websites that rivals much larger models.
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks cs.AI · 2026-04-22 · unverdicted · none · ref 13 · internal anchor
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing rankings between MCQ and LLM-judge scoring.
AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum cs.AI · 2026-04-20 · unverdicted · none · ref 8 · internal anchor
AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specific scheduling.
PolySwarm: A Multi-Agent Large Language Model Framework for Prediction Market Trading and Latency Arbitrage cs.AI · 2026-04-04 · unverdicted · none · ref 45 · internal anchor
PolySwarm aggregates predictions from 50 LLM personas for Polymarket trading using Bayesian combination and divergence metrics, outperforming single models in calibration while adding latency arbitrage via CEX price models.
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing cs.DC · 2026-04-03 · unverdicted · none · ref 9 · internal anchor
TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis cs.AI · 2025-07-28 · unverdicted · none · ref 44 · internal anchor
GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming prior methods.
A Survey on Large Language Model based Autonomous Agents cs.AI · 2023-08-22 · accept · none · ref 23 · internal anchor
A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future directions.
Intermediate Artifacts as First-Class Citizens: A Data Model for Durable Intermediate Artifacts in Agentic Systems cs.AI · 2026-05-12 · unverdicted · none · ref 5 · internal anchor
A systems-level data model for preserving typed, addressable, versioned, and dependency-aware intermediate artifacts in agentic AI systems to improve long-term inspectability and maintainability.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability cs.AI · 2026-05-11 · unverdicted · none · ref 8 · internal anchor
A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work cs.AI · 2026-05-07 · conditional · none · ref 13 · internal anchor
Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.
From Perception to Autonomous Computational Modeling: A Multi-Agent Approach cs.CE · 2026-04-08 · unverdicted · none · ref 20 · internal anchor
A multi-agent LLM framework autonomously completes the full computational mechanics pipeline from a photograph to a code-compliant engineering report on a steel L-bracket example.
Beyond Autonomy: A Dynamic Tiered AgentRunner Framework for Governable and Resilient Enterprise AI Execution cs.AI · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
The Dynamic Tiered AgentRunner framework uses risk-adaptive tiering, separation of powers across agents, and verifier-recovery loops to enable governable and resilient enterprise AI execution.
AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System cs.RO · 2026-05-09 · unverdicted · none · ref 2 · internal anchor
AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.
Terminus-4B: Can a Smaller Model Replace Frontier LLMs at Agentic Execution Tasks? cs.AI · 2026-05-04 · unverdicted · none · ref 12 · internal anchor
A fine-tuned 4B model matches or exceeds frontier LLMs in terminal execution subagent tasks for coding agents, reducing main agent token usage by 30% with no performance loss.
From LLM Reasoning to Autonomous AI Agents: A Comprehensive Review cs.AI · 2025-04-28 · accept · none · ref 47 · internal anchor
A survey consolidating benchmarks, agent frameworks, real-world applications, and protocols for LLM-based autonomous agents into a proposed taxonomy with recommendations for future research.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems cs.LG · 2026-01-20 · unverdicted · none · ref 68 · internal anchor
A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
A Survey of Scaling in Large Language Model Reasoning cs.AI · 2025-04-02 · unverdicted · none · ref 61 · internal anchor
A survey categorizing scaling in LLM reasoning across input size, steps, rounds, training, and future directions, noting that scaling can negatively affect performance.
A Survey on Large Language Models for Code Generation cs.CL · 2024-06-01 · unverdicted · none · ref 100 · internal anchor
A systematic literature review that organizes recent work on LLMs for code generation into a taxonomy covering data curation, model advances, evaluations, ethics, environmental impact, and applications, with benchmark comparisons.
A Survey on the Memory Mechanism of Large Language Model based Agents cs.AI · 2024-04-21 · accept · none · ref 112 · internal anchor
A systematic review of memory designs, evaluation methods, applications, limitations, and future directions for LLM-based agents.
Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review cs.SE · 2026-04-10 · unverdicted · none · ref 24 · internal anchor
A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprepared for deployable fair systems.
A Comprehensive Overview of Large Language Models cs.CL · 2023-07-12 · unverdicted · none · ref 229 · internal anchor
A survey paper providing an overview of Large Language Models, their background, and recent advances in the field.
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows cs.AI · 2026-05-12 · unreviewed · ref 8 · internal anchor
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt-Engineering Quality Assurance cs.SE · 2026-05-12 · unreviewed · ref 19 · internal anchor
Shepherd: Enabling Programmable Meta-Agents via Reversible Agentic Execution Traces cs.AI · 2026-05-11 · unreviewed · ref 10 · internal anchor
A Comprehensive Survey on Agent Skills: Taxonomy, Techniques, and Applications cs.IR · 2026-05-08 · unreviewed · ref 6 · 2 links · internal anchor

MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer