The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
hub Canonical reference
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Canonical reference. 96% of citing Pith papers cite this work as background.
abstract
Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-agent systems based on LLMs, as well as the challenges. Our goal is for readers to gain substantial insights on the following questions: What domains and environments do LLM-based multi-agents simulate? How are these agents profiled and how do they communicate? What mechanisms contribute to the growth of agents' capacities? For those interested in delving into this field of study, we also summarize the commonly used datasets or benchmarks for them to have convenient access. To keep researchers updated on the latest studies, we maintain an open-source GitHub repository, dedicated to outlining the research on LLM-based multi-agent systems.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Large Language Models (LLMs) have achieved remarkable success across a wide array of tasks. Due to the impressive planning and reasoning abilities of LLMs, they have been used as autonomous agents to do many tasks automatically. Recently, based on the development of using one LLM as a single planning or decision-making agent, LLM-based multi-agent systems have achieved considerable progress in complex problem-solving and world simulation. To provide the community with an overview of this dynamic field, we present this survey to offer an in-depth discussion on the essential aspects of multi-age
co-cited works
representative citing papers
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.
The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
Agentic Hives apply dynamic general equilibrium theory to variable populations of language-model agents, proving existence of equilibria, Pareto optimality, multiplicity, comparative-statics analogs, Hopf bifurcations, and stability conditions.
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
MALLM-GAN uses multi-agent LLMs to emulate GAN architecture for generating higher-quality synthetic tabular data from small samples than prior models, while preserving privacy.
TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.
LCGuard applies adversarial training to transform KV cache artifacts in multi-agent LLMs, reducing reconstructable sensitive information while preserving task performance.
LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.
BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite for file-level bug localization using agentic RAG, at 18x lower cost than baselines, and boosts end-to-end APR success by over 20%.
citing papers explorer
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
-
\textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer
MasFACT transfers historical topology priors across tasks via Fused Gromov-Wasserstein optimal transport and PAC-Bayes conservative adaptation to reduce topology forgetting in continual multi-agent settings.
-
Good Agentic Friends Do Not Just Give Verbal Advice: They Can Update Your Weights
TFlow enables multi-agent LLMs to collaborate via transient low-rank LoRA perturbations derived from sender activations, yielding up to 8.5 accuracy gains and 83% token reduction versus text-based baselines on Qwen3-4B models.
-
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
Successor-representation spectra of row-stochastic communication operators predict perturbation robustness, consensus speed, and error accumulation in multi-agent LLM topologies, with condition number showing perfect empirical rank correlation.
-
Evolving-RL: End-to-End Optimization of Experience-Driven Self-Evolving Capability within Agents
Evolving-RL jointly optimizes experience extraction and utilization in LLM agents via RL with separate evaluation signals, delivering up to 98.7% relative gains on out-of-distribution tasks in ALFWorld and Mind2Web.
-
Collective Alignment in LLM Multi-Agent Systems: Disentangling Bias from Cooperation via Statistical Physics
LLM multi-agent systems on lattices show bias-driven order-disorder crossovers instead of true phase transitions, with extracted effective couplings and fields serving as model-specific fingerprints.
-
TacoMAS: Test-Time Co-Evolution of Topology and Capability in LLM-based Multi-Agent Systems
TacoMAS performs test-time co-evolution of agent capabilities and communication topology in LLM multi-agent systems via fast capability updates and slow meta-LLM topology edits, delivering 13.3% average gains over strong baselines on four benchmarks.
-
TADI: Tool-Augmented Drilling Intelligence via Agentic LLM Orchestration over Heterogeneous Wellsite Data
TADI shows that domain-specialized tools orchestrated by an LLM over dual structured and semantic databases can convert heterogeneous wellsite data into evidence-grounded drilling intelligence, with tool design mattering more than model scale.
-
Exploring Agentic Visual Analytics: A Co-Evolutionary Framework of Roles and Workflows
A survey of 55 agentic VA systems proposes a co-evolutionary framework defining four agent roles (PLANNER, CREATOR, REVIEWER, CONTEXT MANAGER) mapped to visual analytics pipeline stages along with design guidelines.
-
WaterAdmin: Orchestrating Community Water Distribution Optimization via AI Agents
WaterAdmin uses a bi-level design with LLM agents for dynamic context abstraction and optimization for real-time pump/valve control, achieving better pressure reliability and lower energy use than traditional methods in EPANET simulations of variable community water demands.
-
SoK: Blockchain Agent-to-Agent Payments
The first systematization of blockchain-based agent-to-agent payments organizes designs into discovery, authorization, execution, and accounting stages while identifying trust and security gaps.
-
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
-
GraphBit: A Graph-based Agentic Framework for Non-Linear Agent Orchestration
GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
-
Agentic Hives: Equilibrium, Indeterminacy, and Endogenous Cycles in Self-Organizing Multi-Agent Systems
Agentic Hives apply dynamic general equilibrium theory to variable populations of language-model agents, proving existence of equilibria, Pareto optimality, multiplicity, comparative-statics analogs, Hopf bifurcations, and stability conditions.
-
GenCellAgent: Generalizable, Training-Free Cellular Image Segmentation via Large Language Model Agents
GenCellAgent deploys a planner-executor-evaluator LLM agent loop to automatically select, adapt, and refine segmentation tools for diverse cellular microscopy images, matching or exceeding specialist performance on 4,718 images across seven benchmarks while handling out-of-distribution and novel-ves
-
An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications
Empirical study of open-source AI agents shows testing effort concentrates on deterministic tools and workflows (over 70%) while the FM-based plan body gets under 5% and prompts appear in only 1% of tests.
-
From Standalone LLMs to Integrated Intelligence: A Survey of Compound Al Systems
A survey that defines Compound AI Systems, proposes a multi-dimensional taxonomy based on component roles and orchestration strategies, reviews four foundational paradigms, and identifies key challenges for future research.
-
Automated Design of Agentic Systems
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
-
MALLM-GAN: Multi-Agent Large Language Model as Generative Adversarial Network for Synthesizing Tabular Data
MALLM-GAN uses multi-agent LLMs to emulate GAN architecture for generating higher-quality synthetic tabular data from small samples than prior models, while preserving privacy.
-
Self-Refining Topology Optimization via an LLM-Based Multi-Agent Framework
TopOptAgents deploys six LLM agents in self-refining loops to automate the full topology optimization workflow and succeeds on problem classes where single LLMs fail.
-
LCGuard: Latent Communication Guard for Safe KV Sharing in Multi-Agent Systems
LCGuard applies adversarial training to transform KV cache artifacts in multi-agent LLMs, reducing reconstructable sensitive information while preserving task performance.
-
LACO: Adaptive Latent Communication for Collaborative Driving
LACO introduces Iterative Latent Deliberation, Cross-Horizon Saliency Attribution, and Structured Semantic Knowledge Distillation to enable low-latency latent communication in collaborative driving while preserving performance in CARLA simulations.
-
BLAgent: Agentic RAG for File-Level Bug Localization
BLAgent achieves over 78% Top-1 accuracy on SWE-bench Lite for file-level bug localization using agentic RAG, at 18x lower cost than baselines, and boosts end-to-end APR success by over 20%.
-
LEMON: Learning Executable Multi-Agent Orchestration via Counterfactual Reinforcement Learning
LEMON trains an LLM orchestrator with counterfactual-augmented GRPO to produce deployable multi-agent specifications that reach state-of-the-art results on six reasoning and coding benchmarks.
-
CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution
CANTANTE uses contrastive rollouts to attribute system rewards to individual agents, enabling better prompt optimization than prior methods on programming, math, and QA benchmarks.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Deterministic vs. LLM-Controlled Orchestration for COBOL-to-Python Modernization
Deterministic orchestration matches LLM-controlled methods in COBOL-to-Python translation accuracy but improves worst-case robustness, reduces run-to-run variability, and cuts token consumption by up to 3.5 times.
-
Why Does Agentic Safety Fail to Generalize Across Tasks?
Agentic safety fails to generalize across tasks because the task-to-safe-controller mapping has a higher Lipschitz constant than the task-to-controller mapping alone, as proven in linear-quadratic control and demonstrated in quadcopter and LLM experiments.
-
When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems
CAFE finds positive distributional Jensen Gaps across five multi-agent LLM architectures under semantic stress, showing that quality drops can coexist with detectable stress geometry compatible with antifragile learning.
-
Multi-Agent Empowerment and Emergence of Complex Behavior in Groups
A multi-agent extension of empowerment produces emergent group organizations in tendon-coupled agent pairs and controllable Vicsek flocks.
-
Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design
Mol-Debate applies multi-agent debate in an iterative loop with perspective orchestration to achieve state-of-the-art text-guided molecular design, scoring 59.82% exact match on ChEBI-20 and 50.52% weighted success on S2-Bench.
-
ThreadSumm: Summarization of Nested Discourse Threads Using Tree of Thoughts
ThreadSumm improves structured summarization of nested discourse threads by combining LLM-based aspect and content unit extraction with sentence ordering and Tree of Thoughts search for better coherence and opinion coverage.
-
GraphDC: A Divide-and-Conquer Multi-Agent System for Scalable Graph Algorithm Reasoning
GraphDC applies divide-and-conquer multi-agent LLM reasoning to graph algorithms by decomposing graphs into subgraphs for local agents and integrating via a master agent, outperforming direct methods especially on large scales.
-
Agentic Frameworks for Reasoning Tasks: An Empirical Study
An empirical evaluation of 22 agentic frameworks on BBH, GSM8K, and ARC benchmarks shows stable performance in 12 frameworks but highlights orchestration failures and weaker mathematical reasoning.
-
AgentClick: A Skill-Based Human-in-the-Loop Review Layer for Terminal AI Agents
AgentClick is a localhost npm server and skill-based plugin that connects terminal AI agents to a structured web UI for human review of plans, code execution, memory, and errors.
-
Numerical Instability and Chaos: Quantifying the Unpredictability of Large Language Models
Large language models display three universal scale-dependent regimes of behavior—stable, chaotic, and signal-dominated—driven by floating-point rounding errors that produce an avalanche effect in early layers.
-
In-situ process monitoring for defect detection in wire-arc additive manufacturing: an agentic AI approach
A multi-agent AI framework using processing and acoustic agents achieves 91.6% accuracy and 0.821 F1 score for in-situ porosity defect detection in wire-arc additive manufacturing.
-
Heterogeneous Consensus-Progressive Reasoning for Efficient Multi-Agent Debate
HCP-MAD reduces token costs in multi-agent debates by using heterogeneous consensus verification, adaptive pair-agent stopping, and escalated collective voting based on task complexity signals.
-
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
-
From Spark to Fire: Modeling and Mitigating Error Cascades in LLM-Based Multi-Agent Collaboration
A graph-based propagation model for error cascades in LLM multi-agent systems plus a genealogy-graph governance plugin that prevents final infection in at least 89% of runs across tested frameworks.
-
SoK: Agentic Skills -- Beyond Tool Use in LLM Agents
The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.
-
When Identity Overrides Incentives: Representational Choices as Governance Decisions in Multi-Agent LLM Systems
Role-based personas in multi-agent LLM systems suppress payoff-aligned behavior, shifting equilibrium selection by up to 90 percentage points in Tragedy of the Commons versus Green Transition scenarios even with full payoff information.
-
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowledge of malicious behaviors.
-
EXPEREPAIR: Dual-Memory Enhanced LLM-based Repository-Level Program Repair
ExpeRepair improves LLM-based repository-level program repair by maintaining episodic memory of concrete fixes and semantic memory of abstract insights, reaching 60.3% and 74.6% pass@1 on SWE-Bench Lite and Verified.
-
CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening
Presents CultivAgents, a relationship-centered multi-agent system for socio-culturally grounded gardening support, with a mixed-methods evaluation showing modest gains in gardener confidence and motivation.
-
Evaluating Temporal Semantic Caching and Workflow Optimization in Agentic Plan-Execute Pipelines
Temporal semantic caching and MCP workflow optimizations deliver 30.6x median speedup on cache hits and 1.67x overall speedup with 40% latency reduction on the AssetOpsBench industrial agent benchmark.
-
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
-
AblateCell: A Reproduce-then-Ablate Agent for Virtual Cell Repositories
AblateCell reproduces baselines in three single-cell perturbation repositories with 88.9% success and recovers ground-truth critical components with 93.3% accuracy via closed-loop ablation.
-
WebMAC: A Multi-Agent Collaborative Framework for Scenario Testing of Web Systems
WebMAC uses three specialized multi-agent modules to clarify test scenarios, partition them for adequacy, and generate executable scripts, yielding 30-60% higher success rates and 29% better efficiency than SOTA on four web systems.