The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
hub Canonical reference
Why Do Multi-Agent LLM Systems Fail?
Canonical reference. 92% of citing Pith papers cite this work as background.
abstract
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
hub tools
citation-role summary
citation-polarity summary
claims ledger
- abstract Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of fail
co-cited works
roles
background 23representative citing papers
Six multi-agent architectures for LLM code generation on HumanEval form two complexity clusters separated by 50-130%, with no accuracy advantage for the complex cluster.
AgingBench demonstrates multi-dimensional degradation in deployed AI agents through four aging mechanisms diagnosed by temporal graphs and counterfactual probes across hundreds of runs.
Proposes EpG and OOI metrics showing agentic workflows use 4.33x more energy per successful goal than linear baselines due to orchestration structure.
Introduces the stochastic-deterministic boundary (SDB) as a load-bearing primitive for LLM agent runtimes and provides a five-step methodology plus catalog of six patterns adapted from distributed systems.
Evolutionary coding agents achieve most benchmark gains through a small subset of edit types and by cycling previously deleted code lines rather than developing new algorithmic structures.
AstraFlow decouples RL components into autonomous dataflow services to natively support multi-policy agentic LLM training, elastic scaling, and cross-region execution with 2.7x speedup on math, code, search, and AgentBench workloads.
A survey that unifies prior work on multi-agent LLM systems via the LIFE framework, mapping dependencies across collaboration, failure attribution, and autonomous self-evolution while identifying cross-stage challenges.
A span-decomposed evaluation framework for AI agents achieves state-of-the-art results on GAIA and SWE-Bench with up to 3.5x gains in localization accuracy by breaking traces into independent per-span judgments.
AgentLens reveals 10.7% of passing SWE-agent trajectories exhibit Lucky Pass behaviors and introduces a process-level evaluation framework with a new annotated dataset of 1,815 trajectories.
AgentForesight introduces an online auditor model that predicts decisive errors in multi-agent trajectories at the earliest step using a coarse-to-fine reinforcement learning recipe on a new curated dataset AFTraj-2K.
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-time distribution.
AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, scheduling, and debate.
GraphBit is a DAG-based engine-orchestrated framework for agentic LLMs that achieves 67.6% accuracy with zero hallucinations on GAIA benchmarks.
FORGE enforces security policies in agentic systems via Datalog over abstract predicates with an observability service and reference monitor that guarantees policy semantics when the environment contract holds.
A large-scale empirical study categorizes bugs in LLM agents and demonstrates that a specialized LLM agent can annotate them accurately at very low cost.
Anonymization in multi-agent debate reduces identity bias by equalizing self and peer weights in a Bayesian update model, quantified by the Identity Bias Coefficient.
Study of 930k+ agent PRs shows repository explains ~50% of integration friction variance, with agents concentrating it twice as much as humans (ICC 0.30 vs 0.16) after controls.
StepFinder turns execution logs into temporal semantic sequences via LLMs then uses temporal modeling plus attention to attribute failures to specific steps more accurately and 79% faster than direct LLM methods on the Who&When benchmark.
citing papers explorer
-
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
-
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
-
How to Interpret Agent Behavior
ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
-
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.