The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
hub Canonical reference
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at https://aka.ms/magentic-one
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from paper descriptions.
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.
citing papers explorer
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
GraphFlow: A Graph-Based Workflow Management for Efficient LLM-Agent Serving
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
-
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
-
Agent Meltdowns: The Road to Hell Is Paved with Helpful Agents
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
-
State-Centric Decision Process
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
-
SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces
SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.
-
Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
-
From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
-
MolmoWeb: Open Visual Web Agent and Open Data for the Open Web
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
-
Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
-
AgentBound: Securing Execution Boundaries of AI Agents
AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
-
When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems
Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.
-
AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications
BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from paper descriptions.
-
CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
-
MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning
A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
-
EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
-
Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation
A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
-
AgentNLQ: A General-Purpose Agent for Natural Language to SQL
A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
-
Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
-
Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework
Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.
-
MARS: Multi-Agent Robotic System with Multimodal Large Language Models for Assistive Intelligence
MARS introduces a four-agent MLLM system for risk-aware planning and personalized assistance in home robotics, claiming superior performance over state-of-the-art multimodal models on multiple datasets.
-
A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems
A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.
-
AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System
AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.
- Human-Guided Harm Recovery for Computer Use Agents