The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
hub Canonical reference
Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks
Canonical reference. 88% of citing Pith papers cite this work as background.
abstract
Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at https://aka.ms/magentic-one
hub tools
citation-role summary
citation-polarity summary
representative citing papers
GraphFlow uses a unified wGraph to dynamically instantiate workflows and manage KV caches for LLM agents, reporting 4.95 pp average gains and 4x memory reduction on five benchmarks.
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
The paper defines accidental meltdowns as unsafe agent behavior triggered by benign errors and reports that such meltdowns occur in 64.7% of evaluated rollouts across GPT, Grok, and Gemini agents.
SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.
SkillSmith is a boundary-first compiler-runtime system that turns skill packages into minimal executable interfaces, cutting token usage 57%, thinking iterations 43%, and solve time 51% versus raw skill injection on SkillsBench.
LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5% relative improvement while processing traces in 2.66 seconds.
OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).
Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, with full release of data and code planned.
Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.
AgentBound is the first declarative access control framework for Model Context Protocol servers that generates policies from source code at 80.9% accuracy and blocks most threats in malicious servers with negligible overhead.
Introduces EPC-AW to mitigate epistemic miscalibration in LLM multi-agent planning via consistency-based selection and refinement, reporting 9.75% average success improvement.
AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from paper descriptions.
CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.
A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.
EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.
A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
A multi-agent LLM framework with schema enrichment and business rules achieves 78.1% semantic accuracy on the BIRD NL2SQL benchmark.
RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.
Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.