{"work":{"id":"b186294a-cda7-4df0-9a28-27d379af92b2","openalex_id":null,"doi":null,"arxiv_id":"2503.13657","raw_key":null,"title":"Why Do Multi-Agent LLM Systems Fail?","authors":null,"authors_text":"Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari","year":2025,"venue":"cs.AI","abstract":"Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.","external_url":"https://arxiv.org/abs/2503.13657","cited_by_count":null,"metadata_source":"pith","metadata_fetched_at":"2026-05-25T05:55:24.710775+00:00","pith_arxiv_id":"2503.13657","created_at":"2026-05-09T22:34:07.732430+00:00","updated_at":"2026-06-05T21:23:00.469572+00:00","title_quality_ok":true,"display_title":"Why Do Multi-Agent LLM Systems Fail?","render_title":"Why Do Multi-Agent LLM Systems Fail?"},"hub":{"state":{"work_id":"b186294a-cda7-4df0-9a28-27d379af92b2","tier":"hub","tier_reason":"10+ Pith inbound or 1,000+ external citations","pith_inbound_count":68,"external_cited_by_count":null,"distinct_field_count":10,"first_pith_cited_at":"2025-07-28T17:55:08+00:00","last_pith_cited_at":"2026-05-22T09:24:12+00:00","author_build_status":"not_needed","summary_status":"needed","contexts_status":"needed","graph_status":"needed","ask_index_status":"not_needed","reader_status":"not_needed","recognition_status":"not_needed","updated_at":"2026-06-11T09:07:37.797449+00:00","tier_text":"hub"},"tier":"hub","role_counts":[{"context_role":"background","n":25}],"polarity_counts":[{"context_polarity":"background","n":23},{"context_polarity":"support","n":2}],"runs":{"context_extract":{"job_type":"context_extract","status":"succeeded","result":{"enqueued_papers":25},"error":null,"updated_at":"2026-05-14T18:29:32.123334+00:00"},"graph_features":{"job_type":"graph_features","status":"succeeded","result":{"co_cited":[{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":8},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":7},{"title":"Towards a Science of Scaling Agent Systems","work_id":"df1c366f-f1cf-469a-b862-5d59229b6b8f","shared_citers":6},{"title":"AI agents that matter","work_id":"07877e57-5393-47ee-ae5f-563ff8f9a6b2","shared_citers":4},{"title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","work_id":"eac74d79-d8d1-49dd-8565-53d713a84fff","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Large Language Model based Multi-Agents: A Survey of Progress and Challenges","work_id":"fb905249-ea5f-4765-80f0-2428ea66f15f","shared_citers":4},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":4},{"title":"OpenHands: An Open Platform for AI Software Developers as Generalist Agents","work_id":"f1762ea0-e382-4f38-a28c-adc643789859","shared_citers":4},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":4},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":4},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":3},{"title":"Autogen: Enabling next-gen llm applications via multi-agent conversations","work_id":"e57ce12a-7d16-4d21-a253-28bdb8094e1a","shared_citers":3},{"title":"Camel: Communicative agents for\" mind\" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008","work_id":"f0e6f682-8c56-41b2-8626-859078781db0","shared_citers":3},{"title":"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines","work_id":"d490f594-f5fc-47b0-ae6a-6550e50fe095","shared_citers":3},{"title":"Language agents as optimizable graphs","work_id":"7f01c57e-b214-4a9e-8c4a-eb17a7acc5b7","shared_citers":3},{"title":"Language agent tree search unifies reasoning acting and planning in language models","work_id":"810dbb8b-e954-4797-bf0d-5d8cad94e524","shared_citers":3},{"title":"Metagpt: Meta programming for a multi-agent collaborative framework","work_id":"c406797c-06b5-46e2-b568-86ecd25692f1","shared_citers":3},{"title":"Multi-Agent Collaboration Mechanisms: A Survey of LLMs","work_id":"34627f75-81fe-465e-bca5-d2a93270aa1d","shared_citers":3},{"title":"Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking","work_id":"d4c2a48d-ad9b-4f3b-91df-81f61d80ff64","shared_citers":3},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":3},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":3},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":3},{"title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering","work_id":"01826cd9-a652-403c-a2ec-531da9fe2b6a","shared_citers":3}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"error":null,"updated_at":"2026-05-14T18:29:45.676799+00:00"},"identity_refresh":{"job_type":"identity_refresh","status":"succeeded","result":{"items":[{"title":"Qwen3 Technical Report","outcome":"unchanged","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","resolver":"local_arxiv","confidence":0.98,"old_work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e"}],"counts":{"fixed":0,"merged":0,"unchanged":1,"quarantined":0,"needs_external_resolution":0},"errors":[],"attempted":1},"error":null,"updated_at":"2026-05-14T18:29:40.966414+00:00"},"summary_claims":{"job_type":"summary_claims","status":"succeeded","result":{"title":"Why Do Multi-Agent LLM Systems Fail?","claims":[{"claim_text":"Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of fail","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Why Do Multi-Agent LLM Systems Fail? because it crossed a citation-hub threshold.","role_counts":[]},"error":null,"updated_at":"2026-05-14T18:30:15.148628+00:00"}},"summary":{"title":"Why Do Multi-Agent LLM Systems Fail?","claims":[{"claim_text":"Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of fail","claim_type":"abstract","evidence_strength":"source_metadata"}],"why_cited":"Pith tracks Why Do Multi-Agent LLM Systems Fail? because it crossed a citation-hub threshold.","role_counts":[]},"graph":{"co_cited":[{"title":"AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation","work_id":"92b7eb9c-c3d8-4518-a376-06fa15dd895b","shared_citers":8},{"title":"SWE-bench: Can Language Models Resolve Real-World GitHub Issues?","work_id":"d0effe15-a689-441a-8e3f-ea35f1c4e4b1","shared_citers":7},{"title":"Towards a Science of Scaling Agent Systems","work_id":"df1c366f-f1cf-469a-b862-5d59229b6b8f","shared_citers":6},{"title":"AI agents that matter","work_id":"07877e57-5393-47ee-ae5f-563ff8f9a6b2","shared_citers":4},{"title":"ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate","work_id":"eac74d79-d8d1-49dd-8565-53d713a84fff","shared_citers":4},{"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","shared_citers":4},{"title":"Large Language Model based Multi-Agents: A Survey of Progress and Challenges","work_id":"fb905249-ea5f-4765-80f0-2428ea66f15f","shared_citers":4},{"title":"MetaGPT: Meta Programming for A Multi-Agent Collaborative Framework","work_id":"891b9780-a800-4e3c-bba0-53597ab8dc98","shared_citers":4},{"title":"OpenHands: An Open Platform for AI Software Developers as Generalist Agents","work_id":"f1762ea0-e382-4f38-a28c-adc643789859","shared_citers":4},{"title":"Voyager: An Open-Ended Embodied Agent with Large Language Models","work_id":"ffe0d207-86cf-4742-a100-e988ac8b9676","shared_citers":4},{"title":"WebArena: A Realistic Web Environment for Building Autonomous Agents","work_id":"7058ffd2-a339-4102-89eb-248eeb074652","shared_citers":4},{"title":"AgentBench: Evaluating LLMs as Agents","work_id":"a37549b4-4c94-412d-acc4-4efeb08509be","shared_citers":3},{"title":"Autogen: Enabling next-gen llm applications via multi-agent conversations","work_id":"e57ce12a-7d16-4d21-a253-28bdb8094e1a","shared_citers":3},{"title":"Camel: Communicative agents for\" mind\" exploration of large language model society.Advances in neural information processing systems, 36:51991–52008","work_id":"f0e6f682-8c56-41b2-8626-859078781db0","shared_citers":3},{"title":"DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines","work_id":"d490f594-f5fc-47b0-ae6a-6550e50fe095","shared_citers":3},{"title":"Language agents as optimizable graphs","work_id":"7f01c57e-b214-4a9e-8c4a-eb17a7acc5b7","shared_citers":3},{"title":"Language agent tree search unifies reasoning acting and planning in language models","work_id":"810dbb8b-e954-4797-bf0d-5d8cad94e524","shared_citers":3},{"title":"Metagpt: Meta programming for a multi-agent collaborative framework","work_id":"c406797c-06b5-46e2-b568-86ecd25692f1","shared_citers":3},{"title":"Multi-Agent Collaboration Mechanisms: A Survey of LLMs","work_id":"34627f75-81fe-465e-bca5-d2a93270aa1d","shared_citers":3},{"title":"Pro2guard: Proactive runtime enforcement of llm agent safety via probabilistic model checking","work_id":"d4c2a48d-ad9b-4f3b-91df-81f61d80ff64","shared_citers":3},{"title":"Qwen3 Technical Report","work_id":"25a4e30c-1232-48e7-9925-02fa12ba7c9e","shared_citers":3},{"title":"Reflexion: Language Agents with Verbal Reinforcement Learning","work_id":"778f739e-5f55-4961-8a2a-e4736a2757f4","shared_citers":3},{"title":"Self-Consistency Improves Chain of Thought Reasoning in Language Models","work_id":"8c6d5a6b-b5cc-4105-9c84-9c34bb9375bb","shared_citers":3},{"title":"SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering","work_id":"01826cd9-a652-403c-a2ec-531da9fe2b6a","shared_citers":3}],"time_series":[{"n":33,"year":2026}],"dependency_candidates":[]},"authors":[]}}