The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Llm-coordination: evaluating and analyzing multi-agent coordination abilities in large language models
6 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost-quality Pareto frontier.
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
Strategic agents can achieve high-harm outcomes via low-capacity channels by concentrating residual capacity on high-impact predicates of confidential data, so leakage bounds need not bound worst-case harm.
citing papers explorer
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost-quality Pareto frontier.
-
Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory
Evo-Memory is a new streaming benchmark and evaluation framework for self-evolving memory in LLM agents, unifying over ten memory modules and introducing the ReMem pipeline for continual improvement on multi-turn and reasoning datasets.
-
VS-Bench: Evaluating VLMs for Strategic Abilities in Multi-Agent Environments
VS-Bench is a new benchmark of ten visual multi-agent environments that measures VLMs on element recognition, next-action prediction, and normalized episode return, showing strong perception but large gaps in reasoning and decision-making with the best model at 46.6% prediction accuracy and 31.4% of
-
A Note on the Strategic Confinement Problem
Strategic agents can achieve high-harm outcomes via low-capacity channels by concentrating residual capacity on high-impact predicates of confidential data, so leakage bounds need not bound worst-case harm.