pith. machine review for the scientific record.

arxiv: 2411.04468 · v1 · submitted 2024-11-07 · 💻 cs.AI · cs.MA

Recognition: 2 theorem links

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:45 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent systems · AI agents · orchestrator · agentic benchmarks · GAIA · WebArena · generalist agents

The pith

A multi-agent system led by an Orchestrator achieves statistically competitive performance on complex agentic benchmarks without modifying core agent capabilities or how the agents collaborate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Magentic-One uses a lead Orchestrator agent to plan, track progress, and direct specialized agents for tasks like web browsing, file navigation, and code execution. This multi-agent setup delivers performance that is statistically competitive with the state of the art on three challenging benchmarks: GAIA, AssistantBench, and WebArena. The system requires no changes to the agents' core capabilities or collaboration methods to achieve these results across diverse tasks. Its modular architecture allows agents to be added or removed without prompt tuning or training, which supports easier development and adaptation to new scenarios. An open-source implementation and the AutoGenBench evaluation tool are provided to facilitate further research.

Core claim

Magentic-One is a generalist multi-agent system featuring an Orchestrator that plans and replans tasks while directing specialized agents to execute subtasks involving web browsers, local files, and Python code. It attains statistically competitive performance to existing state-of-the-art systems on the GAIA, AssistantBench, and WebArena benchmarks without any modifications to how the agents operate or interact. The design demonstrates progress toward generalist agentic systems by maintaining effectiveness across varied scenarios through modularity rather than specialization.

What carries the argument

The Orchestrator agent, which plans tasks, tracks progress, recovers from errors through replanning, and delegates to specialized agents for web operation, file navigation, and code writing and execution.
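The delegation pattern this describes can be sketched as a minimal loop: plan entries are matched to agents by capability, results are recorded in a ledger, and a missing capability triggers the replanning path. The agent names, skill tags, and methods below are illustrative stand-ins, not Magentic-One's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """A specialized worker the Orchestrator can delegate to (illustrative)."""
    name: str
    skills: set

    def run(self, subtask: str) -> str:
        # A real agent would operate a browser, read files, or execute code here.
        return f"{self.name} completed: {subtask}"

@dataclass
class Orchestrator:
    """Plans subtasks, tracks progress in a ledger, and replans on failure."""
    agents: list
    ledger: dict = field(default_factory=dict)

    def pick_agent(self, skill: str) -> Agent:
        return next(a for a in self.agents if skill in a.skills)

    def solve(self, plan) -> dict:
        for skill, subtask in plan:
            try:
                self.ledger[subtask] = self.pick_agent(skill).run(subtask)
            except StopIteration:
                # Error recovery: no agent covers the skill, so flag for replanning.
                self.ledger[subtask] = f"REPLAN: no agent covers '{skill}'"
        return self.ledger

team = [Agent("WebSurfer", {"web"}), Agent("Coder", {"code"})]
ledger = Orchestrator(team).solve(
    [("web", "find dataset"), ("code", "compute statistics")]
)
```

The ledger doubles as the progress tracker: any entry flagged for replanning tells the Orchestrator which subtask to reassign or decompose next.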

If this is right

  • The system integrates web, file, and code capabilities into unified task solving.
  • Competitive benchmark results hold without task-specific agent adjustments.
  • Agents can be swapped in or out to extend functionality without retraining.
  • Rigorous evaluation is supported by AutoGenBench with controls for repetition and isolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems may accelerate development of AI for new applications by allowing plug-in agents for specific domains.
  • Real-world deployment could benefit from the error-recovery mechanisms in long-running tasks.
  • Testing on additional benchmarks involving physical or multi-modal actions would reveal scalability limits.

Load-bearing premise

The modular multi-agent design with an orchestrator allows agents to be added or removed without additional prompt tuning or training while maintaining performance across tasks.

What would settle it

A demonstration that adding or removing a specialized agent requires prompt tuning or training to preserve competitive performance on the benchmarks would falsify the modularity claim.
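That falsification test reduces to a before/after comparison: run the benchmark with the full team, remove one agent while leaving every prompt untouched, and rerun. The toy model below assumes task success hinges only on skill coverage; the skill labels and success model are invented, and a real run would use GAIA, AssistantBench, or WebArena.

```python
import random

def run_benchmark(team, tasks: int = 200) -> float:
    """Toy stand-in for one benchmark run: a task succeeds iff some agent on
    the team covers the skill it needs (seeded for reproducibility)."""
    rng = random.Random(0)
    covered = set().union(*team) if team else set()
    hits = sum(rng.choice(["web", "code", "files"]) in covered
               for _ in range(tasks))
    return hits / tasks

full_team = [{"web"}, {"code"}, {"files"}]
ablated = [{"web"}, {"code"}]  # drop the file-surfer; prompts untouched

full_rate = run_benchmark(full_team)   # 1.0 by construction
ablated_rate = run_benchmark(ablated)  # loses only the file-bound tasks
```

If restoring competitive performance after re-adding the agent required prompt tuning or training rather than the bare team change, the modularity claim would be falsified.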

read the original abstract

Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at https://aka.ms/magentic-one

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Magentic-One, a multi-agent system with an Orchestrator that plans, tracks progress, and directs specialized agents for web browsing, file navigation, and Python code execution. It reports statistically competitive performance to the state-of-the-art on the GAIA, AssistantBench, and WebArena benchmarks without core modifications to agent capabilities or collaboration mechanisms, while emphasizing a modular design that permits adding or removing agents without additional prompt tuning or training. The work also releases an open-source implementation and AutoGenBench, a tool for controlled, isolated agentic evaluations with built-in repetition support.

Significance. If the benchmark results are statistically robust, the work advances generalist agentic systems by demonstrating a flexible, extensible multi-agent architecture that maintains performance across diverse tasks. The open-source release and AutoGenBench tool strengthen reproducibility and provide a practical evaluation framework for future agent research.

major comments (2)
  1. [§4 (Experimental Results)] The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, the number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added, with tables showing means, standard deviations, and p-values.
  2. [§3.2 (Modular Architecture)] The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.
minor comments (3)
  1. [§1 (Introduction)] Add direct citations to the original GAIA, AssistantBench, and WebArena papers when first describing each benchmark.
  2. [Figure 1 (System Overview)] The diagram should include a legend clarifying the direction of control messages versus data flow between the Orchestrator and specialized agents.
  3. [AutoGenBench description] Specify the exact isolation mechanisms (e.g., containerization or sandboxing) used to prevent side-effects during repeated benchmark runs.
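The isolation the last comment asks about is commonly achieved with throwaway containers: every repetition starts from the same clean image, and the container is discarded afterwards so side-effects cannot leak between runs. A generic sketch, assuming a hypothetical `agent-image` and `run-task` entry point; this is not AutoGenBench's actual CLI:

```python
import subprocess

def isolated_cmd(task_id: str, repetition: int) -> list:
    """Container invocation for one benchmark attempt (generic sketch, not
    AutoGenBench's actual interface). `--rm` discards the container afterwards;
    `--network none` blocks outbound traffic for tasks that must stay offline."""
    return ["docker", "run", "--rm", "--network", "none",
            "agent-image", "run-task", task_id, "--seed", str(repetition)]

def run_isolated(task_id: str, repetition: int) -> subprocess.CompletedProcess:
    return subprocess.run(isolated_cmd(task_id, repetition),
                          capture_output=True, text=True, timeout=600)

# Repetition with isolation: each attempt begins from the same clean image,
# so per-task success rates can carry honest error bars.
# results = [run_isolated("gaia_task_017", rep) for rep in range(3)]
```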

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript to incorporate the requested details and evidence.

read point-by-point responses
  1. Referee: [§4 (Experimental Results)] The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added with tables showing means, standard deviations, and p-values.

    Authors: We agree that explicit statistical details are needed to fully support the performance claims. The experiments underlying our results were conducted with 3 independent runs per benchmark to capture variability. In the revised manuscript, we will update Section 4 with tables that report means and standard deviations, include error bars in figures, explicitly state the number of runs, and provide the outcomes of paired t-tests (including p-values) for comparisons against baselines. revision: yes

  2. Referee: [§3.2 (Modular Architecture)] The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.

    Authors: We appreciate the suggestion to strengthen the empirical support for modularity. While the current results use the full agent team and the architecture is designed for easy addition/removal without prompt changes, we acknowledge that a dedicated ablation would provide clearer evidence. In the revision, we will add an ablation study (new table in Section 3.2) reporting GAIA success rates for the full team versus modified teams (e.g., without the CodeExecutor or WebSurfer agent), with no prompt or training adjustments. revision: yes
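The statistics the referee requests and the rebuttal promises (3 independent runs, means, standard deviations, paired tests) can be sketched as follows. The success rates are invented for illustration, not the paper's results; in practice `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon` would supply the p-values on top of the t statistic computed here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys) -> float:
    """Paired t statistic on matched per-run scores; this is the quantity
    scipy.stats.ttest_rel computes before attaching a p-value."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical per-run success rates over 3 independent runs
# (illustrative numbers, not the paper's actual results).
magentic = [0.38, 0.41, 0.36]
baseline = [0.35, 0.37, 0.33]

t_stat = paired_t(magentic, baseline)
print(f"mean={mean(magentic):.3f}  sd={stdev(magentic):.3f}  t={t_stat:.2f}")
```

With only 3 runs the test has 2 degrees of freedom, so even a large t statistic carries wide uncertainty; this is why the referee asks for the run count to be stated explicitly.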

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical system description and benchmark evaluation rather than any mathematical derivation chain. Performance claims rest on direct measurements against external public benchmarks (GAIA, AssistantBench, WebArena) using the provided AutoGenBench tool for controlled repetition; no equations, fitted parameters, or predictions are derived from internal data. The modular orchestrator architecture is described at the implementation level without self-definitional reductions, self-citation load-bearing premises, or ansatz smuggling. All load-bearing steps are externally falsifiable via the open-source release and benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on existing large foundation models and public benchmarks; no new mathematical axioms, free parameters fitted inside the paper, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5698 in / 997 out tokens · 19909 ms · 2026-05-16T18:45:15.680308+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LedgerCanonicality ZeroParameterComparisonLedger · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code.

  • HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  3. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  4. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...

  5. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...

  6. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  7. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  8. Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

    cs.HC 2026-01 unverdicted novelty 7.0

    Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

  9. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  10. AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

    cs.CL 2026-05 unverdicted novelty 6.0

    AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.

  11. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 6.0

    LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.

  12. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  13. BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications

    cs.HC 2026-04 unverdicted novelty 6.0

    BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from p...

  14. Human-Guided Harm Recovery for Computer Use Agents

    cs.AI 2026-04 conditional novelty 6.0

    Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.

  15. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  16. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.

  17. EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

    cs.AI 2025-12 unverdicted novelty 6.0

    EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.

  18. Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation

    cs.CR 2025-12 unverdicted novelty 6.0

    A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.

  19. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.

  20. Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

    cs.CL 2025-11 unverdicted novelty 5.0

    Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

  21. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  22. AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System

    cs.RO 2026-05 unverdicted novelty 4.0

    AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    Abuelsaad, D

    T. Abuelsaad, D. Akkil, P. Dey, A. Jagmohan, A. Vempaty, and R. Kokku. Agent-e: From autonomous web navigation to foundational design principles in agentic systems, 2024

  2. [2]

    Github — babyagi

    BabyAGI. Github — babyagi. https://github.com/yoheinakajima/babyagi, 2023

  3. [3]

    Bonatti, D

    R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker, L. Jang, and Z. Hui. Windows agent arena: Evaluating multi-modal os agents at scale, 2024

  4. [4]

    R. Cao, F. Lei, H. Wu, J. Chen, Y. Fu, H. Gao, X. Xiong, H. Zhang, Y. Mao, W. Hu, T. Xie, H. Xu, D. Zhang, S. Wang, R. Sun, P. Yin, C. Xiong, A. Ni, Q. Liu, V. Zhong, L. Chen, K. Yu, and T. Yu. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

  5. [5]

    Z. Chen, M. White, R. Mooney, A. Payani, Y. Su, and H. Sun. When is tree search useful for llm planning? it depends on the discriminator, 2024

  6. [6]

    arXiv preprint arXiv:2401.03428

    Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao, et al. Exploring large language model based intelligent agents: Definitions, meth- ods, and prospects. arXiv preprint arXiv:2401.03428 , 2024

  7. [7]

    D’Arcy, T

    M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 , 2024

  8. [8]

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web, 2023

  9. [9]

    Dibia, A

    V. Dibia, A. Fourney, G. Bansal, F. Poursabzi-Sangdeh, H. Liu, and S. Amershi. Aligning offline metrics and human judgments of value for code generation models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023 , pages 8516–8528, Toronto, Canada, July 2023. Association for Computatio...

  10. [10]

    Drouin, M

    A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024

  11. [11]

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

  12. [12]

    B. J. Grosz and S. Kraus. The evolution of sharedplans. In Proceedings of the International Conference on Multi-Agent Systems , 1999

  13. [13]

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024

  14. [14]

    H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024. 21

  15. [15]

    S. Hong, X. Zheng, J. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  16. [16]

    N. R. Jennings and M. Wooldridge. Applications of intelligent agents. In Proceedings of the International Conference on Autonomous Agents , 1998

  17. [17]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe- bench: Can language models resolve real-world github issues?, 2024

  18. [18]

    Kapoor, B

    S. Kapoor, B. Stroebl, Z. S. Siegel, N. Nadgir, and A. Narayanan. Ai agents that matter, 2024

  19. [19]

    J. Y. Koh, S. McAleer, D. Fried, and R. Salakhutdinov. Tree search for language model agents, 2024

  20. [20]

    Li and J

    E. Li and J. Waldo. Websuite: Systematically evaluating why web agents fail. arXiv preprint arXiv:2406.01623, 2024

  21. [21]

    G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Commu- nicative agents for ”mind” exploration of large scale language model society, 2023

  22. [22]

    Liang, Z

    T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, Z. Tu, and S. Shi. Encouraging divergent thinking in large language models through multi-agent debate, 2023

  23. [23]

    J. Liu, Y. Song, B. Y. Lin, W. Lam, G. Neubig, Y. Li, and X. Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024

  24. [24]

    N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui. From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv e-prints, pages arXiv–2401, 2024

  25. [25]

    Y. Liu, S. K. Lo, Q. Lu, L. Zhu, D. Zhao, X. Xu, S. Harrer, and J. Whittle. Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents. arXiv preprint arXiv:2405.10467 , 2024

  26. [26]

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 , 2024

  27. [27]

    The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

    T. Masterman, S. Besen, M. Sawtell, and A. Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584, 2024

  28. [28]

    B. Messing. An introduction to multiagent systems. K¨ unstliche Intell., 17:58–, 2002

  29. [29]

    Mialon, C

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants, 2023

  30. [30]

    GAIA: a benchmark for General AI Assistants

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: benchmark for general ai assistants. arXiv preprint arXiv:2311.12983 , 2023

  31. [31]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021

  32. [32]

    N. J. Nilsson. Stuart russell and peter norvig, artificial intelligence: A modern approach. Artificial Intelligence, 82:369–380, 1996. 22

  33. [33]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  34. [34]

    J. Pan, Y. Zhang, N. Tomlin, Y. Zhou, S. Levine, and A. Suhr. Autonomous evaluation and refinement of digital agents, 2024

  35. [35]

    Y. Pan, D. Kong, S. Zhou, C. Cui, Y. Leng, B. Jiang, H. Liu, Y. Shang, S. Zhou, T. Wu, and Z. Wu. Webcanvas: Benchmarking web agents in online environments, 2024

  36. [36]

    ART: Automatic multi-step reasoning and tool-use for large language models

    B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

  37. [37]

    J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023

  38. [38]

    D. Paul, M. Ismayilzada, M. Peyrard, B. Borges, A. Bosselut, R. West, and B. Faltings. RE- FINER: Reasoning feedback on intermediate representations. In Y. Graham and M. Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1100–1126, St. Julian’s, Malta...

  39. [39]

    Putta, E

    P. Putta, E. Mills, N. Garg, S. Motwani, C. Finn, D. Garg, and R. Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024

  40. [40]

    Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, Y. Huang, C. Xiao, C. Han, Y. R. Fung, Y. Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y. Ye, B. Li, Z. Tang, J. Yi, Y. Zhu, Z. Dai, L. Yan, X. Cong, Y. Lu, W. Zhao, Y. Huang, J. Yan, X. Han, X. Sun, D. Li, J. Phang, C. Yang, T. Wu, H. Ji, Z. Liu, and M. Sun. Tool lear...

  41. [41]

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023

  42. [42]

    Trase tops gaia leaderboard, 2024

    Red Cell Partners. Trase tops gaia leaderboard, 2024

  43. [43]

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  44. [44]

    Trail of Bits Blog

    M. Russinovich, A. Salem, and R. Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833 , 2024

  45. [45]

    Scerri, D

    P. Scerri, D. V. Pynadath, and M. Tambe. Adjustable autonomy in real-world multi-agent environments. In International Conference on Autonomous Agents , 2001

  46. [46]

    Schick, J

    T. Schick, J. Dwivedi-Yu, R. Dess` ı, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools, 2023

  47. [47]

    T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning. PMLR, 2017

  48. [48]

    Shinn, F

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. 23

  49. [49]

    Sodhi, S

    P. Sodhi, S. R. K. Branavan, Y. Artzi, and R. McDonald. Step: Stacked llm policies for web actions, 2024

  50. [50]

    Y. Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y. Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024

  51. [51]

    Stone and M

    P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Auton. Robots, 8(3):345–383, June 2000

  52. [52]

    Talebirad and A

    Y. Talebirad and A. Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023

  53. [53]

    M. Tambe. Implementing agent teams in dynamic multiagent environments. Appl. Artif. Intell., 12:189–210, 1998

  54. [54]

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023

  55. [55]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Opendevin: An open platform for ai software developers as generalist agents, 2024

  56. [56]

    Y. Wang, T. Shen, L. Liu, and J. Xie. Sibyl: Simple yet effective agent framework for complex real-world reasoning, 2024

  57. [57]

    Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024

  58. [58]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  59. [59]

    M. Wooldridge and N. R. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 10:115 – 152, 1995

  60. [60]

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. In COLM, 2024

  61. [61]

    Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024

  62. [62]

    Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui. The rise and potential of large language model based agents: A survey, 2023

  63. [63]

    C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024

  64. [64]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  65. [65]

    H. Yang, S. Yue, and Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions, 2023

  66. [66]

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

  67. [67]

    J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023

  68. [68]

    S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

  69. [69]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

  70. [70]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  71. [71]

    O. Yoran, S. J. Amouyal, C. Malaviya, B. Bogin, O. Press, and J. Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024

  72. [72]

    A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023

  73. [73]

    C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang. Ufo: A ui-focused agent for windows os interaction, 2024

  74. [74]

    H. Zhang, X. Pan, H. Wang, K. Ma, W. Yu, and D. Yu. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024

  75. [75]

    Y. Zhang, Z. Ma, Y. Ma, Z. Han, Y. Wu, and V. Tresp. Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration, 2024

  76. [76]

    Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024

  77. [77]

    Z. Zhang and A. Zhang. You only look at screens: Multimodal chain-of-action agents, 2024

  78. [78]

    Z. J. Zhang, E. Schoop, J. Nichols, A. Mahajan, and A. Swearngin. From interaction to impact: Towards safer ai agents through understanding and evaluating ui operation impacts. arXiv preprint arXiv:2410.09006, 2024

  79. [79]

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. Webarena: A realistic web environment for building autonomous agents, 2024

Appendix A Statistical Methodology

In Table 1, we report the mean and an error bar for each reported method on the three benchmarks. To obtain the error bar we u...
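As a hedged sketch only: assuming the error bar is the standard error of the mean over per-task binary success outcomes (a common choice for benchmark accuracy; the paper's exact procedure is not fully reproduced in this excerpt and may differ), the mean-and-error computation could look like:

```python
import math

def mean_and_error_bar(successes):
    """Mean success rate and standard error over per-task outcomes.

    `successes` is a list of 0/1 task outcomes. For Bernoulli outcomes
    the standard error of the mean is sqrt(p * (1 - p) / n).
    NOTE: illustrative assumption only, not the paper's confirmed method.
    """
    n = len(successes)
    p = sum(successes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical example: 33 successes out of 50 benchmark tasks
p, se = mean_and_error_bar([1] * 33 + [0] * 17)
print(f"{p:.2f} ± {se:.2f}")
```

The hypothetical task counts above are placeholders, not results from the paper's Table 1.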