pith. machine review for the scientific record.

arxiv: 2411.04468 · v1 · submitted 2024-11-07 · 💻 cs.AI · cs.MA

Recognition: 2 theorem links

Magentic-One: A Generalist Multi-Agent System for Solving Complex Tasks

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 18:45 UTC · model grok-4.3

classification 💻 cs.AI cs.MA
keywords multi-agent systems · AI agents · orchestrator · agentic benchmarks · GAIA · WebArena · generalist agents

The pith

A multi-agent system led by an Orchestrator achieves statistically competitive performance on complex agentic benchmarks without modifying core agent capabilities or how the agents collaborate.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Magentic-One uses a lead Orchestrator agent to plan, track progress, and direct specialized agents for tasks like web browsing, file navigation, and code execution. This multi-agent setup delivers performance that is statistically competitive with the state of the art on three challenging benchmarks: GAIA, AssistantBench, and WebArena. The system requires no changes to the agents' core capabilities or collaboration methods to achieve these results across diverse tasks. Its modular architecture allows agents to be added or removed without prompt tuning or training, which supports easier development and adaptation to new scenarios. An open-source implementation and the AutoGenBench evaluation tool are provided to facilitate further research.

Core claim

Magentic-One is a generalist multi-agent system featuring an Orchestrator that plans and replans tasks while directing specialized agents to execute subtasks involving web browsers, local files, and Python code. It attains statistically competitive performance to existing state-of-the-art systems on the GAIA, AssistantBench, and WebArena benchmarks without any modifications to how the agents operate or interact. The design demonstrates progress toward generalist agentic systems by maintaining effectiveness across varied scenarios through modularity rather than specialization.

What carries the argument

The Orchestrator agent, which plans tasks, tracks progress, recovers from errors through replanning, and delegates to specialized agents for web operation, file navigation, and code writing and execution.
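The delegation pattern this describes can be sketched as a minimal loop: plan entries are matched to agents by capability, results are recorded in a ledger, and a missing capability triggers the replanning path. The agent names, skill tags, and methods below are illustrative stand-ins, not Magentic-One's actual API.

```python
from dataclasses import dataclass, field

@dataclass
class Agent:
    """A specialized worker the Orchestrator can delegate to (illustrative)."""
    name: str
    skills: set

    def run(self, subtask: str) -> str:
        # A real agent would operate a browser, read files, or execute code here.
        return f"{self.name} completed: {subtask}"

@dataclass
class Orchestrator:
    """Plans subtasks, tracks progress in a ledger, and replans on failure."""
    agents: list
    ledger: dict = field(default_factory=dict)

    def pick_agent(self, skill: str) -> Agent:
        return next(a for a in self.agents if skill in a.skills)

    def solve(self, plan) -> dict:
        for skill, subtask in plan:
            try:
                self.ledger[subtask] = self.pick_agent(skill).run(subtask)
            except StopIteration:
                # Error recovery: no agent covers the skill, so flag for replanning.
                self.ledger[subtask] = f"REPLAN: no agent covers '{skill}'"
        return self.ledger

team = [Agent("WebSurfer", {"web"}), Agent("Coder", {"code"})]
ledger = Orchestrator(team).solve(
    [("web", "find dataset"), ("code", "compute statistics")]
)
```

The ledger doubles as the progress tracker: any entry flagged for replanning tells the Orchestrator which subtask to reassign or decompose next.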

If this is right

  • The system integrates web, file, and code capabilities into unified task solving.
  • Competitive benchmark results hold without task-specific agent adjustments.
  • Agents can be swapped in or out to extend functionality without retraining.
  • Rigorous evaluation is supported by AutoGenBench with controls for repetition and isolation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such systems may accelerate development of AI for new applications by allowing plug-in agents for specific domains.
  • Real-world deployment could benefit from the error-recovery mechanisms in long-running tasks.
  • Testing on additional benchmarks involving physical or multi-modal actions would reveal scalability limits.

Load-bearing premise

The modular multi-agent design with an orchestrator allows agents to be added or removed without additional prompt tuning or training while maintaining performance across tasks.

What would settle it

A demonstration that adding or removing a specialized agent requires prompt tuning or training to preserve competitive performance on the benchmarks would falsify the modularity claim.
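That falsification test reduces to a before/after comparison: run the benchmark with the full team, remove one agent while leaving every prompt untouched, and rerun. The toy model below assumes task success hinges only on skill coverage; the skill labels and success model are invented, and a real run would use GAIA, AssistantBench, or WebArena.

```python
import random

def run_benchmark(team, tasks: int = 200) -> float:
    """Toy stand-in for one benchmark run: a task succeeds iff some agent on
    the team covers the skill it needs (seeded for reproducibility)."""
    rng = random.Random(0)
    covered = set().union(*team) if team else set()
    hits = sum(rng.choice(["web", "code", "files"]) in covered
               for _ in range(tasks))
    return hits / tasks

full_team = [{"web"}, {"code"}, {"files"}]
ablated = [{"web"}, {"code"}]  # drop the file-surfer; prompts untouched

full_rate = run_benchmark(full_team)   # 1.0 by construction
ablated_rate = run_benchmark(ablated)  # loses only the file-bound tasks
```

If restoring competitive performance after re-adding the agent required prompt tuning or training rather than the bare team change, the modularity claim would be falsified.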

read the original abstract

Modern AI agents, driven by advances in large foundation models, promise to enhance our productivity and transform our lives by augmenting our knowledge and capabilities. To achieve this vision, AI agents must effectively plan, perform multi-step reasoning and actions, respond to novel observations, and recover from errors, to successfully complete complex tasks across a wide range of scenarios. In this work, we introduce Magentic-One, a high-performing open-source agentic system for solving such tasks. Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code. We show that Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena. Magentic-One achieves these results without modification to core agent capabilities or to how they collaborate, demonstrating progress towards generalist agentic systems. Moreover, Magentic-One's modular design allows agents to be added or removed from the team without additional prompt tuning or training, easing development and making it extensible to future scenarios. We provide an open-source implementation of Magentic-One, and we include AutoGenBench, a standalone tool for agentic evaluation. AutoGenBench provides built-in controls for repetition and isolation to run agentic benchmarks in a rigorous and contained manner -- which is important when agents' actions have side-effects. Magentic-One, AutoGenBench and detailed empirical performance evaluations of Magentic-One, including ablations and error analysis are available at https://aka.ms/magentic-one

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces Magentic-One, a multi-agent system with an Orchestrator that plans, tracks progress, and directs specialized agents for web browsing, file navigation, and Python code execution. It reports statistically competitive performance to the state-of-the-art on the GAIA, AssistantBench, and WebArena benchmarks without core modifications to agent capabilities or collaboration mechanisms, while emphasizing a modular design that permits adding or removing agents without additional prompt tuning or training. The work also releases an open-source implementation and AutoGenBench, a tool for controlled, isolated agentic evaluations with built-in repetition support.

Significance. If the benchmark results are statistically robust, the work advances generalist agentic systems by demonstrating a flexible, extensible multi-agent architecture that maintains performance across diverse tasks. The open-source release and AutoGenBench tool strengthen reproducibility and provide a practical evaluation framework for future agent research.

major comments (2)
  1. [§4 (Experimental Results)] The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, the number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added, with tables showing means, standard deviations, and p-values.
  2. [§3.2 (Modular Architecture)] The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.
minor comments (3)
  1. [§1 (Introduction)] Add direct citations to the original GAIA, AssistantBench, and WebArena papers when first describing each benchmark.
  2. [Figure 1 (System Overview)] The diagram should include a legend clarifying the direction of control messages versus data flow between the Orchestrator and specialized agents.
  3. [AutoGenBench description] Specify the exact isolation mechanisms (e.g., containerization or sandboxing) used to prevent side-effects during repeated benchmark runs.
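The isolation the last comment asks about is commonly achieved with throwaway containers: every repetition starts from the same clean image, and the container is discarded afterwards so side-effects cannot leak between runs. A generic sketch, assuming a hypothetical `agent-image` and `run-task` entry point; this is not AutoGenBench's actual CLI:

```python
import subprocess

def isolated_cmd(task_id: str, repetition: int) -> list:
    """Container invocation for one benchmark attempt (generic sketch, not
    AutoGenBench's actual interface). `--rm` discards the container afterwards;
    `--network none` blocks outbound traffic for tasks that must stay offline."""
    return ["docker", "run", "--rm", "--network", "none",
            "agent-image", "run-task", task_id, "--seed", str(repetition)]

def run_isolated(task_id: str, repetition: int) -> subprocess.CompletedProcess:
    return subprocess.run(isolated_cmd(task_id, repetition),
                          capture_output=True, text=True, timeout=600)

# Repetition with isolation: each attempt begins from the same clean image,
# so per-task success rates can carry honest error bars.
# results = [run_isolated("gaia_task_017", rep) for rep in range(3)]
```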

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address each major comment below and will revise the manuscript to incorporate the requested details and evidence.

read point-by-point responses
  1. Referee: [§4 (Experimental Results)] The abstract and introduction claim 'statistically competitive performance' on GAIA, AssistantBench, and WebArena, yet the reported results lack explicit error bars, number of independent runs, and the precise statistical tests (e.g., paired t-tests or Wilcoxon) used for baseline comparisons. This information is load-bearing for the central performance claim and must be added with tables showing means, standard deviations, and p-values.

    Authors: We agree that explicit statistical details are needed to fully support the performance claims. The experiments underlying our results were conducted with 3 independent runs per benchmark to capture variability. In the revised manuscript, we will update Section 4 with tables that report means and standard deviations, include error bars in figures, explicitly state the number of runs, and provide the outcomes of paired t-tests (including p-values) for comparisons against baselines. revision: yes

  2. Referee: [§3.2 (Modular Architecture)] The claim that agents can be added or removed 'without additional prompt tuning or training' while preserving performance requires an explicit ablation (e.g., Table X or Figure Y) that measures end-to-end success rates before and after team modifications on at least one benchmark. Without this, the generality assertion rests on architectural description rather than empirical evidence.

    Authors: We appreciate the suggestion to strengthen the empirical support for modularity. While the current results use the full agent team and the architecture is designed for easy addition/removal without prompt changes, we acknowledge that a dedicated ablation would provide clearer evidence. In the revision, we will add an ablation study (new table in Section 3.2) reporting GAIA success rates for the full team versus modified teams (e.g., without the CodeExecutor or WebSurfer agent), with no prompt or training adjustments. revision: yes
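The statistics the referee requests and the rebuttal promises (3 independent runs, means, standard deviations, paired tests) can be sketched as follows. The success rates are invented for illustration, not the paper's results; in practice `scipy.stats.ttest_rel` or `scipy.stats.wilcoxon` would supply the p-values on top of the t statistic computed here.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(xs, ys) -> float:
    """Paired t statistic on matched per-run scores; this is the quantity
    scipy.stats.ttest_rel computes before attaching a p-value."""
    diffs = [x - y for x, y in zip(xs, ys)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

# Hypothetical per-run success rates over 3 independent runs
# (illustrative numbers, not the paper's actual results).
magentic = [0.38, 0.41, 0.36]
baseline = [0.35, 0.37, 0.33]

t_stat = paired_t(magentic, baseline)
print(f"mean={mean(magentic):.3f}  sd={stdev(magentic):.3f}  t={t_stat:.2f}")
```

With only 3 runs the test has 2 degrees of freedom, so even a large t statistic carries wide uncertainty; this is why the referee asks for the run count to be stated explicitly.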

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper presents an empirical system description and benchmark evaluation rather than any mathematical derivation chain. Performance claims rest on direct measurements against external public benchmarks (GAIA, AssistantBench, WebArena) using the provided AutoGenBench tool for controlled repetition; no equations, fitted parameters, or predictions are derived from internal data. The modular orchestrator architecture is described at the implementation level without self-definitional reductions, self-citation load-bearing premises, or ansatz smuggling. All load-bearing steps are externally falsifiable via the open-source release and benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on existing large foundation models and public benchmarks; no new mathematical axioms, free parameters fitted inside the paper, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5698 in / 997 out tokens · 19909 ms · 2026-05-16T18:45:15.680308+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • LedgerCanonicality ZeroParameterComparisonLedger · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Magentic-One uses a multi-agent architecture where a lead agent, the Orchestrator, plans, tracks progress, and re-plans to recover from errors. Throughout task execution, the Orchestrator directs other specialized agents to perform tasks as needed, such as operating a web browser, navigating local files, or writing and executing Python code.

  • HierarchyEmergence hierarchy_emergence_forces_phi · unclear

    Relation between the paper passage and the cited Recognition theorem.

    Magentic-One achieves statistically competitive performance to the state-of-the-art on three diverse and challenging agentic benchmarks: GAIA, AssistantBench, and WebArena.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Why Do Multi-Agent LLM Systems Fail?

    cs.AI 2025-03 unverdicted novelty 8.0

    The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

  2. State-Centric Decision Process

    cs.AI 2026-05 unverdicted novelty 7.0

    SDP constructs a task-induced state space from raw text by having agents commit to and certify natural-language predicates as states, enabling structured planning and analysis in unstructured language environments.

  3. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 7.0

    LC-MAPF uses multi-round local communication between neighboring agents in a pre-trained model to outperform prior learning-based MAPF solvers on diverse unseen scenarios while preserving scalability.

  4. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...

  5. MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals

    cs.SE 2026-05 unverdicted novelty 7.0

    MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...

  6. From Skills to Talent: Organising Heterogeneous Agents as a Real-World Company

    cs.AI 2026-04 unverdicted novelty 7.0

    OMC framework turns multi-agent AI into self-organizing companies with Talents, Talent Market, and E²R search, achieving 84.67% success on PRDBench (15.48 points above prior art).

  7. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  8. Compass vs Railway Tracks: Unpacking User Mental Models for Communicating Long-Horizon Work to Humans vs. AI

    cs.HC 2026-01 unverdicted novelty 7.0

    Users treat human delegation for long tasks as a flexible compass but AI delegation as rigid railway tracks due to perceived AI limitations in inference and judgment.

  9. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  10. AgentCollabBench: Diagnosing When Good Agents Make Bad Collaborators

    cs.CL 2026-05 unverdicted novelty 6.0

    AgentCollabBench shows that multi-agent reliability is limited by communication topology, with converging-DAG nodes causing synthesis bottlenecks that discard constraints and explain 7-40% of information loss variance.

  11. Learning to Communicate Locally for Large-Scale Multi-Agent Pathfinding

    cs.AI 2026-05 unverdicted novelty 6.0

    LC-MAPF is a decentralized MAPF solver that uses a learnable multi-round communication module among nearby agents to outperform prior IL and RL methods while preserving scalability.

  12. Trace-Level Analysis of Information Contamination in Multi-Agent Systems

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.

  13. BONSAI: A Mixed-Initiative Workspace for Human-AI Co-Development of Visual Analytics Applications

    cs.HC 2026-04 unverdicted novelty 6.0

    BONSAI introduces a four-layer architecture and four-phase workflow for human-AI co-development of visual analytics applications, shown in case studies to enable efficient novel tool creation and reconstruction from p...

  14. Human-Guided Harm Recovery for Computer Use Agents

    cs.AI 2026-04 conditional novelty 6.0

    Introduces harm recovery as a post-execution safeguard for computer-use agents, operationalized via a human-preference rubric, reward model, and BackBench benchmark that shows improved recovery trajectories.

  15. CADMAS-CTX: Contextual Capability Calibration for Multi-Agent Delegation

    cs.AI 2026-04 unverdicted novelty 6.0

    CADMAS-CTX replaces static skill profiles with context-conditioned Beta posteriors and uncertainty-penalized routing, yielding higher accuracy on GAIA (0.442) and SWE-bench (31.4%) than static baselines.

  16. MARL-GPT: Foundation Model for Multi-Agent Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 6.0

    A single transformer model trained offline on expert trajectories from three distinct MARL environments achieves competitive performance against specialized baselines without per-task tuning.

  17. EchoTrail-GUI: Building Actionable Memory for GUI Agents via Critic-Guided Self-Exploration

    cs.AI 2025-12 unverdicted novelty 6.0

    EchoTrail-GUI builds an automated memory of successful GUI task trajectories via self-exploration and injects relevant past examples to raise success rates on Android benchmarks.

  18. Don't Trust Your Upstream: Exploiting LLM Multi-Agent System via Topology-Guided Adversarial Propagation

    cs.CR 2025-12 unverdicted novelty 6.0

    A topology-aware attack propagates adversarial contamination across LLM multi-agent systems to achieve 40-85% success rates on frameworks and real applications, revealing overlooked vulnerabilities.

  19. Retrieval-Conditioned Topology Selection with Provable Budget Conservation for Multi-Agent Code Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    RGAO combines retrieval-based complexity assessment with a formal budget algebra to enable dynamic topology selection in multi-agent code generation with provable conservation.

  20. Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

    cs.CL 2025-11 unverdicted novelty 5.0

    Matrix provides a peer-to-peer multi-agent system for synthetic data generation that scales to tens of thousands of workflows and delivers 2-15x higher throughput than centralized designs without quality loss.

  21. A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

    cs.AI 2025-08 unverdicted novelty 5.0

    A comprehensive review of self-evolving AI agents that improve themselves over time, organized via a framework of inputs, agent system, environment, and optimizers, with domain-specific and safety discussions.

  22. AssemPlanner: A Multi-Agent Based Task Planning Framework for Flexible Assembly System

    cs.RO 2026-05 unverdicted novelty 4.0

    AssemPlanner is a ReAct-based multi-agent system that autonomously generates production plans from natural language inputs by integrating scheduling, knowledge, line balancing, and scene graph feedback.

Reference graph

Works this paper leans on

79 extracted references · 79 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    Abuelsaad, D

    T. Abuelsaad, D. Akkil, P. Dey, A. Jagmohan, A. Vempaty, and R. Kokku. Agent-e: From autonomous web navigation to foundational design principles in agentic systems, 2024

  2. [2]

    Github — babyagi

    BabyAGI. Github — babyagi. https://github.com/yoheinakajima/babyagi, 2023

  3. [3]

    Bonatti, D

    R. Bonatti, D. Zhao, F. Bonacci, D. Dupont, S. Abdali, Y. Li, Y. Lu, J. Wagle, K. Koishida, A. Bucker, L. Jang, and Z. Hui. Windows agent arena: Evaluating multi-modal os agents at scale, 2024

  4. [4]

    R. Cao, F. Lei, H. Wu, J. Chen, Y. Fu, H. Gao, X. Xiong, H. Zhang, Y. Mao, W. Hu, T. Xie, H. Xu, D. Zhang, S. Wang, R. Sun, P. Yin, C. Xiong, A. Ni, Q. Liu, V. Zhong, L. Chen, K. Yu, and T. Yu. Spider2-v: How far are multimodal agents from automating data science and engineering workflows?, 2024

  5. [5]

    Z. Chen, M. White, R. Mooney, A. Payani, Y. Su, and H. Sun. When is tree search useful for llm planning? it depends on the discriminator, 2024

  6. [6]

    arXiv preprint arXiv:2401.03428

    Y. Cheng, C. Zhang, Z. Zhang, X. Meng, S. Hong, W. Li, Z. Wang, Z. Wang, F. Yin, J. Zhao, et al. Exploring large language model based intelligent agents: Definitions, meth- ods, and prospects. arXiv preprint arXiv:2401.03428 , 2024

  7. [7]

    D’Arcy, T

    M. D’Arcy, T. Hope, L. Birnbaum, and D. Downey. Marg: Multi-agent review generation for scientific papers. arXiv preprint arXiv:2401.04259 , 2024

  8. [8]

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2web: Towards a generalist agent for the web, 2023

  9. [9]

    Dibia, A

    V. Dibia, A. Fourney, G. Bansal, F. Poursabzi-Sangdeh, H. Liu, and S. Amershi. Aligning offline metrics and human judgments of value for code generation models. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Findings of the Association for Computational Linguistics: ACL 2023 , pages 8516–8528, Toronto, Canada, July 2023. Association for Computatio...

  10. [10]

    Drouin, M

    A. Drouin, M. Gasse, M. Caccia, I. H. Laradji, M. D. Verme, T. Marty, L. Boisvert, M. Thakkar, Q. Cappart, D. Vazquez, N. Chapados, and A. Lacoste. Workarena: How capable are web agents at solving common knowledge work tasks?, 2024

  11. [11]

    Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325, 2023

  12. [12]

    B. J. Grosz and S. Kraus. The evolution of sharedplans. In Proceedings of the International Conference on Multi-Agent Systems , 1999

  13. [13]

    T. Guo, X. Chen, Y. Wang, R. Chang, S. Pei, N. V. Chawla, O. Wiest, and X. Zhang. Large language model based multi-agents: A survey of progress and challenges. arXiv preprint arXiv:2402.01680, 2024

  14. [14]

    H. He, W. Yao, K. Ma, W. Yu, Y. Dai, H. Zhang, Z. Lan, and D. Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024. 21

  15. [15]

    S. Hong, X. Zheng, J. Chen, Y. Cheng, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, et al. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023

  16. [16]

    N. R. Jennings and M. Wooldridge. Applications of intelligent agents. In Proceedings of the International Conference on Autonomous Agents , 1998

  17. [17]

    C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe- bench: Can language models resolve real-world github issues?, 2024

  18. [18]

    Kapoor, B

    S. Kapoor, B. Stroebl, Z. S. Siegel, N. Nadgir, and A. Narayanan. Ai agents that matter, 2024

  19. [19]

    J. Y. Koh, S. McAleer, D. Fried, and R. Salakhutdinov. Tree search for language model agents, 2024

  20. [20]

    Li and J

    E. Li and J. Waldo. Websuite: Systematically evaluating why web agents fail. arXiv preprint arXiv:2406.01623, 2024

  21. [21]

    G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem. Camel: Commu- nicative agents for ”mind” exploration of large scale language model society, 2023

  22. [22]

    Liang, Z

    T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, Z. Tu, and S. Shi. Encouraging divergent thinking in large language models through multi-agent debate, 2023

  23. [23]

    J. Liu, Y. Song, B. Y. Lin, W. Lam, G. Neubig, Y. Li, and X. Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024

  24. [24]

    N. Liu, L. Chen, X. Tian, W. Zou, K. Chen, and M. Cui. From llm to conversational agent: A memory enhanced architecture with fine-tuning of large language models. arXiv e-prints, pages arXiv–2401, 2024

  25. [25]

    Y. Liu, S. K. Lo, Q. Lu, L. Zhu, D. Zhao, X. Xu, S. Harrer, and J. Whittle. Agent design pattern catalogue: A collection of architectural patterns for foundation model based agents. arXiv preprint arXiv:2405.10467 , 2024

  26. [26]

    C. Lu, C. Lu, R. T. Lange, J. Foerster, J. Clune, and D. Ha. The ai scientist: Towards fully automated open-ended scientific discovery. arXiv preprint arXiv:2408.06292 , 2024

  27. [27]

    The Landscape of Emerging AI Agent Architectures for Reasoning, Planning, and Tool Calling: A Survey

    T. Masterman, S. Besen, M. Sawtell, and A. Chao. The landscape of emerging ai agent architectures for reasoning, planning, and tool calling: A survey. arXiv preprint arXiv:2404.11584, 2024

  28. [28]

    B. Messing. An introduction to multiagent systems. K¨ unstliche Intell., 17:58–, 2002

  29. [29]

    Mialon, C

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: a benchmark for general ai assistants, 2023

  30. [30]

    GAIA: a benchmark for General AI Assistants

    G. Mialon, C. Fourrier, C. Swift, T. Wolf, Y. LeCun, and T. Scialom. Gaia: benchmark for general ai assistants. arXiv preprint arXiv:2311.12983 , 2023

  31. [31]

    WebGPT: Browser-assisted question-answering with human feedback

    R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332 , 2021

  32. [32]

    N. J. Nilsson. Stuart russell and peter norvig, artificial intelligence: A modern approach. Artificial Intelligence, 82:369–380, 1996. 22

  33. [33]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  34. [34]

    J. Pan, Y. Zhang, N. Tomlin, Y. Zhou, S. Levine, and A. Suhr. Autonomous evaluation and refinement of digital agents, 2024

  35. [35]

    Y. Pan, D. Kong, S. Zhou, C. Cui, Y. Leng, B. Jiang, H. Liu, Y. Shang, S. Zhou, T. Wu, and Z. Wu. Webcanvas: Benchmarking web agents in online environments, 2024

  36. [36]

    ART: Automatic multi-step reasoning and tool-use for large language models

    B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro. Art: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014, 2023

  37. [37]

    J. S. Park, J. C. O’Brien, C. J. Cai, M. R. Morris, P. Liang, and M. S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023

  38. [38]

    D. Paul, M. Ismayilzada, M. Peyrard, B. Borges, A. Bosselut, R. West, and B. Faltings. RE- FINER: Reasoning feedback on intermediate representations. In Y. Graham and M. Purver, editors, Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1100–1126, St. Julian’s, Malta...

  39. [39]

    Putta, E

    P. Putta, E. Mills, N. Garg, S. Motwani, C. Finn, D. Garg, and R. Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024

  40. [40]

    Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, Y. Huang, C. Xiao, C. Han, Y. R. Fung, Y. Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y. Ye, B. Li, Z. Tang, J. Yi, Y. Zhu, Z. Dai, L. Yan, X. Cong, Y. Lu, W. Zhao, Y. Huang, J. Yan, X. Han, X. Sun, D. Li, J. Phang, C. Yang, T. Wu, H. Ji, Z. Liu, and M. Sun. Tool lear...

  41. [41]

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun. Toolllm: Facilitating large language models to master 16000+ real-world apis, 2023

  42. [42]

    Trase tops gaia leaderboard, 2024

    Red Cell Partners. Trase tops gaia leaderboard, 2024

  43. [43]

    S. Reed, K. Zolna, E. Parisotto, S. G. Colmenarejo, A. Novikov, G. Barth-Maron, M. Gimenez, Y. Sulsky, J. Kay, J. T. Springenberg, et al. A generalist agent. arXiv preprint arXiv:2205.06175, 2022

  44. [44]

    Trail of Bits Blog

    M. Russinovich, A. Salem, and R. Eldan. Great, now write an article about that: The crescendo multi-turn llm jailbreak attack. arXiv preprint arXiv:2404.01833 , 2024

  45. [45]

    Scerri, D

    P. Scerri, D. V. Pynadath, and M. Tambe. Adjustable autonomy in real-world multi-agent environments. In International Conference on Autonomous Agents , 2001

  46. [46]

    Schick, J

    T. Schick, J. Dwivedi-Yu, R. Dess` ı, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom. Toolformer: Language models can teach themselves to use tools, 2023

  47. [47]

    T. Shi, A. Karpathy, L. Fan, J. Hernandez, and P. Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning. PMLR, 2017

  48. [48]

    Shinn, F

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024. 23

  49. [49]

    Sodhi, S

    P. Sodhi, S. R. K. Branavan, Y. Artzi, and R. McDonald. Step: Stacked llm policies for web actions, 2024

  50. [50]

    Y. Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y. Lin. Trial and error: Exploration-based trajectory optimization for llm agents, 2024

  51. [51]

    Stone and M

    P. Stone and M. Veloso. Multiagent systems: A survey from a machine learning perspective. Auton. Robots, 8(3):345–383, June 2000

  52. [52]

    Talebirad and A

    Y. Talebirad and A. Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents, 2023

  53. [53]

    M. Tambe. Implementing agent teams in dynamic multiagent environments. Appl. Artif. Intell., 12:189–210, 1998

  54. [54]

    L. Wang, C. Ma, X. Feng, Z. Zhang, H. Yang, J. Zhang, Z. Chen, J. Tang, X. Chen, Y. Lin, et al. A survey on large language model based autonomous agents. arXiv preprint arXiv:2308.11432, 2023

  55. [55]

    X. Wang, B. Li, Y. Song, F. F. Xu, X. Tang, M. Zhuge, J. Pan, Y. Song, B. Li, J. Singh, H. H. Tran, F. Li, R. Ma, M. Zheng, B. Qian, Y. Shao, N. Muennighoff, Y. Zhang, B. Hui, J. Lin, R. Brennan, H. Peng, H. Ji, and G. Neubig. Opendevin: An open platform for ai software developers as generalist agents, 2024

  56. [56]

    Y. Wang, T. Shen, L. Liu, and J. Xie. Sibyl: Simple yet effective agent framework for complex real-world reasoning, 2024

  57. [57]

    Z. Z. Wang, J. Mao, D. Fried, and G. Neubig. Agent workflow memory, 2024

  58. [58]

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, E. Chi, Q. Le, and D. Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  59. [59]

    M. Wooldridge and N. R. Jennings. Intelligent agents: theory and practice. The Knowledge Engineering Review, 10:115 – 152, 1995

  60. [60]

    Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. In COLM, 2024

  61. [61]

    Z. Wu, C. Han, Z. Ding, Z. Weng, Z. Liu, S. Yao, T. Yu, and L. Kong. Os-copilot: Towards generalist computer agents with self-improvement, 2024

  62. [62]

    Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui. The rise and potential of large language model based agents: A survey, 2023

  63. [63]

    C. S. Xia, Y. Deng, S. Dunn, and L. Zhang. Agentless: Demystifying llm-based software engineering agents. arXiv preprint arXiv:2407.01489, 2024

  64. [64]

    T. Xie, D. Zhang, J. Chen, X. Li, S. Zhao, R. Cao, T. J. Hua, Z. Cheng, D. Shin, F. Lei, Y. Liu, Y. Xu, S. Zhou, S. Savarese, C. Xiong, V. Zhong, and T. Yu. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  65. [65]

    H. Yang, S. Yue, and Y. He. Auto-gpt for online decision making: Benchmarks and additional opinions, 2023

  66. [66]

    J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press. Swe-agent: Agent-computer interfaces enable automated software engineering. arXiv preprint arXiv:2405.15793, 2024

  67. [67]

    J. Yang, H. Zhang, F. Li, X. Zou, C. Li, and J. Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023

  68. [68]

    S. Yao, H. Chen, J. Yang, and K. Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents, 2023

  69. [69]

    S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan. Tree of thoughts: Deliberate problem solving with large language models, 2023

  70. [70]

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  71. [71]

    O. Yoran, S. J. Amouyal, C. Malaviya, B. Bogin, O. Press, and J. Berant. Assistantbench: Can web agents solve realistic and time-consuming tasks?, 2024

  72. [72]

    A. Zeng, M. Liu, R. Lu, B. Wang, X. Liu, Y. Dong, and J. Tang. Agenttuning: Enabling generalized agent abilities for llms, 2023

  73. [73]

    C. Zhang, L. Li, S. He, X. Zhang, B. Qiao, S. Qin, M. Ma, Y. Kang, Q. Lin, S. Rajmohan, D. Zhang, and Q. Zhang. Ufo: A ui-focused agent for windows os interaction, 2024

  74. [74]

    H. Zhang, X. Pan, H. Wang, K. Ma, W. Yu, and D. Yu. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024

  75. [75]

    Y. Zhang, Z. Ma, Y. Ma, Z. Han, Y. Wu, and V. Tresp. Webpilot: A versatile and autonomous multi-agent system for web task execution with strategic exploration, 2024

  76. [76]

    Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury. Autocoderover: Autonomous program improvement. In Proceedings of the 33rd ACM SIGSOFT International Symposium on Software Testing and Analysis, pages 1592–1604, 2024

  77. [77]

    Z. Zhang and A. Zhang. You only look at screens: Multimodal chain-of-action agents, 2024

  78. [78]

    Z. J. Zhang, E. Schoop, J. Nichols, A. Mahajan, and A. Swearngin. From interaction to impact: Towards safer ai agents through understanding and evaluating ui operation impacts. arXiv preprint arXiv:2410.09006, 2024

  79. [79]

    S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig. Webarena: A realistic web environment for building autonomous agents, 2024

Appendix A Statistical Methodology

In Table 1, we report the mean and an error bar for each reported method on the three benchmarks. To obtain the error bar we u...
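As a hedged sketch only: assuming the error bar is the standard error of the mean over per-task binary success outcomes (a common choice for benchmark accuracy; the paper's exact procedure is not fully reproduced in this excerpt and may differ), the mean-and-error computation could look like:

```python
import math

def mean_and_error_bar(successes):
    """Mean success rate and standard error over per-task outcomes.

    `successes` is a list of 0/1 task outcomes. For Bernoulli outcomes
    the standard error of the mean is sqrt(p * (1 - p) / n).
    NOTE: illustrative assumption only, not the paper's confirmed method.
    """
    n = len(successes)
    p = sum(successes) / n
    se = math.sqrt(p * (1 - p) / n)
    return p, se

# Hypothetical example: 33 successes out of 50 benchmark tasks
p, se = mean_and_error_bar([1] * 33 + [0] * 17)
print(f"{p:.2f} ± {se:.2f}")
```

The hypothetical task counts above are placeholders, not results from the paper's Table 1.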