Pith · machine review for the scientific record

arxiv: 2403.07718 · v5 · submitted 2024-03-12 · 💻 cs.LG · cs.AI

Recognition: 1 theorem link

WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?


Pith reviewed 2026-05-15 02:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: web agents · LLM agents · benchmarks · task automation · enterprise software · ServiceNow · BrowserGym

The pith

Web agents based on large language models show some success on enterprise tasks but leave a large gap to full automation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorkArena as a benchmark of 33 tasks drawn from the ServiceNow platform to measure how well LLM-based agents can carry out typical daily activities of knowledge workers through web browsers. It also presents BrowserGym, an environment that supplies a rich set of actions and multimodal observations for building and testing such agents. Empirical tests reveal that current agents achieve partial success yet remain far from completing the full set of tasks automatically. The results further show a clear performance difference, with closed-source models outperforming open-source ones. The benchmark is intended to guide development toward agents that can handle real enterprise software work.
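
Concretely, BrowserGym exposes such tasks through a Gymnasium-style interface. The sketch below shows the shape of an agent loop under that assumption; the task id, observation contents, and action-string syntax are illustrative guesses, not confirmed API (consult the BrowserGym and WorkArena repositories for the actual names).

```python
# A minimal agent-loop sketch, assuming BrowserGym registers WorkArena tasks
# as Gymnasium environments. Task id, observation keys, and the action string
# below are assumptions for illustration, not the confirmed API.
import gymnasium as gym
import browsergym.workarena  # assumed import that registers WorkArena tasks

env = gym.make("browsergym/workarena.servicenow.order-standard-laptop")  # hypothetical id
obs, info = env.reset()

done = False
while not done:
    # The paper describes multimodal observations: obs would bundle e.g. a
    # DOM snapshot, an accessibility tree, and a screenshot for the agent.
    action = 'click("a42")'  # assumed: actions are strings in a browser DSL
    obs, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

env.close()
```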

Core claim

We propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs.

What carries the argument

The WorkArena benchmark of 33 ServiceNow tasks, together with the BrowserGym environment that supplies actions and multimodal observations for agent design and evaluation

If this is right

  • Agents can manage some simpler browser-based tasks but fail on more involved ones.
  • Closing the performance gap between open and closed-source models is a priority for progress.
  • The benchmark can be used to track improvements in agent capabilities over time.
  • Full automation of knowledge-work tasks is not yet feasible with existing methods.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extending the benchmark to other enterprise platforms would test whether the observed gaps are specific to ServiceNow or more general.
  • The results point to a need for better ways to handle long sequences of actions and complex interfaces in agent training.
  • Access to proprietary model capabilities currently provides a practical advantage for deploying web agents in work settings.

Load-bearing premise

The 33 tasks chosen for WorkArena are representative of the typical daily work of knowledge workers utilizing enterprise software systems.

What would settle it

Demonstrating near-complete success rates on the 33 WorkArena tasks with new open-source models or on a wider set of real enterprise workflows would challenge the claim of a considerable remaining gap.

read the original abstract

We study the use of large language model-based agents for interacting with software via web browsers. Unlike prior work, we focus on measuring the agents' ability to perform tasks that span the typical daily work of knowledge workers utilizing enterprise software systems. To this end, we propose WorkArena, a remote-hosted benchmark of 33 tasks based on the widely-used ServiceNow platform. We also introduce BrowserGym, an environment for the design and evaluation of such agents, offering a rich set of actions as well as multimodal observations. Our empirical evaluation reveals that while current agents show promise on WorkArena, there remains a considerable gap towards achieving full task automation. Notably, our analysis uncovers a significant performance disparity between open and closed-source LLMs, highlighting a critical area for future exploration and development in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WorkArena, a remote-hosted benchmark of 33 tasks drawn from the ServiceNow enterprise platform, together with the BrowserGym environment that supplies a rich action space and multimodal observations. Through empirical evaluation of LLM-based web agents, it claims that current agents show promise on these tasks yet exhibit a considerable gap to full automation, with a significant performance disparity between open- and closed-source models.

Significance. If the 33 tasks prove representative of typical knowledge-worker workflows, the benchmark and environment would supply a concrete, reproducible testbed for measuring progress toward practical web agents in enterprise settings. The reported open/closed-source gap would also constitute a falsifiable observation that could guide model development priorities.

major comments (1)
  1. [Task construction / benchmark definition] The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).
minor comments (2)
  1. [Evaluation] The abstract states that the evaluation 'reveals' the performance gap but does not specify success metrics, statistical tests, or controls; these details should be stated explicitly in the evaluation section (a sketch of such reporting follows this list).
  2. [BrowserGym environment] BrowserGym's action set and observation modalities are described at a high level; a table enumerating the exact actions and observation channels would improve reproducibility.
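
To make the request in minor comment 1 concrete, here is a minimal sketch of per-model success rates over the 33 tasks with percentile-bootstrap confidence intervals. The outcome vectors are hypothetical placeholders, not numbers from the paper.

```python
# Hedged sketch: per-task success rates with bootstrap confidence intervals,
# the kind of metric reporting the referee asks the evaluation to state.
import random

def bootstrap_ci(outcomes, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for a mean success rate over tasks."""
    rng = random.Random(seed)
    n = len(outcomes)
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_boot))
    return (sum(outcomes) / n,
            means[int((alpha / 2) * n_boot)],
            means[int((1 - alpha / 2) * n_boot)])

# Hypothetical per-task outcomes (1 = solved) for two agents on 33 tasks;
# illustrative only, not results reported in the paper.
results = {
    "closed-source-agent": [1] * 18 + [0] * 15,
    "open-source-agent": [1] * 4 + [0] * 29,
}
for model, outcomes in results.items():
    rate, lo, hi = bootstrap_ci(outcomes)
    print(f"{model}: success {rate:.2f} (95% CI [{lo:.2f}, {hi:.2f}])")
```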

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the concern regarding task representativeness below. While we cannot retroactively add proprietary usage logs, we can strengthen the justification for task selection and clarify the scope of our claims.

read point-by-point responses
  1. Referee: The central claim of a 'considerable gap towards achieving full task automation' and the open/closed-source disparity both rest on the assumption that the 33 ServiceNow tasks are representative of daily knowledge work. The manuscript describes the tasks as 'based on the widely-used ServiceNow platform' but supplies no usage-log frequency analysis, expert coverage survey, or cross-platform sampling to support this representativeness (see abstract and the task-construction description).

    Authors: We agree that a formal frequency analysis or expert survey would strengthen the claim of representativeness and note its absence as a limitation. The 33 tasks were curated by the authors (who have direct experience with ServiceNow deployments) to cover core, recurring operations in IT service management, HR, and knowledge workflows that are documented as standard in ServiceNow's own user guides and industry reports (e.g., incident creation, knowledge article search, user provisioning). ServiceNow is deployed in over 10,000 organizations and these task types align with publicly available case studies of daily knowledge-worker activity. The performance gap and open/closed-source disparity are reported as empirical observations on this benchmark rather than universal claims about all knowledge work; we will revise the abstract and task-construction section to explicitly frame the benchmark as a representative but non-exhaustive sample of enterprise web tasks and add references to ServiceNow documentation. No cross-platform sampling was performed because the benchmark deliberately targets a single widely-used platform to enable reproducible, remote-hosted evaluation.
    revision: partial

Circularity Check

0 steps flagged

No significant circularity in empirical benchmark evaluation

full rationale

This paper introduces the WorkArena benchmark and BrowserGym environment, then reports direct empirical performance measurements of LLM-based agents on 33 ServiceNow tasks. No mathematical derivations, equations, parameter fitting, or predictive claims exist that could reduce to inputs by construction. The central findings (performance gap to full automation and open/closed-source LLM disparity) rest on observed success rates rather than any self-definitional loop, fitted-input renaming, or load-bearing self-citation chain. Task selection and representativeness are stated assumptions open to external validation, but they do not create circularity within the reported results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Central claim rests on the representativeness of the 33 tasks and the fairness of the agent evaluation protocol; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 33 tasks in WorkArena represent typical daily knowledge work on enterprise platforms.
    Assumption invoked to justify the benchmark's relevance to real-world use; no external validation or sampling justification is provided in the abstract.

pith-pipeline@v0.9.0 · 5476 in / 1152 out tokens · 25678 ms · 2026-05-15T02:44:17.853901+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  2. OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    cs.AI 2024-04 accept novelty 8.0

    OSWorld provides the first unified real-computer benchmark for open-ended multimodal agent tasks, exposing large performance gaps between humans and state-of-the-art LLM/VLM agents.

  3. Beyond the All-in-One Agent: Benchmarking Role-Specialized Multi-Agent Collaboration in Enterprise Workflows

    cs.MA 2026-05 unverdicted novelty 7.0

    EntCollabBench shows that today's LLM agents still struggle with delegation, context transfer, parameter grounding, workflow closure, and decision commitment when tested in a simulated enterprise with 11 role-speciali...

  4. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  5. ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents

    cs.CV 2026-04 unverdicted novelty 7.0

    ClawMark is a new benchmark for multi-turn multi-day multimodal coworker agents in stateful evolving services, with deterministic Python checkers showing frontier models achieve only 20% strict task success.

  6. HiL-Bench (Human-in-Loop Benchmark): Do Agents Know When to Ask for Help?

    cs.AI 2026-04 unverdicted novelty 7.0

    HiL-Bench shows frontier AI agents fail to ask for help on incomplete tasks, recovering only a fraction of full-information performance, but RL training on Ask-F1 reward improves judgment and transfers across domains.

  7. SAGE: A Service Agent Graph-guided Evaluation Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    SAGE is a new multi-agent benchmark that formalizes service SOPs as dynamic dialogue graphs to measure LLM agents on logical compliance and path coverage, uncovering an execution gap and empathy resilience across 27 m...

  8. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  9. WebSP-Eval: Evaluating Web Agents on Website Security and Privacy Tasks

    cs.CR 2026-04 unverdicted novelty 7.0

    WebSP-Eval shows that multimodal LLM-based web agents fail more than 45% of the time on security and privacy tasks involving stateful UI elements such as toggles and checkboxes.

  10. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  11. On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment

    cs.AI 2026-05 unverdicted novelty 6.0

    FATE lets LLM agents self-evolve safer behaviors by generating and filtering repairs from their own failure trajectories using verifiers and Pareto optimization.

  12. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

  13. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 unverdicted novelty 6.0

    NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

  14. LocalAlign: Enabling Generalizable Prompt Injection Defense via Generation of Near-Target Adversarial Examples for Alignment Training

    cs.CR 2026-05 unverdicted novelty 6.0

    LocalAlign generates near-target adversarial examples via prompting and applies margin-aware alignment training to enforce tighter boundaries against prompt injection attacks.

  15. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  16. VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation

    cs.CL 2026-04 conditional novelty 6.0

    VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.

  17. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  18. Agent Workflow Memory

    cs.CL 2024-09 unverdicted novelty 6.0

    AWM induces reusable workflows from agent experiences and provides them selectively to improve success rates by 24.6% on Mind2Web and 51.1% on WebArena while reducing steps taken.

  19. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  20. AgentCE-Bench: Agent Configurable Evaluation with Scalable Horizons and Controllable Difficulty under Lightweight Environments

    cs.AI 2026-04 unverdicted novelty 5.0

    AgentCE-Bench is a lightweight grid-planning benchmark that controls task horizon via hidden slots H and difficulty via decoy budget B, validated across 13 models for consistent and discriminative evaluation.

  21. Securing Computer-Use Agents: A Unified Architecture-Lifecycle Framework for Deployment-Grounded Reliability

    cs.CL 2026-05 unverdicted novelty 4.0

    The paper develops a unified framework that organizes computer-use agent reliability around perception-decision-execution layers and creation-deployment-operation-maintenance stages to map security and alignment inter...

  22. ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 3.0

    Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · cited by 21 Pith papers · 8 internal anchors

  1. [1]

    The unsolved challenges of LLMs in open-ended web tasks: A case study

    Assouel, R., Marty, T., Caccia, M., Laradji, I., Drouin, A., Rajeswar, S., Palacios, H., Cappart, Q., Vazquez, D., Chapados, N., Gasse, M., and Lacoste, A. The unsolved challenges of LLMs in open-ended web tasks: A case study. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=jt3il4fC5B

  2. [2]

    OpenAI gym, 2016

    Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. OpenAI gym, 2016

  3. [3]

    Mind2Web: Towards a generalist agent for the web

    Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2Web: Towards a generalist agent for the web. arXiv, abs/2306.06070, 2023

  4. [4]

    Multimodal web navigation with instruction-finetuned foundation models

    Furuta, H., Nachum, O., Lee, K.-H., Matsuo, Y., Gu, S. S., and Gur, I. Multimodal web navigation with instruction-finetuned foundation models. arXiv, abs/2305.11854, 2023. URL https://arxiv.org/abs/2305.11854

  5. [5]

    Chrome DevTools Protocol, 2023

    Google. Chrome DevTools Protocol, 2023. URL https://chromedevtools.github.io/devtools-protocol/

  6. [7]

    A real-world WebAgent with planning, long context understanding, and program synthesis

    Gur, I., Furuta, H., Huang, A., Safdari, M., Matsuo, Y., Eck, D., and Faust, A. A real-world WebAgent with planning, long context understanding, and program synthesis. arXiv, abs/2307.12856, 2023b. URL https://arxiv.org/abs/2307.12856

  7. [8]

    WebVoyager: Building an end-to-end web agent with large multimodal models

    He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. arXiv, abs/2401.13919, 2024. URL https://arxiv.org/abs/2401.13919

  8. [9]

    Automatic macro mining from interaction traces at scale

    Huang, F., Li, G., Li, T., and Li, Y. Automatic macro mining from interaction traces at scale. In Proceedings of the CHI Conference on Human Factors in Computing Systems, CHI '24, New York, NY, USA, 2024. Association for Computing Machinery. ISBN 9798400703300. doi:10.1145/3613904.3642074. URL https://doi.org/10.1145/3613904.3642074

  9. [10]

    A data-driven approach for learning to control computers

    Humphreys, P. C., Raposo, D., Pohlen, T., Thornton, G., Chhaparia, R., Muldal, A., Abramson, J., Georgiev, P., Santoro, A., and Lillicrap, T. A data-driven approach for learning to control computers. In International Conference on Machine Learning (ICML), 2022

  10. [11]

    Language models can solve computer tasks

    Kim, G., Baldi, P., and McAleer, S. Language models can solve computer tasks. arXiv, abs/2303.17491, 2023. URL https://arxiv.org/abs/2303.17491

  11. [12]

    Mapping natural language instructions to mobile ui action sequences

    Li, Y., He, J., Zhou, X., Zhang, Y., and Baldridge, J. Mapping natural language instructions to mobile ui action sequences. In Annual Conference of the Association for Computational Linguistics (ACL 2020), 2020. URL https://www.aclweb.org/anthology/2020.acl-main.729.pdf

  12. [13]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Liu, E. Z., Guu, K., Pasupat, P., Shi, T., and Liang, P. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018

  13. [14]

    AgentBench: Evaluating LLMs as Agents

    Liu, X., Yu, H., Zhang, H., Xu, Y., Lei, X., Lai, H., Gu, Y., Ding, H., Men, K., Yang, K., Zhang, S., Deng, X., Zeng, A., Du, Z., Zhang, C., Shen, S., Zhang, T., Su, Y., Sun, H., Huang, M., Dong, Y., and Tang, J. AgentBench: Evaluating LLMs as agents. arXiv, abs/2308.03688, 2023a. URL https://arxiv.org/abs/2308.03688

  14. [15]

    BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents

    Liu, Z., Yao, W., Zhang, J., Xue, L., Heinecke, S., Murthy, R., Feng, Y., Chen, Z., Niebles, J. C., Arpit, D., Xu, R., Mui, P., Wang, H., Xiong, C., and Savarese, S. BOLAA: Benchmarking and orchestrating LLM-augmented autonomous agents. arXiv, abs/2308.05960, 2023b

  15. [16]

    WebLINX: Real-world website navigation with multi-turn dialogue

    Lù, X. H., Kasner, Z., and Reddy, S. WebLINX: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024

  16. [17]

    Knowledge 2020: "The digital workflow revolution has just begun"

    Maas, M. Knowledge 2020: "The digital workflow revolution has just begun". Technical report, Sprinklr, 2020. URL https://www.linkedin.com/pulse/knowledge-2020-digital-workflow-revolution-has-just-begun-maas/

  17. [18]

    ServiceNow joins the prestigious Fortune 500 list

    Mastantuono, G. ServiceNow joins the prestigious Fortune 500 list. https://www.servicenow.com/blogs/2023/servicenow-joins-fortune-500-list.html, 2023. Accessed: 2024-01-29

  18. [19]

    Llama 3: Meta's latest large language model

    Meta. Llama 3: Meta's latest large language model. https://github.com/meta-llama/llama3, 2024. Accessed: 2024-06-03

  19. [20]

    Playwright for Python documentation, 2023

    Microsoft. Playwright for Python documentation, 2023. URL https://playwright.dev/python/

  20. [21]

    WebGPT: Browser-assisted question-answering with human feedback

    Nakano, R., Hilton, J., Balaji, S., Wu, J., Ouyang, L., Kim, C., Hesse, C., Jain, S., Kosaraju, V., Saunders, W., Jiang, X., Cobbe, K., Eloundou, T., Krueger, G., Button, K., Knight, M., Chess, B., and Schulman, J. WebGPT: Browser-assisted question-answering with human feedback. arXiv, abs/2112.09332, 2021. URL https://arxiv.org/abs/2112.09332

  21. [22]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. ArXiv, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774

  22. [23]

    AndroidInTheWild: A large-scale dataset for Android device control

    Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. AndroidInTheWild: A large-scale dataset for Android device control. In Oh, A., Naumann, T., Globerson, A., Saenko, K., Hardt, M., and Levine, S. (eds.), Advances in Neural Information Processing Systems, volume 36, pp. 59708–59728, 2023

  23. [24]

    Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles

    SAE. Taxonomy and Definitions for Terms Related to Driving Automation Systems for On-Road Motor Vehicles. Technical report, Society of Automotive Engineers (SAE), April 2021. URL https://doi.org/10.4271/J3016_202104

  24. [25]

    Vancouver release notes

    ServiceNow. Vancouver release notes. Online, 2023. Available at: https://docs.servicenow.com/bundle/vancouver-release-notes/

  25. [26]

    World of bits: An open-domain platform for web-based agents

    Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning (ICML), 2017a

  26. [27]

    World of bits: An open-domain platform for web-based agents

    Shi, T., Karpathy, A., Fan, L., Hernandez, J., and Liang, P. World of bits: An open-domain platform for web-based agents. ICML, 2017b

  27. [28]

    Llama 2: Open foundation and fine-tuned chat models

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Kh...

  28. [29]

    A journey into the future of the translation industry, 2021

    van der Meer, J. A journey into the future of the translation industry, 2021. URL https://www.taus.net/resources/blog/a-journey-into-the-future-of-the-translation-industry. Accessed: 2024-02-01

  29. [30]

    Emergent Abilities of Large Language Models

    Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a

  30. [31]

    Chain-of-thought prompting elicits reasoning in large language models

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., ichter, b., Xia, F., Chi, E., Le, Q. V., and Zhou, D. Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp. 24824–24837. Curran Associates, Inc...

  31. [32]

    OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments

    Xie, T., Zhang, D., Chen, J., Li, X., Zhao, S., Cao, R., Hua, T. J., Cheng, Z., Shin, D., Lei, F., Liu, Y., Xu, Y., Zhou, S., Savarese, S., Xiong, C., Zhong, V., and Yu, T. OSWorld: Benchmarking multimodal agents for open-ended tasks in real computer environments, 2024

  32. [33]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Yang, J., Zhang, H., Li, F., Zou, X., Li, C., and Gao, J. Set-of-mark prompting unleashes extraordinary visual grounding in GPT-4V. arXiv preprint arXiv:2310.11441, 2023

  33. [34]

    WebShop: Towards scalable real-world web interaction with grounded language agents

    Yao, S., Chen, H., Yang, J., and Narasimhan, K. WebShop: Towards scalable real-world web interaction with grounded language agents. In Advances in Neural Information Processing Systems (NeurIPS), 2022

  34. [35]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., and Cao, Y. ReAct: Synergizing reasoning and acting in language models. arXiv, abs/2210.03629, 2023. URL https://arxiv.org/abs/2210.03629

  35. [36]

    AgentTuning: Enabling generalized agent abilities for LLMs

    Zeng, A., Liu, M., Lu, R., Wang, B., Liu, X., Dong, Y., and Tang, J. AgentTuning: Enabling generalized agent abilities for LLMs. arXiv preprint arXiv:2310.12823, 2023

  36. [37]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Zhou, S., Xu, F. F., Zhu, H., Zhou, X., Lo, R., Sridhar, A., Cheng, X., Bisk, Y., Fried, D., Alon, U., and Neubig, G. WebArena: A realistic web environment for building autonomous agents. ArXiv, abs/2307.13854, 2023. URL https://arxiv.org/abs/2307.13854