arxiv: 2406.12373 · v3 · pith:RSD5FDDUnew · submitted 2024-06-18 · 💻 cs.CL · cs.AI · cs.LG

WebCanvas: Benchmarking Web Agents in Online Environments

Pith reviewed 2026-05-20 11:24 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords web agents, online evaluation, dynamic web environments, task success rate, task completion rate, benchmark dataset, intermediate states, agent framework

The pith

WebCanvas gives web agents an online benchmark with intermediate checks that reveals 23 percent success on live tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WebCanvas as a framework to test web agents on real changing websites rather than fixed snapshots. It supplies a metric focused on the key steps that actually move a task forward, a dataset of 542 tasks each carrying multiple evaluation points, and simple tools so others can keep the tasks current. Tests using the framework show the strongest agent reaches a 48.8 percent completion rate yet only a 23.1 percent full success rate, with clear differences across sites and domains. A reader would care because agents meant for everyday use must keep working when pages update or elements shift, and this setup makes that kind of testing practical. The authors also release an extensible agent system so the community can run and improve online evaluations.

Core claim

WebCanvas is an online evaluation framework built from a metric that tracks only the critical intermediate actions or states required for task completion, the Mind2Web-Live dataset of 542 tasks containing 2439 such states, and lightweight annotation tools. The framework supports realistic testing in evolving web environments. An open-sourced agent built on it records a 23.1 percent task success rate and 48.8 percent task completion rate on the test set while exposing performance gaps across websites, domains, and setups.
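A minimal sketch of how the two headline numbers could be computed from per-task key-node outcomes. The definitions used here are assumptions made for illustration: a task counts as a success only when every annotated intermediate state is reached, and the completion rate is the mean per-task fraction of states reached. The paper's exact scoring and aggregation may differ, and the names `TaskResult`, `task_success_rate`, and `task_completion_rate` are hypothetical. For scale, 2439 states over 542 tasks is an average of 4.5 states per task.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskResult:
    """Outcome of one task (hypothetical structure, not the paper's schema)."""
    task_id: str
    key_nodes_total: int    # annotated intermediate states for this task
    key_nodes_reached: int  # states the agent's trajectory satisfied


def task_completion_rate(results: List[TaskResult]) -> float:
    """Mean per-task fraction of key nodes reached (assumed definition)."""
    if not results:
        raise ValueError("results must not be empty")
    fractions = [r.key_nodes_reached / r.key_nodes_total for r in results]
    return sum(fractions) / len(fractions)


def task_success_rate(results: List[TaskResult]) -> float:
    """Fraction of tasks where every key node was reached (assumed definition)."""
    if not results:
        raise ValueError("results must not be empty")
    successes = sum(r.key_nodes_reached == r.key_nodes_total for r in results)
    return successes / len(results)


# Toy data, invented for illustration only.
toy_results = [
    TaskResult("t1", key_nodes_total=4, key_nodes_reached=4),
    TaskResult("t2", key_nodes_total=5, key_nodes_reached=2),
    TaskResult("t3", key_nodes_total=3, key_nodes_reached=0),
    TaskResult("t4", key_nodes_total=6, key_nodes_reached=3),
]
completion = task_completion_rate(toy_results)
success = task_success_rate(toy_results)
```

Under these assumed definitions the completion rate can never fall below the success rate, which is consistent with the reported gap between 48.8 percent and 23.1 percent.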

What carries the argument

The novel evaluation metric that isolates critical intermediate actions or states needed for task completion while filtering noise from minor events or changed web elements.

If this is right

  • Web agents can be tested under conditions that match the frequent interface and content updates found on real sites.
  • Performance differences across specific websites, domains, and experimental setups become measurable and comparable.
  • The community can use the provided tools to collect and refresh high-quality task data over time.
  • An extensible open agent framework supports ongoing online inference, evaluation, and module additions by others.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents tuned on this kind of intermediate-state feedback may handle unexpected page changes more reliably than those trained only on static traces.
  • Similar intermediate-check methods could be adapted to evaluate agents in other changing environments such as mobile apps or simulated worlds.
  • Expanding the dataset to additional task types would likely surface new patterns of failure that static benchmarks miss.
  • Combining the framework with stronger reasoning modules might raise the observed completion rates without altering the core metric.

Load-bearing premise

The novel evaluation metric reliably captures critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements.

What would settle it

Human raters reviewing a sample of completed tasks and judging whether the metric's selected intermediate states actually match the steps required for success, or whether agents pass the metric yet still fail the overall task in uncontrolled live runs.

read the original abstract

For web agents to be practically useful, they must adapt to the continuously evolving web environment characterized by frequent updates to user interfaces and content. However, most existing benchmarks only capture the static aspects of the web. To bridge this gap, we introduce WebCanvas, an innovative online evaluation framework for web agents that effectively addresses the dynamic nature of web interactions. WebCanvas contains three main components to facilitate realistic assessments: (1) A novel evaluation metric which reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements. (2) A benchmark dataset called Mind2Web-Live, a refined version of original Mind2Web static dataset containing 542 tasks with 2439 intermediate evaluation states; (3) Lightweight and generalizable annotation tools and testing pipelines that enables the community to collect and maintain the high-quality, up-to-date dataset. Building on WebCanvas, we open-source an agent framework with extensible modules for reasoning, providing a foundation for the community to conduct online inference and evaluations. Our best-performing agent achieves a task success rate of 23.1% and a task completion rate of 48.8% on the Mind2Web-Live test set. Additionally, we analyze the performance discrepancies across various websites, domains, and experimental environments. We encourage the community to contribute further insights on online agent evaluation, thereby advancing this field of research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces WebCanvas, an online evaluation framework for web agents operating in dynamic web environments. It consists of three main components: (1) a novel evaluation metric intended to capture critical intermediate actions or states for task completion while ignoring noise from insignificant events or changed web elements; (2) the Mind2Web-Live dataset, a refined version of the static Mind2Web dataset containing 542 tasks and 2439 intermediate evaluation states; and (3) lightweight annotation tools and testing pipelines for community maintenance of up-to-date data. The authors also release an extensible open-source agent framework and report that their best agent achieves a 23.1% task success rate and 48.8% task completion rate on the Mind2Web-Live test set, along with analyses of performance differences across websites, domains, and environments.

Significance. If the novel metric is shown to be reliable, this work would meaningfully advance benchmarking of web agents by shifting from static to online, dynamic evaluations, addressing a clear limitation in existing resources. The open-sourcing of the agent framework, annotation tools, and dataset is a clear strength that promotes reproducibility and community contributions. The reported performance figures usefully illustrate current limitations in web agent capabilities.

major comments (1)
  1. [Abstract] Abstract, component (1): The central claim that the novel evaluation metric 'reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements' lacks any reported validation, such as inter-annotator agreement, correlation with human task-success labels, or ablation studies on state selection criteria. This is load-bearing because the 23.1% success rate and 48.8% completion rate on Mind2Web-Live cannot be confidently interpreted without evidence that the metric avoids post-hoc tuning or selection bias.
minor comments (2)
  1. [Results] The reported performance numbers (23.1% and 48.8%) lack error bars, confidence intervals, or statistical comparisons to baselines, which would help assess the significance of discrepancies across websites and domains.
  2. [Dataset] The description of how the 2439 intermediate states were chosen and annotated could be expanded for better reproducibility, even if the tools are open-sourced.
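
Minor comment 1 asks for uncertainty estimates. One common choice for a single success proportion is the Wilson score interval, sketched below. This is an illustration of the kind of interval the referee requests, not an analysis from the paper; the function name `wilson_interval` and the counts in the usage line are hypothetical, since the test-set size is not stated on this page.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to an approximate 95 percent interval.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie in [0, n]")
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# Hypothetical counts: 23 successful tasks out of 100 attempted.
low, high = wilson_interval(23, 100)
```

With a small test set the interval around a rate near 23 percent is wide, which is why per-website or per-domain comparisons benefit from reporting it.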

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for identifying a key area where additional evidence would strengthen the manuscript. We address the major comment below and outline the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] Abstract, component (1): The central claim that the novel evaluation metric 'reliably capture critical intermediate actions or states necessary for task completions while disregarding noise caused by insignificant events or changed web-elements' lacks any reported validation, such as inter-annotator agreement, correlation with human task-success labels, or ablation studies on state selection criteria. This is load-bearing because the 23.1% success rate and 48.8% completion rate on Mind2Web-Live cannot be confidently interpreted without evidence that the metric avoids post-hoc tuning or selection bias.

    Authors: We agree that the abstract makes a strong claim about the metric's reliability and that explicit validation evidence is needed to support interpretation of the reported success and completion rates. The intermediate states were selected through expert annotation following the original Mind2Web task definitions, with guidelines focused on identifying necessary actions and states while excluding transient UI changes. In the revised manuscript we will add a dedicated subsection on the annotation protocol, report inter-annotator agreement computed on a sampled subset of the 2439 states, include a correlation analysis between the metric and human task-success judgments on a held-out sample, and present an ablation study examining the effect of alternative state-selection criteria. These additions will directly address concerns about post-hoc tuning and selection bias. revision: yes
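
The inter-annotator agreement promised in the response could be reported as Cohen's kappa. The sketch below assumes two annotators each give one categorical label per candidate state (for example, 1 for "key node" and 0 for "not a key node"); that labeling scheme, the function name `cohens_kappa`, and the sample labels are assumptions for illustration and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    categories = set(counts_a) | set(counts_b)
    expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label throughout.
        return 1.0
    return (observed - expected) / (1 - expected)


# Invented labels for eight candidate states.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa corrects raw agreement for chance, so it is more informative than percent agreement when most candidate states fall into one class.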

Circularity Check

0 steps flagged

No circularity: benchmark and metric introduced directly without derivation chain

full rationale

The paper introduces WebCanvas as a new online evaluation framework consisting of a directly defined metric, the Mind2Web-Live dataset, and annotation tools. No mathematical derivation, first-principles prediction, or fitted parameter is claimed; the reported success rates are empirical measurements on the newly constructed test set rather than outputs that reduce to the inputs by construction. The metric is presented as a novel definition that captures intermediate states while ignoring noise, without any self-referential equations or self-citation load-bearing steps that would make the result tautological. This is a standard benchmark contribution whose central claims remain independent of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on the assumption that the refined Mind2Web-Live tasks remain representative of real web use and that the new metric can be applied consistently across evolving sites; no free parameters or invented entities are introduced beyond standard benchmark construction.

axioms (1)
  • domain assumption: The novel evaluation metric reliably captures critical intermediate actions while disregarding noise from insignificant events or changed web-elements.
    Stated as the first main component in the abstract; central to the framework's validity.

pith-pipeline@v0.9.0 · 5815 in / 1258 out tokens · 34272 ms · 2026-05-20T11:24:31.392727+00:00 · methodology

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RiskWebWorld: A Realistic Interactive Benchmark for GUI Agents in E-commerce Risk Management

    cs.AI 2026-04 unverdicted novelty 7.0

    RiskWebWorld is the first realistic interactive benchmark for GUI agents in e-commerce risk management, revealing a large gap between generalist and specialized models plus RL gains.

  2. WebForge: Breaking the Realism-Reproducibility-Scalability Trilemma in Browser Agent Benchmark

    cs.AI 2026-04 unverdicted novelty 7.0

    WebForge is an automated multi-agent framework that creates realistic and reproducible browser agent benchmarks at scale, demonstrated via a 934-task benchmark that reveals distinct model capability profiles through m...

  3. ClawBench: Can AI Agents Complete Everyday Online Tasks?

    cs.CL 2026-04 unverdicted novelty 7.0

    ClawBench is a benchmark of 153 live-web tasks where AI agents achieve low success rates, e.g. 33.3% for Claude Sonnet 4.6.

  4. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  5. Orak: A Foundational Benchmark for Training and Evaluating LLM Agents on Diverse Video Games

    cs.AI 2025-06 unverdicted novelty 7.0

    Orak is a foundational benchmark providing training data, interfaces, and evaluation tools for LLM agents across diverse video game genres.

  6. SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    SimGym is a browser-based VLM agent framework that simulates A/B test outcomes on e-commerce storefronts with 77% directional agreement on add-to-cart shifts from real buyer traffic.

  7. DocOS: Towards Proactive Document-Guided Actions in GUI Agents

    cs.AI 2026-05 unverdicted novelty 6.0

    Introduces DocOS benchmark to test GUI agents on proactively locating, comprehending, and executing instructions from online documentation in interactive web settings.

  8. From Controlled to the Wild: Evaluation of Pentesting Agents for the Real-World

    cs.AI 2026-05 unverdicted novelty 6.0

    A practical evaluation protocol for AI pentesting agents that uses validated vulnerability discovery, LLM semantic matching, and bipartite scoring to assess performance in realistic, complex targets.

  9. Log analysis is necessary for credible evaluation of AI agents

    cs.AI 2026-05 conditional novelty 6.0

    Log analysis of AI agent executions is necessary for credible evaluation because outcome-only metrics permit shortcuts, poor real-world prediction, and concealed risks.

  10. Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents

    cs.HC 2026-05 unverdicted novelty 6.0

    Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.

  11. Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows

    cs.SE 2026-04 unverdicted novelty 6.0

    Claw-Eval-Live benchmark with 105 tasks shows no frontier LLM agent exceeds 66.7% success rate on evolving real-world workflows, with HR and multi-system tasks as persistent bottlenecks.

  12. ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    ClawEnvKit automates generation of diverse verified environments for claw-like agents from natural language, producing the Auto-ClawEval benchmark of 1,040 environments that matches human-curated quality at 13,800x lo...

  13. GenericAgent: A Token-Efficient Self-Evolving LLM Agent via Contextual Information Density Maximization (V1.0)

    cs.CL 2026-04 unverdicted novelty 6.0

    GenericAgent outperforms other LLM agents on long-horizon tasks by maximizing context information density with fewer tokens via minimal tools, on-demand memory, trajectory-to-SOP evolution, and compression.

  14. WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

    cs.AI 2026-03 unverdicted novelty 6.0

    WebFactory is a fully automated RL pipeline that compresses LLM-encoded internet knowledge into grounded web agents, achieving performance comparable to human-annotated training but using synthetic data from only 10 websites.

  15. A Functionality-Grounded Benchmark for Evaluating Web Agents in E-commerce Domains

    cs.CL 2025-08 unverdicted novelty 6.0

    The paper proposes Amazon-Bench, a functionality-grounded benchmark for web agents in e-commerce that generates diverse task queries from webpage elements and evaluates both task performance and safety risks.

  16. Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

    cs.CL 2024-12 conditional novelty 6.0

    Aguvis presents a pure vision-based framework for autonomous GUI agents using structured reasoning via inner monologue, a new multimodal dataset, and two-stage training to reach SOTA on offline and online benchmarks.

  17. GUI Agents with Reinforcement Learning: Toward Digital Inhabitants

    cs.AI 2026-04 unverdicted novelty 5.0

    The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

  18. Addressing the Reality Gap: A Three-Tension Framework for Agentic AI Adoption

    cs.CY 2026-04 unverdicted novelty 5.0

    A three-tension framework is introduced to help navigate the adoption of autonomous agentic AI systems in K-12 and higher education by addressing practical, temporal, and value-based challenges.

  19. Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents

    cs.HC 2025-09 unverdicted novelty 5.0

    Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Opera...

  20. A Survey of Self-Evolving Agents: What, When, How, and Where to Evolve on the Path to Artificial Super Intelligence

    cs.AI 2025-07 accept novelty 4.0

    The paper delivers the first systematic review of self-evolving agents, structured around what components evolve, when adaptation occurs, and how it is implemented.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 20 Pith papers · 14 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 2023

  3. [3]

    Mind2web: Towards a generalist agent for the web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36, 2024

  4. [4]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

  5. [5]

    Multimodal web navigation with instruction-finetuned foundation models

    Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023

  6. [6]

    A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023

  7. [7]

    Understanding html with large language models

    Izzeddin Gür, Ofir Nachum, Yingjie Miao, Mustafa Safdari, Austin Huang, Aakanksha Chowdhery, Sharan Narang, Noah Fiedel, and Aleksandra Faust. Understanding html with large language models. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2803–2821, 2023

  8. [8]

    WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

  9. [9]

    LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code

    Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. arXiv preprint arXiv:2403.07974, 2024

  10. [10]

    Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553, 2024

  11. [11]

    Dynabench: Rethinking benchmarking in nlp

    Douwe Kiela, Max Bartolo, Yixin Nie, Divyansh Kaushik, Atticus Geiger, Zhengxuan Wu, Bertie Vidgen, Grusha Prasad, Amanpreet Singh, Pratik Ringshia, et al. Dynabench: Rethinking benchmarking in nlp. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–...

  12. [12]

    Language models can solve computer tasks

    Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks. Advances in Neural Information Processing Systems, 36, 2024

  13. [13]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

  14. [14]

    Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent

    Hanyu Lai, Xiao Liu, Iat Long Iong, Shuntian Yao, Yuxuan Chen, Pengbo Shen, Hao Yu, Hanchen Zhang, Xiaohan Zhang, Yuxiao Dong, et al. Autowebglm: Bootstrap and reinforce a large language model-based web navigating agent. arXiv preprint arXiv:2404.03648, 2024

  15. [15]

    Reinforcement learning on web interfaces using workflow-guided exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. In International Conference on Learning Representations (ICLR), 2018

  16. [16]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024

  17. [17]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

  18. [18]

    Hierarchical prompting assists large language model on web navigation

    Robert Lo, Abishek Sridhar, Frank F Xu, Hao Zhu, and Shuyan Zhou. Hierarchical prompting assists large language model on web navigation. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10217–10244, 2023

  19. [19]

    Weblinx: Real-world website navigation with multi-turn dialogue

    Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024

  20. [20]

    Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking

    Zhiyi Ma, Kawin Ethayarajh, Tristan Thrush, Somya Jain, Ledell Wu, Robin Jia, Christopher Potts, Adina Williams, and Douwe Kiela. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351–10367, 2021

  21. [21]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023

  22. [22]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  23. [23]

    Is self-repair a silver bullet for code generation?

    Theo X Olausson, Jeevana Priya Inala, Chenglong Wang, Jianfeng Gao, and Armando Solar-Lezama. Is self-repair a silver bullet for code generation? In The Twelfth International Conference on Learning Representations, 2023

  24. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022

  25. [25]

    Autonomous evaluation and refinement of digital agents

    Jiayi Pan, Yichi Zhang, Nicholas Tomlin, Yifei Zhou, Sergey Levine, and Alane Suhr. Autonomous evaluation and refinement of digital agents, 2024

  26. [26]

    Androidinthewild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Androidinthewild: A large-scale dataset for android device control. Advances in Neural Information Processing Systems, 36, 2024

  27. [27]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017

  28. [28]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. Advances in Neural Information Processing Systems, 36, 2024

  29. [29]

    Cognitive architectures for language agents

    Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2024. ISSN 2835-8856. URL https://openreview.net/forum?id=1i6ZCvflQJ. Survey Certification.

  30. [30]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  31. [31]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  32. [32]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022

  33. [33]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh Jing Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. arXiv preprint arXiv:2404.07972, 2024

  34. [34]

    React: Synergizing reasoning and acting in language models

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. React: Synergizing reasoning and acting in language models. In International Conference on Learning Representations (ICLR), 2023

  35. [35]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  36. [36]

    Tree of thoughts: Deliberate problem solving with large language models

    Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024

  37. [37]

    GPT-4V(ision) is a Generalist Web Agent, if Grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v(ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

  38. [38]

    Webarena: A realistic web environment for building autonomous agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023

  39. [39]

    Deathloop

    Given the complexity of webpage elements, our initial implementations focus predominantly on parent-child mapping relationships. Future work will delve deeper into inter-element mappings to ensure the accuracy and correctness of element mappings. Table 14: Case study of failure trajectories.

  40. [40]

    For instance, issues such as CAPTCHAs, network outages, or inconsistencies across different IPs can influence outcomes

    Network Instability: The variability in network conditions can lead to discrepancies between the results obtained from online real-time evaluations and those from closed environments. For instance, issues such as CAPTCHAs, network outages, or inconsistencies across different IPs can influence outcomes. However, in other words, WebCanvas allows for the gen...

  41. [41]

    This oversight can lead to a misalignment between the defined key nodes and the essential components of task completion, inadvertently penalizing correct processes

    Complex Task Pathways: The diversity of potential execution paths for a given task may not be completely identified by human annotators. This oversight can lead to a misalignment between the defined key nodes and the essential components of task completion, inadvertently penalizing correct processes. A model-based evaluation approach could mitigate some o...

  42. [42]

    For example, a task might involve booking a flight to Hawaii next month if the weather is favorable

    Static Evaluation Functions: The current static nature of our evaluation functions does not accommodate changes in task instructions based on environmental variables such as time, location, or weather conditions. For example, a task might involve booking a flight to Hawaii next month if the weather is favorable. Ideally, the evaluation module would dynami...

  43. [43]

    It has three attributes: '[40]' for the element's element_id, 'link' indicates the element is a link, and 'About' for the content of the element

    button 'See more' ''' In this example, each row represents the characteristic representation of a web page element. It has three attributes: '[40]' for the element's element_id, 'link' indicates the element is a link, and 'About' for the content of the element. Note: The above element provided...

  44. [44]

    In the initial step of a process or when there's no preceding interaction history (i.e., the previous trace is empty)

  45. [45]

    thought

    In situations where the accessibility tree is absent or not provided. - Your action should not be the same as last step's action. - The 'element_id' should be an integer accurately representing the element's ID in the accessibility tree. - AVOID using the provided example's element_i...