pith. machine review for the scientific record.

arxiv: 2404.07972 · v2 · submitted 2024-04-11 · 💻 cs.AI · cs.CL

Recognition: no theorem link

OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:14 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords multimodal agents · computer environment benchmark · open-ended tasks · GUI grounding · LLM agents · execution-based evaluation · OSWorld · desktop applications

The pith

OSWorld creates the first scalable real-computer environment benchmark; on its 369 open-ended tasks, multimodal agents reach only 12.24% success while humans exceed 72%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper presents OSWorld, a new environment that runs actual operating systems so multimodal agents can perform, and be evaluated on, real computer tasks. The benchmark consists of 369 tasks drawn from everyday use of web apps, desktop software, file operations, and multi-app workflows, each with an initial-state setup and a custom execution-based scoring script. Evaluations of current models show they lag far behind human performance, mainly because they fail at grounding actions in graphical interfaces and lack operational knowledge. A reader should care because this provides a more realistic test than previous narrow benchmarks and highlights what agents still need to act as useful computer assistants.

Core claim

OSWorld is the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across operating systems such as Ubuntu, Windows, and macOS. It enables assessment of open-ended tasks involving arbitrary applications. The associated benchmark includes 369 tasks from real-world cases, each with a detailed initial state and a custom evaluation script. The best state-of-the-art agent achieves only a 12.24% success rate while humans exceed 72.36%; agents struggle especially with GUI grounding and operational knowledge. The environment allows comprehensive analysis to guide development of better multimodal generalist agents.
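
To make the benchmark's anatomy concrete, here is a minimal sketch of what one OSWorld-style task definition could look like, pairing an initial-state setup with an execution-based evaluator. The field names and values are illustrative assumptions, not OSWorld's actual schema.

```python
# Hypothetical sketch of an OSWorld-style task definition: an initial-state
# setup plus an execution-based evaluator that checks the final environment
# state. Field names here are illustrative, not OSWorld's actual schema.

task = {
    "id": "libreoffice-calc-sum-column",
    "instruction": "Sum column B of budget.xlsx and write the total to cell B11.",
    "setup": [
        # Commands that bring the VM to a known initial state.
        {"type": "copy_file", "src": "assets/budget.xlsx", "dst": "~/Desktop/budget.xlsx"},
        {"type": "launch", "app": "libreoffice --calc ~/Desktop/budget.xlsx"},
    ],
    "evaluator": {
        # State-based check: inspect the saved file, not the action sequence,
        # so any trajectory that produces the right result counts as success.
        "type": "check_cell_value",
        "file": "~/Desktop/budget.xlsx",
        "cell": "B11",
        "expected": 4250.0,
    },
}
```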

What carries the argument

OSWorld, the unified real computer environment that integrates task setup, execution, and evaluation for open-ended tasks across various applications and operating systems.
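
Read operationally, "integrates task setup, execution, and evaluation" implies an interaction loop like the runnable sketch below. All names here (FakeDesktopEnv, run_episode) are our own assumptions standing in for a VM-backed environment, not OSWorld's actual API.

```python
# A minimal, runnable sketch of the loop a unified environment implies:
# reset to a task's initial state, step an agent on raw observations,
# then score the final state with the task's evaluator.

class FakeDesktopEnv:
    """Stand-in for a VM-backed desktop environment, for illustration only."""

    def reset(self, task):
        self.state = {"clipboard": ""}
        return "screenshot_0"                  # initial observation

    def step(self, action):
        if action == ("type", "hello"):
            self.state["clipboard"] = "hello"  # pretend the action took effect
        return "screenshot_1"                  # next observation

    def evaluate(self, task):
        # Execution-based check against the final state, not the action log.
        return 1.0 if self.state["clipboard"] == task["expected"] else 0.0

def run_episode(env, policy, task, max_steps=15):
    obs = env.reset(task)
    for _ in range(max_steps):
        action = policy(task["instruction"], obs)
        if action == "DONE":
            break
        obs = env.step(action)
    return env.evaluate(task)

task = {"instruction": "type hello", "expected": "hello"}
score = run_episode(FakeDesktopEnv(), lambda _i, _o: ("type", "hello"), task)
print(score)  # 1.0
```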

If this is right

  • Current state-of-the-art LLM and VLM agents exhibit significant deficiencies in performing as computer assistants.
  • Agents primarily struggle with GUI grounding and operational knowledge in real environments.
  • OSWorld provides insights for developing multimodal generalist agents that prior benchmarks could not supply.
  • The setup supports interactive learning and can be used across multiple operating systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of OSWorld could standardize evaluation of computer-use agents, similar to how other benchmarks advanced their fields.
  • The identified performance gap may drive research into better vision-language-action models for interface interaction.
  • Public release of the environment allows independent researchers to build and test agents on consistent real-world tasks.
  • This work suggests that progress in multimodal agents for computers will require advances beyond current model capabilities in handling dynamic desktop interactions.

Load-bearing premise

The 369 tasks chosen from real-world use cases and equipped with custom evaluation scripts are representative of the full diversity and complexity of open-ended computer work without selection bias or narrow criteria.
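
One way to probe this premise, rather than take it on faith, is to compute distribution statistics over the tasks' metadata. A minimal sketch follows, with hypothetical field names, since the benchmark's exact annotation schema is not quoted here.

```python
from collections import Counter

# Hypothetical task metadata; OSWorld's real configs carry similar
# information, but the exact keys here are assumptions for illustration.
tasks = [
    {"id": "t001", "os": "Ubuntu", "app": "LibreOffice Calc", "domain": "office"},
    {"id": "t002", "os": "Ubuntu", "app": "Chrome", "domain": "web"},
    {"id": "t003", "os": "Windows", "app": "File Manager", "domain": "file_io"},
    # ... 369 entries in the full benchmark
]

def coverage_report(tasks):
    """Print per-field distribution statistics of the kind a skeptic would want."""
    for field in ("domain", "os", "app"):
        counts = Counter(t[field] for t in tasks)
        total = sum(counts.values())
        print(field)
        for value, n in counts.most_common():
            print(f"  {value}: {n} ({n / total:.1%})")

coverage_report(tasks)
```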

What would settle it

A multimodal agent achieving success rates substantially above 12.24%, such as 40% or more, on the full set of OSWorld tasks in a controlled evaluation would challenge the reported extent of current model deficiencies.
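
The bar of "substantially above" can be made statistically precise: if a new agent's binomial confidence interval clears the baseline's, the gap has demonstrably narrowed. A standard-library sketch, where the 45-success count is inferred from 12.24% of 369 tasks and the 150-success agent is hypothetical:

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% Wilson score interval for a binomial success rate."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half

# Reported best-model result: 12.24% of 369 tasks (approximately 45 successes).
lo, hi = wilson_interval(45, 369)
print(f"baseline: [{lo:.3f}, {hi:.3f}]")        # roughly [0.092, 0.159]

# A hypothetical new agent solving 150/369 (~40.7%): its lower bound clearing
# the baseline's upper bound would meet the "40% or more" bar suggested above.
lo_new, _ = wilson_interval(150, 369)
print(f"new agent lower bound: {lo_new:.3f}")   # roughly 0.358 > 0.159
```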

read the original abstract

Autonomous agents that accomplish complex computer tasks with minimal human interventions have the potential to transform human-computer interaction, significantly enhancing accessibility and productivity. However, existing benchmarks either lack an interactive environment or are limited to environments specific to certain applications or domains, failing to reflect the diverse and complex nature of real-world computer use, thereby limiting the scope of tasks and agent scalability. To address this issue, we introduce OSWorld, the first-of-its-kind scalable, real computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across various operating systems such as Ubuntu, Windows, and macOS. OSWorld can serve as a unified, integrated computer environment for assessing open-ended computer tasks that involve arbitrary applications. Building upon OSWorld, we create a benchmark of 369 computer tasks involving real web and desktop apps in open domains, OS file I/O, and workflows spanning multiple applications. Each task example is derived from real-world computer use cases and includes a detailed initial state setup configuration and a custom execution-based evaluation script for reliable, reproducible evaluation. Extensive evaluation of state-of-the-art LLM/VLM-based agents on OSWorld reveals significant deficiencies in their ability to serve as computer assistants. While humans can accomplish over 72.36% of the tasks, the best model achieves only 12.24% success, primarily struggling with GUI grounding and operational knowledge. Comprehensive analysis using OSWorld provides valuable insights for developing multimodal generalist agents that were not possible with previous benchmarks. Our code, environment, baseline models, and data are publicly available at https://os-world.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces OSWorld as the first scalable, real-computer environment for multimodal agents, supporting task setup, execution-based evaluation, and interactive learning across Ubuntu, Windows, and macOS. It constructs a benchmark of 369 tasks derived from real-world use cases involving web/desktop apps, file I/O, and multi-app workflows, each with initial-state configurations and custom evaluation scripts. Experiments show state-of-the-art LLM/VLM agents achieving 12.24% success versus humans at >72.36%, with primary failure modes in GUI grounding and operational knowledge.

Significance. If the benchmark holds, this work supplies the first unified, real-environment platform for open-ended computer tasks, enabling reproducible evaluation and interactive learning beyond prior domain-specific or simulated benchmarks. The public release of the environment, code, baselines, and data is a clear strength that directly supports community progress. The concrete identification of failure modes offers actionable guidance for developing multimodal generalist agents.

major comments (2)
  1. [§4] §4 (Benchmark Construction): The 369 tasks are stated to derive from real-world use cases, but the section provides no quantitative coverage metrics (e.g., domain/OS/application distribution statistics or comparison against external task taxonomies). This directly affects the central claim that the 12.24% vs. >72.36% gap reveals general deficiencies in open-ended computer work.
  2. [§5] §5 (Experiments and Evaluation): The custom execution-based scripts are presented as reliable, yet the manuscript contains no analysis or ablation testing whether agents succeeding via unscripted but functionally equivalent paths would be scored as failures. This risks inflating the reported model-human gap and is load-bearing for the deficiency conclusions.
minor comments (1)
  1. [Abstract] Abstract and §5: Explicitly name the single best model achieving 12.24% and cite the exact table/figure row where this number appears.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§4] §4 (Benchmark Construction): The 369 tasks are stated to derive from real-world use cases, but the section provides no quantitative coverage metrics (e.g., domain/OS/application distribution statistics or comparison against external task taxonomies). This directly affects the central claim that the 12.24% vs. >72.36% gap reveals general deficiencies in open-ended computer work.

    Authors: We agree that quantitative coverage metrics would better substantiate the representativeness of the 369 tasks and the generalizability of the observed performance gap. In the revised manuscript, we will add a new subsection or table in §4 that reports the distribution of tasks across domains (e.g., web browsing, file management, multi-app workflows), operating systems (Ubuntu, Windows, macOS), and specific applications. We will also include a brief comparison against established task taxonomies from prior HCI and AI literature to contextualize coverage. These additions will directly support the claim that the benchmark reflects diverse real-world computer use. revision: yes

  2. Referee: [§5] §5 (Experiments and Evaluation): The custom execution-based scripts are presented as reliable, yet the manuscript contains no analysis or ablation testing whether agents succeeding via unscripted but functionally equivalent paths would be scored as failures. This risks inflating the reported model-human gap and is load-bearing for the deficiency conclusions.

    Authors: The evaluation scripts are intentionally state-based: they inspect the final environment state (e.g., file contents, application window states, or output artifacts) rather than prescribing exact action sequences, which is intended to accommodate multiple valid paths. Nevertheless, we acknowledge the value of explicitly analyzing potential false negatives from functionally equivalent but unscripted trajectories. In the revision, we will expand §5 with a dedicated paragraph and examples illustrating how the scripts verify functional success, discuss observed cases where alternative paths were correctly accepted, and note any remaining limitations where certain edge-case equivalences might not be captured. This will clarify the reliability of the reported gap without requiring an exhaustive new ablation study. revision: partial
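
As a concrete illustration of the state-based principle invoked in the response above, a checker of the following shape scores outcomes rather than trajectories. The path and expected string are hypothetical.

```python
from pathlib import Path

# Minimal sketch of a state-based evaluation script: it inspects the final
# artifact, so a drag-and-drop trajectory and a shell `mv` that produce the
# same file both score as success.

def evaluate_final_state(vm_home: Path) -> bool:
    """Return True iff the environment ended in the goal state."""
    report = vm_home / "Documents" / "report.txt"
    if not report.exists():
        return False  # artifact missing: failure regardless of actions taken
    # Check functional content, not how it was produced.
    return "Q3 revenue: 4,250" in report.read_text()

# Any two agents reaching this state, whether by GUI clicks or terminal
# commands, receive the same score; equivalence is judged on outcomes.
```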

Circularity Check

0 steps flagged

No circularity: empirical benchmark without derivations or fitted predictions

full rationale

The paper presents an empirical benchmark (OSWorld environment + 369 tasks) with execution-based evaluation scripts and public code/data. No mathematical derivation chain, no fitted parameters renamed as predictions, and no load-bearing self-citations or uniqueness theorems. Claims rest on direct experimental results (model success 12.24% vs human >72%) rather than any reduction to inputs by construction. This is the standard non-circular outcome for a systems/benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a new benchmark and environment without introducing fitted parameters, new axioms, or invented physical entities; it relies on standard LLM/VLM agents and real OS interfaces.

pith-pipeline@v0.9.0 · 5653 in / 1088 out tokens · 29519 ms · 2026-05-13T01:14:35.775432+00:00 · methodology

discussion (0)


Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SWE-smith: Scaling Data for Software Engineering Agents

    cs.SE 2025-04 conditional novelty 8.0

    SWE-smith scales software engineering training data to 50k instances across 128 repositories, enabling SWE-agent-LM-32B to achieve 40.2% Pass@1 on SWE-bench Verified, state of the art among open-source models.

  2. Do Androids Dream of Breaking the Game? Systematically Auditing AI Agent Benchmarks with BenchJack

    cs.AI 2026-05 conditional novelty 7.0

    BenchJack audits 10 AI agent benchmarks, synthesizes exploits achieving near-perfect scores without task completion, surfaces 219 flaws, and reduces hackable-task ratios to under 10% on four benchmarks via iterative patching.

  3. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  4. Feedback-Driven Execution for LLM-Based Binary Analysis

    cs.CR 2026-04 unverdicted novelty 7.0

    FORGE uses a reasoning-action-observation loop and Dynamic Forest of Agents to perform scalable LLM-based binary analysis, finding 1,274 vulnerabilities across 591 of 3,457 real-world firmware binaries at 72.3% precis...

  5. Beyond Chat and Clicks: GUI Agents for In-Situ Assistance via Live Interface Transformation

    cs.HC 2026-04 unverdicted novelty 7.0

    GUI agents can transform live web interfaces in real-time via DOM manipulations to deliver contextual assistance directly within the application.

  6. GUI-Perturbed: Domain Randomization Reveals Systematic Brittleness in GUI Grounding Models

    cs.LG 2026-04 conditional novelty 7.0

    GUI-Perturbed shows that GUI grounding models suffer systematic accuracy collapse under relational instructions and visual changes such as 70% zoom, with even augmented fine-tuning worsening results.

  7. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  8. SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

    cs.AI 2026-04 unverdicted novelty 7.0

    SEA-Eval is the first benchmark for self-evolving agents that uses sequential tasks to show success rate alone misleads while convergence in token efficiency T distinguishes genuine evolution.

  9. MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    cs.CV 2026-04 unverdicted novelty 7.0

    Open 4B and 8B visual web agents achieve state-of-the-art results on browser benchmarks by predicting actions from screenshots and instructions, outperforming similar open models and some closed larger-model agents, w...

  10. Same Outcomes, Different Journeys: A Trace-Level Framework for Comparing Human and GUI-Agent Behavior in Production Search Systems

    cs.IR 2026-04 unverdicted novelty 7.0

    A controlled study in an audio-streaming search app shows GUI agents match human task success and query patterns but use more search-centric, low-branching navigation while humans are content-centric and exploratory.

  11. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  12. AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

    cs.AI 2024-05 accept novelty 7.0

    AndroidWorld is a dynamic, reproducible Android benchmark that generates unlimited natural-language tasks for autonomous agents and shows current agents succeed on only 30.6 percent of them.

  13. Can Agent Benchmarks Support Their Scores? Evidence-Supported Bounds for Interactive-Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 6.0

    Agent benchmarks can report evidence-supported score bounds instead of single misleading success rates by adding a layer that checks required artifacts for outcome verification.

  14. MMTB: Evaluating Terminal Agents on Multimedia-File Tasks

    cs.MM 2026-05 unverdicted novelty 6.0

    MMTB is a new benchmark with 105 multimedia terminal tasks that shows how audio and video access changes agent performance and evidence use in executable workflows.

  15. Computer Use at the Edge of the Statistical Precipice

    cs.SE 2026-05 unverdicted novelty 6.0

    A blind replay script matches frontier model performance on static CUA benchmarks due to non-principled environments and evaluation methods, prompting PRISM design principles and the DigiWorld benchmark with improved ...

  16. Augmenting Interface Usability Heuristics for Reliable Computer-Use Agents

    cs.HC 2026-05 unverdicted novelty 6.0

    Augmented Nielsen heuristics improve computer-use agent task completion on varied interfaces while preserving human usability, as shown in UI-Verse experiments and human studies.

  17. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 unverdicted novelty 6.0

    NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

  18. An AI Agent Execution Environment to Safeguard User Data

    cs.CR 2026-04 unverdicted novelty 6.0

    GAAP guarantees confidentiality of private user data for AI agents by enforcing user-specified permissions deterministically through persistent information flow tracking, without trusting the agent or requiring attack...

  19. Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence

    cs.AI 2026-04 unverdicted novelty 6.0

    Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.

  20. AIT Academy: Cultivating the Complete Agent with a Confucian Three-Domain Curriculum

    cs.AI 2026-04 unverdicted novelty 6.0

    AIT Academy introduces a tripartite curriculum for AI agents across natural science, humanities, and social science domains, with reported gains of 15.9 points in security and 7 points in social reasoning under specif...

  21. MemExplorer: Navigating the Heterogeneous Memory Design Space for Agentic Inference NPUs

    cs.AR 2026-04 unverdicted novelty 6.0

    MemExplorer optimizes heterogeneous memory systems for agentic LLM inference on NPUs and reports up to 2.3x higher energy efficiency than baselines under fixed power budgets.

  22. IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    IntentScore learns intent-conditioned action scores from offline GUI trajectories and raises task success by 6.9 points on an unseen agent and environment.

  23. AdaRubric: Task-Adaptive Rubrics for Reliable LLM Agent Evaluation and Reward Learning

    cs.AI 2026-03 conditional novelty 6.0

    AdaRubric adaptively generates task-specific rubrics via LLM, scores agent trajectories with per-dimension confidence weighting, and produces filtered DPO pairs that raise human correlation to Pearson r=0.79 and downs...

  24. SoK: Agentic Skills -- Beyond Tool Use in LLM Agents

    cs.CR 2026-02 unverdicted novelty 6.0

    The paper systematizes agentic skills beyond tool use, providing design pattern and representation-scope taxonomies plus security analysis of malicious skill infiltration in agent marketplaces.

  25. OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    cs.CL 2024-10 unverdicted novelty 6.0

    OS-Atlas, trained on the largest open-source cross-platform GUI grounding corpus of 13 million elements, outperforms prior open-source models on six benchmarks across mobile, desktop, and web platforms.

  26. From Agent Loops to Deterministic Graphs: Execution Lineage for Reproducible AI-Native Work

    cs.AI 2026-05 conditional novelty 5.0

    Execution lineage models AI-native work as a DAG of computations with explicit dependencies, achieving perfect state preservation in controlled update tasks where loop-based agents introduce churn and contamination.

  27. Addressing the Reality Gap: A Three-Tension Framework for Agentic AI Adoption

    cs.CY 2026-04 unverdicted novelty 5.0

    A three-tension framework is introduced to help navigate the adoption of autonomous agentic AI systems in K-12 and higher education by addressing practical, temporal, and value-based challenges.

  28. AlphaEval: Evaluating Agents in Production

    cs.CL 2026-04 unverdicted novelty 5.0

    AlphaEval is a benchmark of 94 production-sourced tasks from seven companies for evaluating full AI agent products across six domains using multiple judgment methods, plus a framework to build similar benchmarks.

  29. From Pixels to Digital Agents: An Empirical Study on the Taxonomy and Technological Trends of Reinforcement Learning Environments

    cs.AI 2026-03 unverdicted novelty 5.0

    An empirical literature analysis reveals a bifurcation in RL environments into Semantic Prior (LLM-dominated) and Domain-Specific Generalization ecosystems with distinct cognitive fingerprints.

  30. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  31. ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

    cs.AI 2026-05 unverdicted novelty 3.0

    Expanded orchestration in ChromaFlow lowered accuracy on GAIA tasks from 29/53 to 27/53 while increasing timeouts, tool failures, and costs.

  32. X-OmniClaw Technical Report: A Unified Mobile Agent for Multimodal Understanding and Interaction

    cs.CV 2026-05 unverdicted novelty 3.0

    X-OmniClaw presents a unified architecture for Android mobile agents using Omni Perception, Memory, and Action modules to enable efficient multimodal task handling and personalized interactions.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · cited by 31 Pith papers · 17 internal anchors

  1. [1]

    ACT-1: Transformer for Actions

    Adept. ACT-1: Transformer for Actions. https://www.adept.ai/act, 2022

  2. [2]

    Introducing the next generation of claude

    Anthropic. Introducing the next generation of claude. https://www.anthropic.com/news/claude-3-family, 2023. Accessed: 2024-03-26

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku

    Anthropic. The claude 3 model family: Opus, sonnet, haiku. https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf, 2024

  4. [4]

    Screenai: A vision-language model for ui and infographics understanding

    Gilles Baechler, Srinivas Sunkara, Maria Wang, Fedir Zubach, Hassan Mansoor, Vincent Etter, Victor Cărbune, Jason Lin, Jindong Chen, and Abhanshu Sharma. Screenai: A vision-language model for ui and infographics understanding. arXiv preprint arXiv:2402.04615, 2024

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023

  6. [6]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. Rt-1: Robotics transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022

  7. [7]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. arXiv preprint arXiv:2307.15818, 2023

  8. [8]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. arXiv preprint arXiv:2401.10935, 2024

  9. [9]

    Mind2Web: Towards a Generalist Agent for the Web

    Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. arXiv preprint arXiv:2306.06070, 2023

  10. [10]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks? arXiv preprint arXiv:2403.07718, 2024

  11. [11]

    GPT-4V-Act: GPT-4 Variant for Active Learning

    D. Dupont. GPT-4V-Act: GPT-4 Variant for Active Learning. GitHub repository, 2023. URL https://github.com/ddupont808/GPT-4V-Act

  12. [12]

    Multimodal web navigation with instruction-finetuned foundation models

    Hiroki Furuta, Ofir Nachum, Kuang-Huei Lee, Yutaka Matsuo, Shixiang Shane Gu, and Izzeddin Gur. Multimodal web navigation with instruction-finetuned foundation models. arXiv preprint arXiv:2305.11854, 2023

  13. [13]

    Assistgui: Task-oriented desktop graphical user interface automation

    Difei Gao, Lei Ji, Zechen Bai, Mingyu Ouyang, Peiran Li, Dongxing Mao, Qinchen Wu, Weichen Zhang, Peiyi Wang, Xiangwu Guo, et al. Assistgui: Task-oriented desktop graphical user interface automation. arXiv preprint arXiv:2312.13108, 2023

  14. [14]

    PPTC Benchmark: Evaluating Large Language Models for PowerPoint Task Completion

    Yiduo Guo, Zekai Zhang, Yaobo Liang, Dongyan Zhao, and Nan Duan. Pptc benchmark: Evaluating large language models for powerpoint task completion. arXiv preprint arXiv:2311.01767, 2023

  15. [15]

    A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis

    Izzeddin Gur, Hiroki Furuta, Austin Huang, Mustafa Safdari, Yutaka Matsuo, Douglas Eck, and Aleksandra Faust. A real-world webagent with planning, long context understanding, and program synthesis. arXiv preprint arXiv:2307.12856, 2023

  16. [16]

    Webvoyager: Building an end-to-end web agent with large multimodal models

    Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models. arXiv preprint arXiv:2401.13919, 2024

  17. [17]

    CogAgent: A Visual Language Model for GUI Agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. arXiv preprint arXiv:2312.08914, 2023

  18. [18]

    A data-driven approach for learning to control computers

    Peter C Humphreys, David Raposo, Tobias Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers. In International Conference on Machine Learning, pages 9466–9482. PMLR, 2022

  19. [19]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  20. [20]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues? arXiv preprint arXiv:2310.06770, 2023

  21. [21]

    OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web

    Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem Alshikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. arXiv preprint arXiv:2402.17553, 2024

  22. [22]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks. arXiv preprint arXiv:2401.13649, 2024

  23. [23]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023

  24. [24]

    Devbench: A comprehensive benchmark for software development

    Bowen Li, Wenhan Wu, Ziwei Tang, Lin Shi, John Yang, Jinyang Li, Shunyu Yao, Chen Qian, Binyuan Hui, Qicheng Zhang, et al. Devbench: A comprehensive benchmark for software development. arXiv preprint arXiv:2403.08604, 2024

  25. [25]

    SheetCopilot: Bringing Software Productivity to the Next Level through Large Language Models

    Hongxin Li, Jingran Su, Yuntao Chen, Qing Li, and Zhaoxiang Zhang. Sheetcopilot: Bringing software productivity to the next level through large language models. arXiv preprint arXiv:2305.19308, 2023

  26. [26]

    Silkie: Preference Distillation for Large Visual Language Models

    Lei Li, Zhihui Xie, Mukai Li, Shunian Chen, Peiyi Wang, Liang Chen, Yazheng Yang, Benyou Wang, and Lingpeng Kong. Silkie: Preference distillation for large visual language models. arXiv preprint arXiv:2312.10665, 2023

  27. [27]

    Mapping Natural Language Instructions to Mobile UI Action Sequences

    Yang Li, Jiacong He, Xin Zhou, Yuan Zhang, and Jason Baldridge. Mapping natural language instructions to mobile ui action sequences. arXiv preprint arXiv:2005.03776, 2020

  28. [28]

    Man-Computer Symbiosis

    J. C. R. Licklider. Man-computer symbiosis. IRE Transactions on Human Factors in Electronics, HFE-1(1):4–11, 1960. doi: 10.1109/THFE2.1960.4503259

  29. [29]

    Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system

    Xi Victoria Lin, Chenglong Wang, Luke Zettlemoyer, and Michael D Ernst. Nl2bash: A corpus and semantic parser for natural language interface to the linux operating system. arXiv preprint arXiv:1802.08979, 2018

  30. [30]

    Reinforcement Learning on Web Interfaces Using Workflow-Guided Exploration

    Evan Zheran Liu, Kelvin Guu, Panupong Pasupat, Tianlin Shi, and Percy Liang. Reinforcement learning on web interfaces using workflow-guided exploration. arXiv preprint arXiv:1802.08802, 2018

  31. [31]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023

  32. [32]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents. arXiv preprint arXiv:2308.03688, 2023

  33. [33]

    WebLINX: Real-World Website Navigation with Multi-Turn Dialogue

    Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue. arXiv preprint arXiv:2402.05930, 2024

  34. [34]

    AgentBoard: An Analytical Evaluation Board of Multi-Turn LLM Agents

    Chang Ma, Junlei Zhang, Zhihao Zhu, Cheng Yang, Yujiu Yang, Yaohui Jin, Zhenzhong Lan, Lingpeng Kong, and Junxian He. Agentboard: An analytical evaluation board of multi-turn llm agents. arXiv preprint arXiv:2401.13178, 2024

  35. [35]

    Introducing Meta Llama 3: The most capable openly available LLM to date

    Meta AI. Introducing Meta Llama 3: The most capable openly available LLM to date, April 2024. URL https://ai.meta.com/blog/meta-llama-3/. Accessed: 2024-04-18

  36. [36]

    See [35].

  37. [37]

    GAIA: a benchmark for General AI Assistants

    Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. Gaia: a benchmark for general ai assistants. arXiv preprint arXiv:2311.12983, 2023

  38. [38]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021

  39. [39]

    Screenagent: A vision language model-driven computer control agent

    Runliang Niu, Jindong Li, Shiqi Wang, Yali Fu, Xiyu Hu, Xueyuan Leng, He Kong, Yi Chang, and Qi Wang. Screenagent: A vision language model-driven computer control agent. arXiv preprint arXiv:2402.07945, 2024

  40. [40]

    GPT-4 Technical Report

    OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  41. [41]

    Android in the wild: A large-scale dataset for android device control

    Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the wild: A large-scale dataset for android device control. arXiv preprint arXiv:2307.10088, 2023

  42. [42]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  43. [43]

    An empirical study & evaluation of modern CAPTCHAs

    Andrew Searles, Yoshimichi Nakatsuka, Ercan Ozturk, Andrew Paverd, Gene Tsudik, and Ai Enkoji. An empirical study & evaluation of modern CAPTCHAs. In 32nd USENIX Security Symposium (USENIX Security 23), pages 3081–3097, 2023

  44. [44]

    From pixels to ui actions: Learning to follow instructions via graphical user interfaces

    Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to ui actions: Learning to follow instructions via graphical user interfaces. arXiv preprint arXiv:2306.00245, 2023

  45. [45]

    World of bits: An open-domain platform for web-based agents

    Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of bits: An open-domain platform for web-based agents. In International Conference on Machine Learning, pages 3135–3144. PMLR, 2017

  46. [46]

    Design2code: How far are we from automating front-end engineering?, 2024

    Chenglei Si, Yanzhe Zhang, Zhengyuan Yang, Ruibo Liu, and Diyi Yang. Design2code: How far are we from automating front-end engineering?, 2024

  47. [47]

    Hierarchical Prompting Assists Large Language Model on Web Navigation

    Abishek Sridhar, Robert Lo, Frank F Xu, Hao Zhu, and Shuyan Zhou. Hierarchical prompting assists large language model on web navigation. arXiv preprint arXiv:2305.14257, 2023

  48. [48]

    Meta-gui: Towards multi-modal conversational agents on mobile gui

    Liangtai Sun, Xingyu Chen, Lu Chen, Tianle Dai, Zichen Zhu, and Kai Yu. Meta-gui: Towards multi-modal conversational agents on mobile gui. arXiv preprint arXiv:2205.11029, 2022

  49. [49]

    Cradle: Empowering Foundation Agents Towards General Computer Control

    Weihao Tan, Ziluo Ding, Wentao Zhang, Boyu Li, Bohan Zhou, Junpeng Yue, Haochong Xia, Jiechuan Jiang, Longtao Zheng, Xinrun Xu, et al. Towards general computer control: A multimodal agent for red dead redemption ii as a case study. arXiv preprint arXiv:2403.03186, 2024

  50. [50]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  51. [51]

    AndroidEnv: A Reinforcement Learning Platform for Android

    Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: A reinforcement learning platform for android. arXiv preprint arXiv:2105.13231, 2021

  52. [52]

    Ugif: Ui grounded instruction following

    Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan. Ugif: Ui grounded instruction following. arXiv preprint arXiv:2211.07615, 2022

  53. [53]

    Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

    Junyang Wang, Haiyang Xu, Jiabo Ye, Ming Yan, Weizhou Shen, Ji Zhang, Fei Huang, and Jitao Sang. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024

  54. [54]

    Empowering llm to use smartphone for intelligent task automation

    Hao Wen, Yuanchun Li, Guohong Liu, Shanhui Zhao, Tao Yu, Toby Jia-Jun Li, Shiqi Jiang, Yunhao Liu, Yaqin Zhang, and Yunxin Liu. Empowering llm to use smartphone for intelligent task automation. arXiv preprint arXiv:2308.15272, 2023

  55. [55]

    OpenAgents: An Open Platform for Language Agents in the Wild

    Tianbao Xie, Fan Zhou, Zhoujun Cheng, Peng Shi, Luoxuan Weng, Yitao Liu, Toh Jing Hua, Junning Zhao, Qian Liu, Che Liu, Leo Z. Liu, Yiheng Xu, Hongjin Su, Dongchan Shin, Caiming Xiong, and Tao Yu. Openagents: An open platform for language agents in the wild. CoRR, abs/2310.10634, 2023. doi: 10.48550/arXiv.2310.10634. URL https://doi.org/10.48550/arXiv.2...

  56. [56]

    Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation

    An Yan, Zhengyuan Yang, Wanrong Zhu, Kevin Lin, Linjie Li, Jianfeng Wang, Jianwei Yang, Yiwu Zhong, Julian McAuley, Jianfeng Gao, et al. Gpt-4v in wonderland: Large multimodal models for zero-shot smartphone gui navigation. arXiv preprint arXiv:2311.07562, 2023

  57. [57]

    Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V

    Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023

  58. [58]

    InterCode: Standardizing and Benchmarking Interactive Coding with Execution Feedback

    John Yang, Akshara Prabhakar, Karthik Narasimhan, and Shunyu Yao. Intercode: Standardizing and benchmarking interactive coding with execution feedback. arXiv preprint arXiv:2306.14898, 2023

  59. [59]

    Webshop: Towards scalable real-world web interaction with grounded language agents

    Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757, 2022

  60. [60]

    UFO: A UI-Focused Agent for Windows OS Interaction

    Chaoyun Zhang, Liqun Li, Shilin He, Xu Zhang, Bo Qiao, Si Qin, Minghua Ma, Yu Kang, Qingwei Lin, Saravan Rajmohan, et al. Ufo: A ui-focused agent for windows os interaction. arXiv preprint arXiv:2402.07939, 2024

  61. [61]

    Appagent: Multimodal agents as smartphone users

    Chi Zhang, Zhao Yang, Jiaxuan Liu, Yucheng Han, Xin Chen, Zebiao Huang, Bin Fu, and Gang Yu. Appagent: Multimodal agents as smartphone users. arXiv e-prints, pages arXiv–2312, 2023

  62. [62]

    Mobile-env: An evaluation platform and benchmark for interactive agents in llm era

    Danyang Zhang, Lu Chen, and Kai Yu. Mobile-env: A universal platform for training and evaluation of mobile interaction. arXiv preprint arXiv:2305.08144, 2023

  63. [63]

    Large language models are semi-parametric reinforcement learning agents

    Danyang Zhang, Lu Chen, Situo Zhang, Hongshen Xu, Zihan Zhao, and Kai Yu. Large language models are semi-parametric reinforcement learning agents. Advances in Neural Information Processing Systems, 36, 2024

  64. [64]

    You only look at screens: Multimodal chain-of-action agents

    Zhuosheng Zhang and Aston Zhang. You only look at screens: Multimodal chain-of-action agents. arXiv e-prints, pages arXiv–2309, 2023

  65. [65]

    Tie: Topological information enhanced structural reading comprehension on web pages

    Zihan Zhao, Lu Chen, Ruisheng Cao, Hongshen Xu, Xingyu Chen, and Kai Yu. Tie: Topological information enhanced structural reading comprehension on web pages. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1808–1821, 2022

  66. [66]

    Gpt-4v (ision) is a generalist web agent, if grounded

    Boyuan Zheng, Boyu Gou, Jihyung Kil, Huan Sun, and Yu Su. Gpt-4v (ision) is a generalist web agent, if grounded. arXiv preprint arXiv:2401.01614, 2024

  67. [67]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Yonatan Bisk, Daniel Fried, Uri Alon, et al. Webarena: A realistic web environment for building autonomous agents. arXiv preprint arXiv:2307.13854, 2023

  68. [68]

    MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models

    Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023