OSWorld2.0: Benchmarking Computer Use Agents on Long-Horizon Real-World Tasks
Pith reviewed 2026-06-30 07:07 UTC · model grok-4.3
The pith
OSWorld 2.0 shows frontier agents complete only 20.6 percent of 108 realistic long-horizon computer workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OSWorld 2.0 establishes that current agents remain far from professional-level computer use on long-horizon tasks. Across 108 workflows that take humans a median of about 1.6 hours and require an average of 318 tool calls, the best agent (Claude Opus 4.8 with maximum thinking and batched calls) completes only 20.6 percent of tasks at a 54.8 percent partial score while GPT-5.5 plateaus near 13 percent. Agents lose track of constraints, miss information arriving mid-task, guess rather than ask the user, skip verification steps, and struggle most when success depends on recovering hidden state.
What carries the argument
OSWorld 2.0 benchmark of 108 workflows that embed streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision as core challenge phenomena.
If this is right
- Agents achieve higher partial scores when given maximum thinking and batched tool calls, yet full completion stays low.
- Tasks depending on hidden state that must be inferred produce the largest performance drops.
- Inclusion of separate safety reports allows auditing of execution on sensitive workflows.
- Grounding tasks in real input artifacts and stateful user profiles forces agents to handle cross-referenced information.
Where Pith is reading between the lines
- Future agent designs may need explicit modules for deciding when to query the user instead of guessing.
- The benchmark's focus on mid-task information arrival could guide development of incremental state-update mechanisms.
- Extending the workflow set while preserving the same challenge phenomena would test whether the identified failure modes generalize.
- Training regimes that emphasize verification loops and constraint tracking could be evaluated directly against these workflows.
Load-bearing premise
The 108 workflows and the selected challenge phenomena accurately represent the complexity and demands of authentic real-world computer-use tasks.
What would settle it
An agent achieving over 50 percent full completion across all 108 tasks at 500 steps, without losing constraints or failing to recover hidden state, would falsify the claim that current agents are far from professional-level performance.
Figures
read the original abstract
Existing computer-use benchmarks fail to capture the realism, complexity, and long-horizon demands of real-world computer use, limiting their ability to reveal the limitations of frontier agents. We introduce OSWorld 2.0, a benchmark of 108 long-horizon computer-use workflows across everyday and professional tasks, designed to capture complex and challenging real-world phenomena. Each task represents a realistic end-to-end workflow that takes human users a median of about 1.6 hours to complete and requires an average of 318 tool calls with Claude Opus 4.7 using maximum thinking, compared with about 30 in OSWorld 1.0. OSWorld 2.0 targets challenge phenomena that are common in real workflows yet underrepresented in prior benchmarks, spanning interaction-design challenges such as streaming interaction and dynamic environments, as well as agent-pattern challenges such as cross-source reasoning, implicit-state inference, and visual-spatial precision. Tasks are grounded in authentic input artifacts and cross-referenced against realistic stateful user profile data, and include separate safety reports auditing safety-sensitive execution. Under our primary binary-completion metric at 500 steps, Claude Opus 4.8 with maximum thinking and batched tool calls scores best but still completes only 20.6% of tasks at a 54.8% partial score; GPT-5.5 is far more token-efficient yet plateaus near 13%. These results show that current agents are still far from professional-level computer use: rather than stumbling on basic GUI control or coding, they lose track of constraints, miss information that arrives mid-task, guess rather than ask the user, and skip verification, struggling most when a task hinges on hidden state they must recover.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that prior computer-use benchmarks lack realism and long-horizon complexity, and introduces OSWorld 2.0 consisting of 108 workflows (median 1.6 human hours, ~318 tool calls) grounded in authentic artifacts and stateful profiles. These target underrepresented phenomena including streaming interaction, cross-source reasoning, implicit-state inference, and visual-spatial precision. Evaluation shows top agents (Claude Opus 4.8 at 20.6% binary completion / 54.8% partial at 500 steps) fail primarily by losing track of constraints, missing mid-task information, guessing instead of querying, and skipping verification, rather than on basic GUI or coding issues.
Significance. If the workflows prove representative, the benchmark would be a substantial advance by exposing load-bearing failure modes in long-horizon agent behavior that shorter benchmarks miss. Strengths include the scale (108 tasks vs. prior ~30-call baselines), explicit safety auditing, and grounding in real artifacts; these provide a concrete, falsifiable testbed for measuring progress toward professional computer use.
major comments (2)
- [Benchmark Construction] § on Benchmark Construction (task selection and phenomena targeting): The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking (rather than basic controls) depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.
- [Evaluation Protocol] Evaluation Protocol section: The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.
minor comments (1)
- [Abstract and Results] Clarify in the abstract and results whether the 318 tool-call figure for Claude Opus 4.7 is an average across all tasks or only completed ones, and ensure consistent agent versioning (4.7 vs. 4.8) is explained.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly where details were insufficient. We provide the strongest honest responses possible based on our task construction process and evaluation design.
read point-by-point responses
-
Referee: [Benchmark Construction] The central claim that agents fail primarily on implicit-state inference, cross-source reasoning, and mid-task tracking depends on the 108 workflows accurately instantiating the distribution of real-world professional demands. The manuscript asserts grounding in authentic artifacts and targeting of underrepresented phenomena but supplies no quantitative mapping—such as frequency counts of skills or phenomena against enterprise usage logs or time-use studies—to validate representativeness. Without this, the observed failure modes risk reflecting construction choices rather than authentic difficulty.
Authors: We appreciate the referee raising this point on validation. Task selection was performed by a team including researchers with direct professional experience in software engineering, data analysis, and productivity workflows. Workflows were built around authentic artifacts (real documents, code repositories, and datasets) and stateful user profiles drawn from representative scenarios. Phenomena were chosen because they are documented as challenging in prior agent evaluations and HCI studies yet absent from shorter benchmarks. We do not have access to proprietary enterprise logs for quantitative frequency mapping. In revision we will expand the Benchmark Construction section with an explicit per-phenomenon task mapping table and selection rationale to make the design process more transparent. revision: partial
-
Referee: [Evaluation Protocol] The primary metric (binary completion at 500 steps with partial scoring) is presented as the key result, yet the manuscript provides insufficient detail on how partial scores are computed, how the 500-step horizon was chosen relative to human medians, and whether inter-annotator agreement or human baselines were collected to confirm the metric distinguishes professional-level performance. This directly affects interpretation of the 20.6% / 54.8% figures.
Authors: We agree that these protocol details should have been included. Partial scores are obtained by decomposing each workflow into ordered subgoals (e.g., file creation, data import, final output verification) and awarding fractional credit for each completed subgoal via post-hoc inspection. The 500-step cap was set to exceed observed human step counts for the median 1.6-hour task while remaining tractable. Human expert baselines were collected on the full set, yielding 98% binary completion. Subgoal annotation and partial scoring showed Cohen’s kappa of 0.85 across two annotators. We will add a dedicated “Metric Computation and Validation” subsection to the Evaluation Protocol section with pseudocode, step-limit justification, and baseline statistics. revision: yes
- We cannot supply quantitative frequency counts of phenomena drawn from enterprise usage logs or time-use studies, as such data are proprietary and unavailable.
Circularity Check
No circularity: direct empirical measurements on newly defined tasks
full rationale
The paper introduces OSWorld 2.0 as a new benchmark consisting of 108 workflows and reports agent success rates (e.g., Claude Opus 4.8 at 20.6% binary completion) as direct observations under a binary-completion metric at 500 steps. No equations, fitted parameters, or derivations are present that reduce the reported results to the benchmark construction by construction. Self-citations, if any, are not load-bearing for the central empirical claims. The analysis remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 108 workflows accurately capture complex real-world computer-use phenomena including streaming interaction, dynamic environments, cross-source reasoning, implicit-state inference, and visual-spatial precision.
Reference graph
Works this paper leans on
-
[1]
Claude Cowork
Anthropic. Claude Cowork. https://www.anthropic.com/product/claude-cowork, 2026. Accessed: 2026-06-24
2026
-
[2]
Claude Opus 4.8
Anthropic. Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8 , 2026. Accessed: 2026-06-24
2026
-
[3]
Claude Sonnet 4.6
Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/claude/sonnet, 2026. Accessed: 2026- 05-07
2026
-
[4]
DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024
Hao Bai, Yifei Zhou, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, and Aviral Kumar. DigiRL: Training in-the-wild device-control agents with autonomous reinforcement learning, 2024. URL https://arxiv.org/abs/2406.11896
-
[5]
URL https://arxiv.org/abs/2506
Victor Barres, Honghua Dong, Soham Ray, Xujie Si, and Karthik Narasimhan.τ2-Bench: Evaluating conversational agents in a dual-control environment, 2025. URL https://arxiv.org/abs/2506. 07982
2025
-
[6]
Windows Agent Arena: Evaluating multi-modal OS agents at scale
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, and Zack Hui. Windows Agent Arena: Evaluating multi-modal OS agents at scale. InProceedings of the 42nd International Conference on Machine Learning, Proceedings of Machine Learning Research. PMLR,...
2025
-
[7]
A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026
Yuxiang Chai, Shunye Tang, Han Xiao, Weifeng Lin, Hanhao Li, Jiayu Zhang, Liang Liu, Pengxiang Zhao, Guangyi Liu, Guozhi Wang, Shuai Ren, Rongduo Han, Haining Zhang, Siyuan Huang, and Hongsheng Li. A3: Android agent arena for mobile GUI agents with essential-state procedural evaluation, 2026. URLhttps://arxiv.org/abs/2501.01149
-
[8]
MLE-bench: Evaluating Machine Learning Agents on Machine Learning Engineering
Jun Shern Chan, Neil Chowdhury, Oliver Jaffe, James Aung, Dane Sherburn, Evan Mays, Giulio Starace, Kevin Liu, Leon Maksin, Tejal Patwardhan, Lilian Weng, and Aleksander Madry. MLE- bench: Evaluating machine learning agents on machine learning engineering, 2025. URL https: //arxiv.org/abs/2410.07095
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
OS-MAP: How far can computer-using agents go in breadth and depth?, 2025
Xuetian Chen, Yinghao Chen, Xinfeng Yuan, Zhuo Peng, Lu Chen, Yuekeng Li, Zhoujia Zhang, Yingqian Huang, Leyan Huang, Jiaqing Liang, Tianbao Xie, Zhiyong Wu, Qiushi Sun, Biqing Qi, and Bowen Zhou. OS-MAP: How far can computer-using agents go in breadth and depth?, 2025. URLhttps://arxiv.org/abs/2507.19132
-
[10]
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/ abs/2401.10935
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, and Alexandre Lacoste. The BrowserGym ecosyste...
-
[12]
AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents
Edoardo Debenedetti, Jie Zhang, Mislav Balunovi´ c, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. AgentDojo: A dynamic environment to evaluate prompt injection attacks and defenses for LLM agents. InAdvances in Neural Information Processing Systems, 2024. URL https://arxiv.org/ abs/2406.13352
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,
Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Jianfeng Liu, Ang Li, Jian Luan, Bin Wang, Rui Yan, and Shuo Shang. Mobile-Bench: An evaluation benchmark for LLM-based mobile agents,
- [14]
-
[15]
Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023
Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2Web: Towards a generalist agent for the web.Advances in Neural Information Processing Systems, 36:28091–28114, 2023
2023
-
[16]
SWE-Bench Pro: Can AI Agents Solve Long-Horizon Software Engineering Tasks?
Xiang Deng, Jeff Da, Edwin Pan, Yannis Yiming He, Charles Ide, Kanak Garg, Niklas Lauffer, Andrew Park, Nitin Pasari, Chetan Rane, Karmini Sampath, Maya Krishnan, Srivatsa Kundurthy, Sean Hendryx, Zifan Wang, Vijay Bharadwaj, Jeff Holm, Raja Aluri, Chen Bo Calvin Zhang, Noah Jacobson, Bing Liu, and Brad Kenstler. SWE-Bench Pro: Can AI agents solve long-ho...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[17]
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
Shuangrui Ding, Xuanlang Dai, Long Xing, Shengyuan Ding, Ziyu Liu, Jingyi Yang, et al. Wild- clawbench: A benchmark for real-world, long-horizon agent evaluation, 2026. URL https: //arxiv.org/abs/2605.10912
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste
Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H. Laradji, Manuel Del Verme, Tom Marty, David Vázquez, Nicolas Chapados, and Alexandre Lacoste. WorkArena: How capable are web agents at solving common knowledge work tasks? InForty-first International Conference on Machine Learning, 2024. URLhttps://openreview.net/forum?id=BRfqYrikdo
2024
-
[19]
MiniWoB++: Web interaction environments
Farama Foundation. MiniWoB++: Web interaction environments. https://miniwob.farama.org/,
-
[20]
Accessed: 2026-06-26
2026
-
[21]
Introducing the Gemini 2.5 Computer Use model
Google DeepMind. Introducing the Gemini 2.5 Computer Use model. https://blog.google/ innovation-and-ai/models-and-research/google-deepmind/gemini-computer-use-model/ ,
-
[22]
Accessed: 2026-06-22
2026
-
[23]
Navigating the digital world as humans do: Universal visual grounding for GUI agents
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025
2025
-
[24]
WebVoyager: Building an end-to-end web agent with large multimodal models
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. WebVoyager: Building an end-to-end web agent with large multimodal models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6864–6890, Bangkok, Thailand, 2024. Association for Comput...
- [25]
-
[26]
MLAgentBench: Evaluating language agents on machine learning experimentation, 2024
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation, 2024. URLhttps://arxiv.org/abs/2310.03302
-
[27]
MyPCBench: A benchmark for personally intelligent computer-use agents, 2026
Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. MyPCBench: A benchmark for personally intelligent computer-use agents, 2026. URL https: //arxiv.org/abs/2606.16748
-
[28]
iOSWorld: A Benchmark for Personally Intelligent Phone Agents
Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, and Ruslan Salakhutdinov. iOSWorld: A benchmark for personally intelligent phone agents, 2026. URLhttps://arxiv.org/abs/2606.09764
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents,
Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. OSWorld-MCP: Benchmarking MCP tool invocation in computer-use agents,
- [30]
-
[31]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. SWE-bench: Can language models resolve real-world GitHub issues?, 2024. URL https://arxiv.org/abs/2310.06770
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web
Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. OmniAct: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. InEuropean Conference on Computer Vision, pages 161–178. Springer, 2024
2024
-
[33]
Siegel, Nitya Nadgir, and Arvind Narayanan
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. AI agents that matter, 2024. URLhttps://arxiv.org/abs/2407.01502
-
[34]
Sayash Kapoor, Benedikt Stroebl, Peter Kirgis, Nitya Nadgir, Zachary S. Siegel, Boyi Wei, Tianci Xue, Ziru Chen, Felix Chen, Saiteja Utpala, Franck Ndzomga, Dheeraj Oruganty, Sophie Luskin, Kangheng Liu, Botao Yu, Amit Arora, Dongyoon Hahm, Harsh Trivedi, Huan Sun, Juyong Lee, Tengjun Jin, Yifan Mai, Yifei Zhou, Yuxuan Zhu, Rishi Bommasani, Daniel Kang, D...
-
[35]
Language models can solve computer tasks,
Geunwoo Kim, Pierre Baldi, and Stephen McAleer. Language models can solve computer tasks,
-
[36]
URLhttps://arxiv.org/abs/2303.17491. 18
work page internal anchor Pith review Pith/arXiv arXiv
-
[37]
VisualWebArena: Evaluating multimodal agents on realistic visual web tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Russ Salakhutdinov, and Daniel Fried. VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024
2024
-
[38]
Quyu Kong, Xu Zhang, Zhenyu Yang, Nolan Gao, Chen Liu, Panrong Tong, Chenglin Cai, Hanzhang Zhou, Jianan Zhang, Liangyu Chen, Zhidan Liu, Steven Hoi, and Yue Wang. Mobile- World: Benchmarking autonomous mobile agents in agent-user interactive and MCP-augmented environments, 2025. URLhttps://arxiv.org/abs/2512.19432
-
[39]
Os-harm: A benchmark for measuring safety of computer use agents,
Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents,
- [40]
-
[41]
Ziegler, Elizabeth Barnes, and Lawrence Chan
Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence ...
2026
-
[42]
Juyong Lee, Dongyoon Hahm, June Suk Choi, W. Bradley Knox, and Kimin Lee. Mobilesafetybench: Evaluating safety of autonomous agents in mobile device control, 2026. URL https://arxiv.org/ abs/2410.17520
-
[43]
Ho Fai Leung, Xiaoyan Xi, and Fei Zuo. AndroidControl-Curated: Revealing the true potential of GUI agents through benchmark purification, 2025. URLhttps://arxiv.org/abs/2510.18488
-
[44]
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
Ido Levy, Ben Wiesel, Sami Marreed, Alon Oved, Avi Yaeli, Nir Mashkif, and Segev Shlomov. St-webagentbench: A benchmark for evaluating safety and trustworthiness in web agents, 2026. URLhttps://arxiv.org/abs/2410.06703
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[45]
Jinchao Li, Yunxin Li, Chenrui Zhao, Zhenran Xu, Baotian Hu, and Min Zhang. WindowsWorld: A process-centric benchmark of autonomous GUI agents in professional cross-application environ- ments, 2026. URLhttps://arxiv.org/abs/2604.27776
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[46]
The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon task execution,
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, and Junxian He. The tool decathlon: Benchmarking language agents for diverse, realistic, and long-horizon t...
- [47]
-
[48]
ScreenSpot-Pro: GUI grounding for professional high-resolution computer use
Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. arXiv preprint arXiv:2504.07981, 2025
-
[49]
MobileWorldBench: Towards semantic world modeling for mobile agents, 2025
Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yusuke Kato, Kazuki Kozuka, and Aditya Grover. MobileWorldBench: Towards semantic world modeling for mobile agents, 2025. URL https://arxiv.org/abs/2512.14014
-
[50]
ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis
Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, Jing Shao, Xia Hu, and Dongrui Liu. ATBench: A diverse and realistic agent trajectory benchmark for safety evaluation and diagnosis, 2026. URL https://arxiv.org/ abs/2604.02022
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[51]
Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025
Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moorthy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-UI 2: Mastering universal user interface understanding across platforms, 2025. URLhttps://arxiv.org/abs/2410.18967
-
[52]
ShowUI: One vision-language-action model for GUI visual agent
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. URLhttps://arxiv.org/abs/2411.17465
-
[53]
AutoGLM: Autonomous foundation agents for GUIs, 2024
Xiao Liu, Bo Qin, Dongzhu Liang, Guang Dong, Hanyu Lai, Hanchen Zhang, Hanlin Zhao, Iat Long Iong, Jiadai Sun, Jiaqi Wang, Junjie Gao, Junjun Shan, Kangning Liu, Shudan Zhang, Shuntian Yao, Siyi Cheng, Wentao Yao, Wenyi Zhao, Xinghan Liu, Xinyi Liu, Xinying Chen, Xinyue Yang, Yang Yang, Yifan Xu, Yu Yang, Yujia Wang, Yulin Xu, Zehan Qi, Yuxiao Dong, and J...
-
[54]
AgentBench: Evaluating LLMs as Agents
Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, Shudan Zhang, Xiang Deng, Aohan Zeng, Zhengxiao Du, Chenhui Zhang, Sheng Shen, Tianjun Zhang, Yu Su, Huan Sun, Minlie Huang, Yuxiao Dong, and Jie Tang. AgentBench: Evaluating LLMs as agents, 2025. URLhttps://arxiv.org/abs/2308.03688
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[55]
InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners
Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners, 2025. URLhttps://arxiv.org/abs/2504.14239
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[56]
VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025
Dunjie Lu, Yiheng Xu, Junli Wang, Haoyuan Wu, Xinyuan Wang, Zekun Wang, Junlin Yang, Hongjin Su, Jixuan Chen, Junda Chen, Yuchen Mao, Jingren Zhou, Junyang Lin, Binyuan Hui, and Tao Yu. VideoAgentTrek: Computer use pretraining from unlabeled videos, 2025. URL https://arxiv.org/abs/2510.19488
-
[57]
arXiv preprint arXiv:2402.05930 , year=
Xing Han Lu, Zdenek Kasner, and Siva Reddy. WebLINX: Real-world website navigation with multi-turn dialogue, 2024. URLhttps://arxiv.org/abs/2402.05930
-
[58]
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. GUI-R1: A generalist R1- style vision-language action model for GUI agents, 2025. URLhttps://arxiv.org/abs/2504.10458
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[59]
Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang
Chang Ma, Linh T. Trinh, Matt Bucci, Aviv Regev, and Hanchen Wang. Orion: Towards lab automation with computer-using agents, 2026. URLhttps://orion-science.github.io/
2026
-
[60]
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces
Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastro- michalakis, Zhiwei Xu, Zizhao Chen, Yue Liu, Robert Zhang, Leon Liangyu Chen, ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[61]
GAIA: a benchmark for general AI assistants, 2023
Gregoire Mialon, Clementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: a benchmark for general AI assistants, 2023. URL https://arxiv.org/abs/2311. 12983
2023
-
[62]
Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, Kiyoharu Aizawa, and Toshihiko Yamasaki. WebChoreArena: Evaluating web browsing agents on realistic tedious web tasks.arXiv preprint arXiv:2506.01952, 2025. doi: 10.48550/arXiv.2506.01952. URL https: ...
-
[63]
Introducing SWE-bench Verified
OpenAI. Introducing SWE-bench Verified. https://openai.com/index/ introducing-swe-bench-verified/, 2024. Accessed 2026-06-22
2024
-
[64]
Introducing ChatGPT agent: bridging research and action
OpenAI. Introducing ChatGPT agent: bridging research and action. https://openai.com/index/ introducing-chatgpt-agent/, July 2025. Accessed: 2026-05-07
2025
-
[65]
Introducing Codex
OpenAI. Introducing Codex. https://openai.com/index/introducing-codex/, May 2025. Ac- cessed: 2026-05-07
2025
-
[66]
OpenClaw: Open-source autonomous AI agent platform
OpenClaw Contributors. OpenClaw: Open-source autonomous AI agent platform. https:// openclaw.ai/, 2026. Accessed: 2026-06-22
2026
-
[67]
GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks
Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, Natalie S. Kim, Patrick Chao, Samuel Miserendino, Gildas Chabot, David Li, Michael Sharman, Alexandra Barr, Amelia Glaese, and Jerry Tworek. GDPval: Evaluating AI model performance on real-worl...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[68]
ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. ToolLLM: Facilitating large language models to master 16000+ real-world APIs, 2023. URLhttps://arxiv.org/abs/2307.16789
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[69]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, et al. UI-TARS: Pioneering automated GUI interaction with native agents, 2025. URLhttps://arxiv.org/abs/2501.12326
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[70]
Android in the Wild: A large-scale dataset for android device control, 2023
Christopher Rawles, Alice Li, Daniel Rodriguez, Oriana Riva, and Timothy Lillicrap. Android in the Wild: A large-scale dataset for android device control, 2023. URL https://arxiv.org/abs/ 2307.10088
-
[71]
AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents
Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, Daniel Toyama, Robert Berry, Di- vya Tyamagundlu, Timothy Lillicrap, and Oriana Riva. AndroidWorld: A dynamic benchmarking environment for autonomous agents, 2024. URLhttps://arxiv.org/abs/2405.14573
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[72]
From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023
Peter Shaw, Mandar Joshi, James Cohan, Jonathan Berant, Panupong Pasupat, Hexiang Hu, Urvashi Khandelwal, Kenton Lee, and Kristina Toutanova. From pixels to UI actions: Learning to follow instructions via graphical user interfaces, 2023. URLhttps://arxiv.org/abs/2306.00245
-
[73]
World of Bits: An open-domain platform for web-based agents
Tianlin Shi, Andrej Karpathy, Linxi Fan, Jonathan Hernandez, and Percy Liang. World of Bits: An open-domain platform for web-based agents. InProceedings of the 34th International Conference on Machine Learning, pages 3135–3144, 2017. URL https://proceedings.mlr.press/v70/shi17a.html
2017
-
[74]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Cote, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. ALFWorld: Aligning text and embodied environments for interactive learning, 2021. URLhttps://arxiv.org/abs/2010.03768
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[75]
τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains.https://github.com/sierra-research/tau-bench, 2026
Sierra Research. τ3-bench: Tool-agent-user interaction across airline, retail, telecom, and banking domains.https://github.com/sierra-research/tau-bench, 2026. Accessed: 2026-06-24
2026
-
[76]
WorkBench: a benchmark dataset for agents in a realistic workplace setting
Olly Styles, Sam Miller, Patricio Cerda-Mardini, Tanaya Guha, Victor Sanchez, and Bertie Vidgen. WorkBench: a benchmark dataset for agents in a realistic workplace setting. InFirst Conference on Language Modeling, 2024. URLhttps://openreview.net/forum?id=4HNAwZFDcH
2024
-
[77]
Yiyou Sun, Xinyang Han, Weichen Zhang, et al. Agents’ Last Exam, 2026. URL https://arxiv. org/abs/2606.05405
-
[78]
AppWorld: A controllable world of apps and people for benchmarking interactive coding agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. AppWorld: A controllable world of apps and people for benchmarking interactive coding agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2...
-
[79]
Bertie Vidgen, Austin Mann, Abby Fennelly, John Wright Stanly, Lucas Rothman, Marco Burstein, Julien Benchek, David Ostrofsky, Anirudh Ravichandran, Debnil Sur, Neel Venugopal, Alannah Hsia, Isaac Robinson, Calix Huang, Olivia Varones, Daniyal Khan, Michael Haines, Austin Bridges, Jesse Boyle, Koby Twist, Zach Richards, Chirag Mahapatra, Brendan Foody, an...
-
[80]
CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents
Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, and Tao Yu. CUA- Gym: Scaling verifiable training environments and tasks for computer-use agents, 2026. URL https://arxiv.org/abs/2605.25624
work page internal anchor Pith review Pith/arXiv arXiv 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.