Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Chaobo Xian; Chao He; Chenchen Zhang; Chi Wu; Chuqian Yu; Fangzhi Xu; Ge Zhang; Guanhong Chen; Haodong Duan; Haojun Wang

arxiv: 2606.11042 · v1 · pith:4ITB7L5Bnew · submitted 2026-06-09 · 💻 cs.AI

Liya Zhu , Jingzhe Ding , Jian Zhang , Jianbo Xue , Shihao Liang , Ge Zhang , Xiang Gao , Qingshui Gu

Mailun Gao Huimin Che Yan Zhao Peiheng Zhou Haojun Wang Chaobo Xian Lili Le Chi Wu Yiwei Liu Shengda Long Jiale Yang Fangzhi Xu Sijin Wu Haodong Duan Yi Zhu Chao He Zhaojian Li Minchao Wang Huan Zhou Jiani Hou Chuqian Yu Weiran Shi Hongwan Gao Jiamin Chen Guanhong Chen Tingqin Luo Kaiyuan Zhang Zhixin Yao Qing Hua Yuhao Jiang Jin Chen Pu Chen Zhenyu Hu Xingyu Li Zhengxuan Jiang Meng Cao Tianfeng Long Haozhe Wang Mingzhang Wang Yichen Zhang Yiming Dai Chenchen Zhang Jiaying Wang Zhiyong Wu Shen Yan Yujia Qin Wenhao Huang Zaiyuan Wang Xiaolong Chang

Pith reviewed 2026-06-27 13:26 UTC · model grok-4.3

classification 💻 cs.AI

keywords Workflow-GYM · GUI agents · long-horizon tasks · professional workflows · computer-use agents · benchmark · agent evaluation · specialized software

The pith

Current AI agents succeed on only slightly more than 30 percent of long-horizon professional GUI workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Workflow-GYM, a benchmark that tests whether agents can follow instructions to complete extended sequences of actions inside domain-specific professional software. Experiments with leading models produce success rates just above 30 percent, showing that agents commonly lose track of multi-step goals. Failures include skipping required stages, letting early mistakes compound, drifting from the original objective, and misunderstanding how specialized tools work. These outcomes matter because the tasks mirror economically valuable work that requires sustained, autonomous operation rather than short or generic interactions.

Core claim

Workflow-GYM is a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, even the strongest achieve only slightly above 30 percent success rates. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. The findings provide insights into limitations of current agent systems and suggest directions for next-generation GUI-agent research.

What carries the argument

The Workflow-GYM benchmark, which supplies professional-domain workflows in specialized software to measure end-to-end autonomous completion by agents.

If this is right

Professional long-horizon GUI workflows remain highly challenging for current GUI agents.
Agents exhibit workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software.
Maintaining long-horizon workflow consistency is a core limitation that must be addressed.
The benchmark supplies concrete directions for developing next-generation GUI agents capable of economically valuable tasks.
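The four failure modes named above lend themselves to a simple tally once failed runs have been labeled. The following is a minimal sketch, not the paper's analysis code: the `FailureMode` enum, the `tally_failures` helper, and the example labels are all hypothetical, and it assumes each failed run receives exactly one label.

```python
from collections import Counter
from enum import Enum


class FailureMode(Enum):
    """Failure categories taken from the paper's abstract (names are ours)."""
    STAGE_OMISSION = "workflow stage omission"
    ERROR_PROPAGATION = "error propagation"
    OBJECTIVE_DRIFT = "objective drift"
    SOFTWARE_MISUNDERSTANDING = "insufficient understanding of professional software"


def tally_failures(labels: list[FailureMode]) -> dict[FailureMode, float]:
    """Return the share of failed runs attributed to each failure mode."""
    counts = Counter(labels)
    total = sum(counts.values())
    # Report every mode, including those never observed, so tables stay aligned.
    return {
        mode: (counts.get(mode, 0) / total if total else 0.0)
        for mode in FailureMode
    }


# Hypothetical labels for four failed runs; these are not data from the paper.
example_labels = [
    FailureMode.STAGE_OMISSION,
    FailureMode.STAGE_OMISSION,
    FailureMode.OBJECTIVE_DRIFT,
    FailureMode.ERROR_PROPAGATION,
]
shares = tally_failures(example_labels)
```

In practice a single failed run may show several modes at once; a multi-label variant would count each label separately and normalize by the number of runs rather than the number of labels.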

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread adoption of Workflow-GYM could shift evaluation focus from short toy tasks toward sequences that better predict usefulness in actual professional settings.
The observed failure modes suggest that progress may require new training signals that explicitly reward stage completion and objective adherence across dozens of steps.
Extending the benchmark to additional specialized domains would test whether the 30-percent ceiling is domain-specific or general.

Load-bearing premise

The chosen professional workflows and success criteria accurately reflect the complexity and economic value of real work without hidden human help or artificial simplifications.

What would settle it

A new agent model reaching success rates above 60 percent on the same Workflow-GYM tasks under the identical evaluation protocol would indicate that the reported challenges are not fundamental to current approaches.
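Whether a reported success rate really clears a threshold such as 60 percent depends on how many tasks were run. The sketch below uses a Wilson score interval to ask whether the lower bound of the success rate exceeds the threshold. The task counts are placeholders, since the number of Workflow-GYM tasks is not stated on this page, and the function names are ours.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z**2 / trials
    centre = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return centre - half_width, centre + half_width


def clears_threshold(successes: int, trials: int, threshold: float = 0.60) -> bool:
    """True only if the interval's lower bound is above the threshold."""
    lower, _ = wilson_interval(successes, trials)
    return lower > threshold


# Placeholder counts: 100 tasks is an assumption, not the benchmark's size.
low, high = wilson_interval(30, 100)
borderline = clears_threshold(62, 100)   # point estimate above 60%, interval is not
convincing = clears_threshold(80, 100)
```

The design choice here is to compare the interval's lower bound, not the point estimate, against the threshold, so a result that is only nominally above 60 percent on a small task set does not count as settling the question.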

Original abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Workflow-GYM adds a benchmark for long-horizon professional GUI tasks with reported low success rates, but task selection and evaluation details remain thin enough to limit how far the 30% claim can be taken.

Workflow-GYM introduces a benchmark aimed at long-horizon GUI tasks in professional software environments, and the experiments show even the strongest models reaching only slightly above 30% success. That is the central point worth noting.

The extension to domain-specific professional tools and extended workflows is the main new element compared with prior GUI benchmarks that stayed with general apps or short sequences. The paper also does a reasonable job of cataloging failure patterns such as stage omission, error buildup, goal drift, and weak grasp of specialized software.

The soft spots sit in the construction and measurement side. The abstract states the tasks target economically valuable work and autonomous end-to-end performance, yet supplies no concrete criteria for choosing the workflows, no justification for their economic representativeness, and no precise rules for scoring success or handling partial completion. Without those, the low success rates are difficult to read as clear evidence of agent limits rather than benchmark choices. The stress-test concern about representativeness and hidden assistance holds up on the given text.
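To illustrate why the scoring rule matters, here is a minimal sketch contrasting three ways of scoring the same runs: strict end-to-end success, per-stage partial credit, and credit only for stages completed before the first failure. The `TaskRun` record, the stage flags, and the three example runs are hypothetical and are not drawn from the paper's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass
class TaskRun:
    """One agent attempt at one workflow (hypothetical record format)."""
    task_id: str
    stages_passed: list[bool]  # one flag per workflow stage, in order


def strict_success(run: TaskRun) -> bool:
    """End-to-end scoring: every stage must pass."""
    return bool(run.stages_passed) and all(run.stages_passed)


def partial_credit(run: TaskRun) -> float:
    """Fraction of stages passed, regardless of where failures occur."""
    if not run.stages_passed:
        return 0.0
    return sum(run.stages_passed) / len(run.stages_passed)


def prefix_credit(run: TaskRun) -> float:
    """Fraction of stages passed before the first failure."""
    if not run.stages_passed:
        return 0.0
    passed = 0
    for ok in run.stages_passed:
        if not ok:
            break
        passed += 1
    return passed / len(run.stages_passed)


def mean(values: list[float]) -> float:
    return sum(values) / len(values) if values else 0.0


# Three made-up runs over a four-stage workflow.
runs = [
    TaskRun("task-a", [True, True, True, True]),
    TaskRun("task-b", [True, False, True, True]),
    TaskRun("task-c", [False, True, True, True]),
]
strict_rate = mean([float(strict_success(r)) for r in runs])
partial_rate = mean([partial_credit(r) for r in runs])
prefix_rate = mean([prefix_credit(r) for r in runs])
```

On these made-up runs the three rules give noticeably different aggregate scores, which is the point of the concern: a headline number cannot be interpreted without knowing which rule produced it.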

This work is aimed at researchers building computer-use agents who need tests that go beyond toy environments. A reader interested in benchmark design or in mapping current agent weaknesses could extract some value from the listed failure modes.

I would send it to peer review. A benchmark paper in this direction can be worth referee time if the methods section is strengthened with explicit task-selection rules and evaluation protocol.

Referee Report

1 major / 0 minor

Summary. The paper introduces Workflow-GYM, a benchmark for long-horizon GUI agent tasks in professional domains using specialized software. Experiments on state-of-the-art models report success rates slightly above 30%, with analysis showing struggles in workflow consistency including stage omission, error propagation, objective drift, and insufficient domain understanding. The work positions this as evidence that professional long-horizon GUI workflows remain highly challenging for current agents.

Significance. If the selected workflows prove representative of economically valuable professional tasks and the evaluation protocol measures fully autonomous end-to-end performance, the benchmark would fill a notable gap in GUI agent evaluation and provide concrete evidence of current limitations, guiding future research on long-horizon consistency and domain-specific software understanding.

major comments (1)

[Abstract] The central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the concern about the abstract below and agree to make revisions to improve clarity and verifiability of our central claims.

Referee: [Abstract] The central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.

Authors: We agree that the abstract would benefit from additional details to make the key claims more self-contained. The full manuscript provides these details in dedicated sections: workflow selection criteria and economic-value justification are discussed in the introduction and benchmark construction sections; success metrics are defined as end-to-end task completion; evaluations are conducted fully autonomously with no human assistance; and stage omission scoring is detailed in the evaluation protocol. To strengthen the abstract, we will add concise statements on these points, for example by noting the selection of workflows representing economically valuable professional tasks in specialized software and confirming fully autonomous evaluation. This revision will be made in the next version of the manuscript.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

The paper introduces Workflow-GYM as an empirical benchmark for GUI agents on professional workflows and reports observed success rates (e.g., strongest models ~30%). It contains no equations, fitted parameters, predictions derived from first principles, or load-bearing self-citations that reduce the central claim to its own inputs. The reported findings are direct experimental measurements on the introduced tasks; no step equates a claimed result to a renamed input or self-referential definition. This is a standard non-circular empirical evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark without new mathematical derivations, fitted parameters, or postulated entities; it relies on standard assumptions about task representativeness and evaluation validity.

pith-pipeline@v0.9.1-grok · 5953 in / 974 out tokens · 15909 ms · 2026-06-27T13:26:15.348362+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PhoneBuddy: Training Open Models for Agentic Phone Use
cs.CL 2026-06 unverdicted novelty 6.0

PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.

Reference graph

Works this paper leans on

23 extracted references · 7 linked inside Pith · cited by 1 Pith paper

[1]

Scuba: Salesforce computer use benchmark

Yutong Dai, Krithika Ramakrishnan, Jing Gu, Matthew Fernandez, Yanqi Luo, Viraj Prabhu, Zhenyu Hu, Silvio Savarese, Caiming Xiong, Zeyuan Chen, et al. Scuba: Salesforce computer use benchmark. arXiv preprint arXiv:2509.26506, 2025

arXiv 2025
[2]

Mobile-bench: An evaluation benchmark for llm-based mobile agents

Shihan Deng, Weikai Xu, Hongda Sun, Wei Liu, Tao Tan, Liujianfeng Liujianfeng, Ang Li, Jian Luan, Bin Wang, Rui Yan, et al. Mobile-bench: An evaluation benchmark for llm-based mobile agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8813–8831, 2024

2024
[3]

Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents

Jingzhe Ding, Shengda Long, Changxin Pu, Huan Zhou, Hongwan Gao, Xiang Gao, Chao He, Yue Hou, Fei Hu, Zhaojian Li, et al. Nl2repo-bench: Towards long-horizon repository generation evaluation of coding agents. arXiv preprint arXiv:2512.12730, 2025

arXiv 2025
[4]

Gemini 3.1 Pro Model Card, 4 2026

Google Cloud. Gemini 3.1 Pro Model Card, 4 2026. URL https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-1-pro?hl=zh-cn

2026
[5]

Gemini 3 Flash Model Card, 4 2026

Google Cloud. Gemini 3 Flash Model Card, 4 2026. URL https://docs.cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/3-flash

2026
[6]

Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc

Haowei Liu, Xi Zhang, Haiyang Xu, Yuyang Wanyan, Junyang Wang, Ming Yan, Ji Zhang, Chunfeng Yuan, Changsheng Xu, Weiming Hu, et al. Pc-agent: A hierarchical multi-agent collaboration framework for complex task automation on pc. arXiv preprint arXiv:2502.14282, 2025

arXiv 2025
[7]

Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces

Mike A Merrill, Alexander G Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E Kelly Buchanan, et al. Terminal-bench: Benchmarking agents on hard, realistic tasks in command line interfaces. arXiv preprint arXiv:2601.11868, 2026

Pith/arXiv arXiv 2026
[8]

Kimi k2.6: Advancing open-source coding

Moonshot AI. Kimi k2.6: Advancing open-source coding. https://www.kimi.com/blog/kimi-k2-6, 2026. Accessed: 2026-06-02

2026
[9]

Gui-360: A comprehensive dataset and benchmark for computer-using agents

Jian Mu, Chaoyun Zhang, Chiming Ni, Lu Wang, Bo Qiao, Kartik Mathur, Qianhui Wu, Yuhang Xie, Xiaojun Ma, Mengyu Zhou, et al. Gui-360: A comprehensive dataset and benchmark for computer-using agents. 2025

2025
[10]

Gpt-5.4 thinking system card

OpenAI. Gpt-5.4 thinking system card. Technical report, OpenAI, 2026. URL https://deploymentsafety.openai.com/gpt-5-4-thinking/gpt-5-4-thinking.pdf. Accessed: 2026-04-27

2026
[11]

Gdpval: Evaluating ai model performance on real-world economically valuable tasks

Tejal Patwardhan, Rachel Dias, Elizabeth Proehl, Grace Kim, Michele Wang, Olivia Watkins, Simón Posada Fishman, Marwan Aljubeh, Phoebe Thacker, Laurance Fauconnet, et al. Gdpval: Evaluating ai model performance on real-world economically valuable tasks. arXiv preprint arXiv:2510.04374, 2025

Pith/arXiv arXiv 2025
[12]

Androidworld: A dynamic benchmarking environment for autonomous agents

Christopher Rawles, Sarah Clinckemaillie, Yifan Chang, Jonathan Waltz, Gabrielle Lau, Marybeth Fair, Alice Li, William Bishop, Wei Li, Folawiyo Campbell-Ajala, et al. Androidworld: A dynamic benchmarking environment for autonomous agents. arXiv preprint arXiv:2405.14573, 2024

Pith/arXiv arXiv 2024
[13]

Bytedance Seed. Seed1.8 model card: Towards generalized real-world agency. arXiv preprint arXiv:2603.20633, 2026

Pith/arXiv arXiv 2026
[14]

Bytedance Seed. Seed2.0 model card: Towards intelligence frontier for real-world complexity. URL https://lf3-static.bytednsdoc.com/obj/eden-cn/lapzild-tss/ljhwZthlaukjlkulzlp/seed2/0214/Seed2.0%20Model%20Card.pdf, 2026

2026
[15]

Gui knowledge bench: Revealing the knowledge gap behind vlm failures in gui tasks

Chenrui Shi, Zedong Yu, Zhi Gao, Ruining Feng, Enqi Liu, Yuwei Wu, Yunde Jia, Liuyu Xiang, Zhaofeng He, and Qing Li. Gui knowledge bench: Revealing the knowledge gap behind vlm failures in gui tasks. arXiv preprint arXiv:2510.26098, 2025

arXiv 2025
[16]

Ambibench: Benchmarking mobile gui agents beyond one-shot instructions in the wild

Jiazheng Sun, Mingxuan Li, Yingying Zhang, Jiayang Niu, Yachen Wu, Ruihan Jin, Shuyu Lei, Pengrongrui Tan, Zongyu Zhang, Ruoyi Wang, et al. Ambibench: Benchmarking mobile gui agents beyond one-shot instructions in the wild. arXiv preprint arXiv:2602.11750, 2026

arXiv 2026
[17]

Scienceboard: Evaluating multimodal autonomous agents in realistic scientific workflows

Qiushi Sun, Zhoumianze Liu, Chang Ma, Zichen Ding, Fangzhi Xu, Zhangyue Yin, Haiteng Zhao, Zhenyu Wu, Kanzhi Cheng, Zhaoyang Liu, et al. Scienceboard: Evaluating multimodal autonomous agents in realistic scientific workflows. arXiv preprint arXiv:2505.19897, 2025

Pith/arXiv arXiv 2025
[18]

Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning

Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025

Pith/arXiv arXiv 2025
[19]

Openhands: An open platform for ai software developers as generalist agents

Xingyao Wang, Boxuan Li, Yufan Song, Frank F Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, et al. Openhands: An open platform for ai software developers as generalist agents. arXiv preprint arXiv:2407.16741, 2024

Pith/arXiv arXiv 2024
[20]

Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments

Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments. Advances in Neural Information Processing Systems, 37:52040–52094, 2024

2024
[21]

Step-gui technical report, 2025

Haolong Yan, Jia Wang, Xin Huang, Yeqing Shen, Ziyang Meng, Zhimin Fan, Kaijun Tan, Jin Gao, Lieyu Shi, Mi Yang, Shiliang Yang, Zhirui Wang, Brian Li, Kang An, Chenyang Li, Lei Lei, Mengmeng Duan, Danxun Liang, Guodong Liu, Hang Cheng, Hao Wu, Jie Dong, Junhao Huang, Mei Chen, Renjie Yu, Shunshan Li, Xu Zhou, Yiting Dai, Yineng Deng, Yingdan Liang, Zelin ...

arXiv 2025
[22]

\onemillion-bench: How far are language agents from human experts?

Qianyu Yang, Yang Liu, Jiaqi Li, Jun Bai, Hao Chen, Kaiyuan Chen, Tiliang Duan, Jiayun Dong, Xiaobo Hu, Zixia Jia, et al. \onemillion-bench: How far are language agents from human experts? arXiv preprint arXiv:2603.07980, 2026

arXiv 2026
[23]

React: Synergizing reasoning and acting in language models

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

2022
