Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields
Pith reviewed 2026-06-27 13:26 UTC · model grok-4.3
The pith
Even the strongest current AI agents succeed on only slightly more than 30 percent of long-horizon professional GUI workflows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Workflow-GYM is a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. In extensive experiments on state-of-the-art models, even the strongest achieve success rates only slightly above 30 percent. Further analysis shows that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. The findings expose limitations of current agent systems and suggest directions for next-generation GUI-agent research.
What carries the argument
The Workflow-GYM benchmark, which supplies professional-domain workflows in specialized software to measure end-to-end autonomous completion by agents.
If this is right
- Professional long-horizon GUI workflows remain highly challenging for current GUI agents.
- Agents exhibit workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software.
- Maintaining long-horizon workflow consistency is a core limitation that must be addressed.
- The benchmark supplies concrete directions for developing next-generation GUI agents capable of economically valuable tasks.
Where Pith is reading between the lines
- Widespread adoption of Workflow-GYM could shift evaluation focus from short toy tasks toward sequences that better predict usefulness in actual professional settings.
- The observed failure modes suggest that progress may require new training signals that explicitly reward stage completion and objective adherence across dozens of steps.
- Extending the benchmark to additional specialized domains would test whether the 30-percent ceiling is domain-specific or general.
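The second point above, a training signal that rewards stage completion and objective adherence, can be made concrete with a small sketch. This is an illustration only and not a method from the paper: the function name `workflow_reward`, the weights, and the stage names are all hypothetical, and the in-order matching rule is one assumed way to penalize stage omission.

```python
def workflow_reward(required_stages, completed_stages, objective_met,
                    stage_weight=0.5, objective_weight=0.5):
    """Hypothetical shaped reward in [0, 1].

    Partial credit for required stages completed in order, plus a bonus
    when the final objective is met. The weights are assumptions.
    """
    # Walk the agent's trace and advance only when the next required
    # stage appears. A skipped stage blocks credit for later stages.
    position = 0
    for stage in completed_stages:
        if position < len(required_stages) and stage == required_stages[position]:
            position += 1
    stage_fraction = position / len(required_stages) if required_stages else 0.0
    objective_bonus = 1.0 if objective_met else 0.0
    return stage_weight * stage_fraction + objective_weight * objective_bonus


# Hypothetical four-stage workflow; the stage names are illustrative.
required = ["import", "clean", "analyze", "export"]
omitted = workflow_reward(required, ["import", "analyze", "export"], False)
complete = workflow_reward(required, required, True)
```

Under this assumed rule, a trace that omits the second stage earns credit only for the first, so omission is penalized more heavily than a simple count of completed stages would suggest.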
Load-bearing premise
The chosen professional workflows and success criteria accurately reflect the complexity and economic value of real work without hidden human help or artificial simplifications.
What would settle it
A new agent model reaching success rates above 60 percent on the same Workflow-GYM tasks under the identical evaluation protocol would indicate that the reported challenges are not fundamental to current approaches.
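A minimal sketch of how such a threshold check might be run, assuming each task yields a binary end-to-end outcome. The helper names, the 95% Wilson score interval, and the rule that the interval's lower bound must exceed the threshold are assumptions for illustration; the paper's own protocol may differ, and the outcomes below are placeholders rather than reported results.

```python
import math


def success_rate(outcomes):
    """Fraction of tasks completed end to end (True means success)."""
    if not outcomes:
        raise ValueError("need at least one task outcome")
    return sum(outcomes) / len(outcomes)


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return centre - half, centre + half


def clears_threshold(outcomes, threshold=0.60):
    """True only if the lower bound of the interval exceeds the threshold."""
    low, _ = wilson_interval(sum(outcomes), len(outcomes))
    return low > threshold


# Placeholder outcomes for 100 hypothetical tasks, not data from the paper.
baseline = [True] * 31 + [False] * 69
candidate = [True] * 80 + [False] * 20
print(success_rate(baseline), clears_threshold(baseline))
print(success_rate(candidate), clears_threshold(candidate))
```

Requiring the lower bound to clear the threshold, rather than the point estimate, guards against declaring success on a small task set where a 60 percent rate could arise by chance.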
Original abstract
Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Workflow-GYM, a benchmark for long-horizon GUI agent tasks in professional domains using specialized software. Experiments on state-of-the-art models show that even the strongest reach success rates only slightly above 30%, and the analysis attributes failures to poor workflow consistency, including stage omission, error propagation, objective drift, and insufficient domain understanding. The work presents this as evidence that professional long-horizon GUI workflows remain highly challenging for current agents.
Significance. If the selected workflows prove representative of economically valuable professional tasks and the evaluation protocol measures fully autonomous end-to-end performance, the benchmark would fill a notable gap in GUI agent evaluation and provide concrete evidence of current limitations, guiding future research on long-horizon consistency and domain-specific software understanding.
major comments (1)
- [Abstract] The central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the concern about the abstract below and agree to make revisions to improve clarity and verifiability of our central claims.
Point-by-point responses
- Referee: [Abstract] The central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.
  Authors: We agree that the abstract would benefit from additional details to make the key claims more self-contained. The full manuscript provides these details in dedicated sections: workflow selection criteria and economic-value justification are discussed in the introduction and benchmark construction sections; success metrics are defined as end-to-end task completion; evaluations are conducted fully autonomously with no human assistance; and stage omission scoring is detailed in the evaluation protocol. To strengthen the abstract, we will add concise statements on these points, for example noting the selection of workflows representing economically valuable professional tasks in specialized software and confirming fully autonomous evaluation. This revision will be made in the next version of the manuscript.
  Revision: yes
Circularity Check
No circularity: empirical benchmark with no derivation chain
The paper introduces Workflow-GYM as an empirical benchmark for GUI agents on professional workflows and reports observed success rates (e.g., the strongest models reach only slightly above 30%). It contains no equations, fitted parameters, predictions derived from first principles, or load-bearing self-citations that reduce the central claim to its own inputs. The reported findings are direct experimental measurements on the introduced tasks; no step equates a claimed result to a renamed input or self-referential definition. This is a standard non-circular empirical evaluation paper.
Forward citations
Cited by 1 Pith paper
- PhoneBuddy: Training Open Models for Agentic Phone Use
  PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.