arxiv: 2606.11042 · v1 · pith:4ITB7L5Bnew · submitted 2026-06-09 · 💻 cs.AI

Workflow-GYM: Towards Long-Horizon Evaluation of Computer-use Agentic tasks in Real-World Professional Fields

Pith reviewed 2026-06-27 13:26 UTC · model grok-4.3

classification 💻 cs.AI
keywords: Workflow-GYM, GUI agents, long-horizon tasks, professional workflows, computer-use agents, benchmark, agent evaluation, specialized software

The pith

Current AI agents succeed on only slightly more than 30 percent of long-horizon professional GUI workflows.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Workflow-GYM, a benchmark that tests whether agents can follow instructions to complete extended sequences of actions inside domain-specific professional software. Experiments with leading models produce success rates just above 30 percent, showing that agents commonly lose track of multi-step goals. Failures include skipping required stages, letting early mistakes compound, drifting from the original objective, and misunderstanding how specialized tools work. These outcomes matter because the tasks mirror economically valuable work that requires sustained, autonomous operation rather than short or generic interactions.

Core claim

Workflow-GYM is a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, even the strongest achieve only slightly above 30 percent success rates. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. The findings provide insights into limitations of current agent systems and suggest directions for next-generation GUI-agent research.

What carries the argument

The Workflow-GYM benchmark, which supplies professional-domain workflows in specialized software to measure end-to-end autonomous completion by agents.

If this is right

  • Professional long-horizon GUI workflows remain highly challenging for current GUI agents.
  • Agents exhibit workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software.
  • Maintaining long-horizon workflow consistency is a core limitation that must be addressed.
  • The benchmark supplies concrete directions for developing next-generation GUI agents capable of economically valuable tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Widespread adoption of Workflow-GYM could shift evaluation focus from short toy tasks toward sequences that better predict usefulness in actual professional settings.
  • The observed failure modes suggest that progress may require new training signals that explicitly reward stage completion and objective adherence across dozens of steps.
  • Extending the benchmark to additional specialized domains would test whether the 30-percent ceiling is domain-specific or general.

Load-bearing premise

The chosen professional workflows and success criteria accurately reflect the complexity and economic value of real work without hidden human help or artificial simplifications.

What would settle it

A new agent model reaching success rates above 60 percent on the same Workflow-GYM tasks under the identical evaluation protocol would indicate that the reported challenges are not fundamental to current approaches.
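
Whether a measured rate really clears a threshold such as 60 percent depends on how many tasks were run. The sketch below is one way to check this, not the paper's protocol: it uses a Wilson score interval and treats the threshold as cleared only when the lower bound exceeds it. The function names, the default threshold, and the 95 percent confidence level are illustrative assumptions, and no task counts from the paper are used.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate.

    z = 1.96 corresponds to roughly 95 percent confidence (an assumed default).
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


def clearly_exceeds(successes: int, trials: int, threshold: float = 0.60) -> bool:
    """True only if the interval's lower bound is above the threshold.

    The 0.60 default mirrors the threshold named in the text; the decision
    rule itself is a hypothetical choice for illustration.
    """
    low, _ = wilson_interval(successes, trials)
    return low > threshold
```

Under this rule a nominal rate slightly above 60 percent on a small task set would not count as settling the question, since the lower bound would still fall below the threshold.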

Original abstract

Recent years have witnessed the rapid evolution of AI agents toward handling increasingly complex, real-world tasks. However, existing benchmarks rarely evaluate whether agents can operate graphical user interfaces to complete long-horizon, high-value professional workflows across diverse domains. Current GUI benchmarks still predominantly focus on general-purpose software, relatively simple applications, and short-horizon tasks, leaving it largely unknown whether modern agents can follow user instructions to autonomously operate domain-specific professional software and accomplish economically valuable work in an end-to-end manner. To bridge this gap, we introduce Workflow-GYM, a benchmark for long-horizon GUI tasks centered on professional domains and specialized software environments. Through extensive experiments on state-of-the-art models, we find that even the strongest models achieve only slightly above 30% success rates, highlighting that professional long-horizon GUI workflows remain highly challenging for current GUI agents. Further analysis reveals that current agents struggle to maintain long-horizon workflow consistency, frequently exhibiting workflow stage omission, error propagation, objective drift, and insufficient understanding of professional software environments. Our findings provide important insights into the limitations of current agent systems and suggest key directions for the next generation of GUI-agent research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces Workflow-GYM, a benchmark for long-horizon GUI agent tasks in professional domains using specialized software. Experiments on state-of-the-art models report success rates slightly above 30%, with analysis showing struggles in workflow consistency including stage omission, error propagation, objective drift, and insufficient domain understanding. The work positions this as evidence that professional long-horizon GUI workflows remain highly challenging for current agents.

Significance. If the selected workflows prove representative of economically valuable professional tasks and the evaluation protocol measures fully autonomous end-to-end performance, the benchmark would fill a notable gap in GUI agent evaluation and provide concrete evidence of current limitations, guiding future research on long-horizon consistency and domain-specific software understanding.

major comments (1)
  1. [Abstract] Abstract: the central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback. We address the concern about the abstract below and agree to make revisions to improve clarity and verifiability of our central claims.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that even the strongest models achieve only slightly above 30% success (and thus that professional workflows remain highly challenging) is load-bearing for the paper's contribution, yet the abstract supplies no details on workflow selection criteria, economic-value justification, success metrics, controls for human assistance, or how stage omissions are scored. This prevents verification that the reported rates reflect autonomous agent performance rather than benchmark construction.

    Authors: We agree that the abstract would benefit from additional details to make the key claims more self-contained. The full manuscript provides these details in dedicated sections: workflow selection criteria and economic-value justification are discussed in the introduction and benchmark construction sections; success metrics are defined as end-to-end task completion; evaluations are conducted fully autonomously with no human assistance; and stage omission scoring is detailed in the evaluation protocol. To strengthen the abstract, we will add concise statements on these points, for example noting the selection of workflows representing economically valuable professional tasks in specialized software and confirming fully autonomous evaluation. This revision will be made in the next version of the manuscript.

    Revision: yes
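
To make the scoring question concrete, the sketch below shows one way end-to-end completion and stage omission could be recorded side by side. It is a hypothetical illustration only: the `EpisodeResult` type, its fields, and the rule that a run succeeds only when the final check passes and no required stage is missing are assumptions, not the scoring defined in the manuscript's evaluation protocol.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeResult:
    """One agent run on one workflow (hypothetical record layout)."""
    required_stages: tuple[str, ...]
    completed_stages: frozenset[str]
    final_check_passed: bool


def omitted_stages(result: EpisodeResult) -> list[str]:
    """Required stages the agent never completed, in workflow order."""
    return [s for s in result.required_stages if s not in result.completed_stages]


def end_to_end_success(result: EpisodeResult) -> bool:
    """Assumed rule: the final check passes and no required stage was skipped."""
    return result.final_check_passed and not omitted_stages(result)


def success_rate(results: list[EpisodeResult]) -> float:
    """Fraction of runs that count as end-to-end successes."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(end_to_end_success(r) for r in results) / len(results)
```

Keeping the omitted stages alongside the binary outcome lets a reader see whether a failure came from a skipped stage or from a wrong final state, which is the distinction the referee asks about.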

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivation chain

Full rationale

The paper introduces Workflow-GYM as an empirical benchmark for GUI agents on professional workflows and reports observed success rates (e.g., strongest models ~30%). It contains no equations, fitted parameters, predictions derived from first principles, or load-bearing self-citations that reduce the central claim to its own inputs. The reported findings are direct experimental measurements on the introduced tasks; no step equates a claimed result to a renamed input or self-referential definition. This is a standard non-circular empirical evaluation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces an empirical benchmark without new mathematical derivations, fitted parameters, or postulated entities; it relies on standard assumptions about task representativeness and evaluation validity.

pith-pipeline@v0.9.1-grok · 5953 in / 974 out tokens · 15909 ms · 2026-06-27T13:26:15.348362+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PhoneBuddy: Training Open Models for Agentic Phone Use

    cs.CL 2026-06 unverdicted novelty 6.0

    PhoneBuddy combines real-app and mock-app RL after shared SFT, raising real-phone task success from 36.67% to 45.33% and AndroidWorld from 60.3% to 83.2%.
