
arxiv: 2605.25160 · v2 · pith:TCICKQUYnew · submitted 2026-05-24 · 💻 cs.AI

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

Pith reviewed 2026-06-30 10:54 UTC · model grok-4.3

classification 💻 cs.AI
keywords GUI agents · synthetic environments · verifiable rewards · mobile GUI benchmark · large-scale synthesis · web-based interfaces · agent evaluation · cross-platform GUI

The pith

ScaleWoB generates high-fidelity GUI environments as backend-free webpages with verifiable rewards for scalable agent evaluation across platforms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a synthesis framework that converts GUI application designs into interactive web pages accessible by URL. These pages include built-in reward signals and support reset and state control without backend servers or virtual machines. The approach yields over 100 environments and 1000 tasks spanning mobile, desktop, and automotive interfaces. A released subset forms a benchmark of 120 tasks on 63 simulated mobile apps where five current agents average 27.92 percent success, falling to 17.82 percent on long-horizon items, while humans reach 92.08 percent. Based on a comparison against real-world sample tasks, the authors report that assessments made in the synthetic settings generalize to real applications.

Core claim

ScaleWoB produces 100+ synthesized interactive environments and 1000+ verifiable tasks as backend-free webpages accessible via URL, including a public benchmark of 120 challenging tasks across 63 simulated mobile applications, on which state-of-the-art mobile GUI agents achieve an average success rate of only 27.92 percent (dropping to 17.82 percent on the long-horizon subset) while humans reach 92.08 percent, with the synthetic assessments generalizing to real apps.

What carries the argument

A synthesis pipeline that converts GUI specifications into backend-free interactive webpages equipped with verifiable reward functions and state reset capabilities.
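
As a rough illustration of what "verifiable reward plus state reset" can mean when there is no backend, the sketch below models an environment whose entire state lives in local memory. Every name in it (`WalletEnv`, `top_up`, `reward`) is hypothetical and not taken from the paper; the wallet task is only loosely inspired by the top-up example in Figure 4.

```python
import copy


class WalletEnv:
    """Hypothetical in-memory environment: no backend, all state is local."""

    def __init__(self, initial_state=None):
        # Assumed state layout, chosen only for this illustration.
        self._initial_state = initial_state or {"balance": 0, "screen": "home"}
        self.state = {}
        self.reset()

    def reset(self):
        # Restore a deep copy so episodes never share mutable state.
        self.state = copy.deepcopy(self._initial_state)
        return self.state

    def top_up(self, amount):
        # A UI action only mutates local state.
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.state["balance"] += amount
        self.state["screen"] = "wallet"

    def reward(self, target_balance):
        # Verifiable reward: a deterministic check on the final state.
        return 1.0 if self.state["balance"] == target_balance else 0.0


env = WalletEnv()
env.top_up(100)
print(env.reward(target_balance=100))  # 1.0
env.reset()
print(env.reward(target_balance=100))  # 0.0
```

Because the reward is a pure function of the final state, the same episode can be replayed and re-scored without any external service, which is the property the summary above attributes to the synthesized pages.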

If this is right

  • GUI agent training and evaluation can proceed at large scale with near-zero setup cost and without dependence on device emulators or cloud instances.
  • Reproducible, resettable tasks become available for long-horizon mobile, desktop, and in-vehicle scenarios using a single pipeline.
  • New benchmarks can be generated and shared simply by publishing URLs rather than distributing virtual-machine images.
  • The gap between current agent performance and human performance on long-horizon tasks can be quantified under controlled conditions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same synthesis method could support iterative training loops in which coding agents generate or refine environment specifications for GUI agents.
  • The low-resource web format opens the possibility of running large-scale agent experiments on consumer hardware or in browser-based sandboxes.
  • Similar synthesis pipelines might be applied to other interface domains such as web browsers or game UIs to create comparable verifiable benchmarks.

Load-bearing premise

The synthesized web pages replicate the visual layout, interaction dynamics, and reward outcomes of real GUI applications closely enough that agent success rates and rankings transfer to actual apps.

What would settle it

Measure the same set of agents on both the synthetic mobile environments and the corresponding real mobile applications and observe whether success rates and relative rankings remain consistent.
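
One way to run such a check is a rank correlation between per-agent success rates in the two settings. The sketch below computes Spearman's rho from scratch; the agent names and success rates are placeholder values invented for illustration and are not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman's rho is the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Placeholder success rates in percent, NOT taken from the paper.
synthetic = {"agent_a": 40.0, "agent_b": 31.0, "agent_c": 25.0,
             "agent_d": 22.0, "agent_e": 18.0}
real = {"agent_a": 45.0, "agent_b": 30.0, "agent_c": 33.0,
        "agent_d": 20.0, "agent_e": 15.0}

agents = sorted(synthetic)
rho = spearman([synthetic[a] for a in agents], [real[a] for a in agents])
print(f"Spearman rho = {rho:.2f}")  # Spearman rho = 0.90
```

A rho close to 1 would suggest that agent rankings transfer between settings. With only five agents the estimate is coarse, so per-task agreement between synthetic and real outcomes would be a sensible complement.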

Figures

Figures reproduced from arXiv: 2605.25160 by Guohong Liu, Jialei Ye, Jian Luan, Pengzhi Gao, Wei Liu, Yuanchun Li, Yunxin Liu.

Figure 1. An overview of SimuWoB. After collecting real-world tasks along with related applications, …
Figure 2. Two-stage environment synthesis pipeline of SimuWoB.
Figure 3. Automatic issue inspection and correction workflow.
Figure 4. Example task in SimuWoB. The agent is asked to top up the wallet by 100 euros in a …
Figure 5. Experimental results of different agents on SimuWoB. For local models, we evaluate only …
Figure 6. Success rate of different task categories across evaluated agents in SimuWoB.
Figure 7. Case study of a long-horizon failure: the agent executes UI operations correctly but does not persist key information in context, leading to an incorrect final answer. Mobile GUI agents fall short in long-horizon tasks. Results in …
Figure 8. Case study of a vague-description failure: the agent fails to locate the task entry point due to a lack of proactive exploration capabilities following initial failures. Tasks with vague descriptions or inconspicuous functional entry points can confuse the agent. Our analysis shows that agent performance degrades when instructions are underspecified and the true entry point is visually inconspicuous (e.g., …
Figure 9. Fine-grained control results. Agents perform poorly on tasks that require fine-grained control. Fine-grained control is a common requirement in real-world mobile tasks, including dragging a slider to a target position, setting date/time values with pickers, confirming payment via drag gestures, and invoking context menus through long presses. Compared with standard click-and-type tasks, these operations …
Original abstract

GUI agents powered by large language models are advancing rapidly, creating an urgent need for evaluation and training in realistic environments. However, doing so directly in real-world environments introduces challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or Docker images demands substantial resources and suffers from slow response speeds, which limits efficiency. We present ScaleWoB, a framework that can produce high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms including mobile, desktop, and automotive/in-vehicle interfaces based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experimental results on five state-of-the-art mobile GUI agents reveal substantial headroom: the average success rate is only 27.92%, dropping to 17.82% on the long-horizon subset, while humans reach 92.08%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces ScaleWoB, a framework that leverages coding agents to synthesize high-fidelity, backend-free webpage environments for GUI agents across mobile, desktop, and automotive platforms. These environments provide verifiable rewards, require near-zero setup, and scale to 100+ environments and 1000+ tasks; the authors release a benchmark of 120 tasks across 63 simulated mobile apps. Experiments on five state-of-the-art mobile GUI agents report average success rates of 27.92% (17.82% on long-horizon tasks) versus 92.08% for humans, and a comparison on real-world sample tasks is presented to argue that synthetic assessments generalize to real apps.

Significance. If the fidelity and transfer claims hold, the work offers a practical, low-resource alternative to VMs or real-device testing for large-scale GUI agent evaluation and training. The release of a fully synthesized mobile benchmark and the empirical demonstration of substantial headroom in current agents are concrete contributions. The coding-agent synthesis pipeline is a notable strength for reproducibility and scalability.

major comments (1)
  1. [Abstract] Abstract (generalization claim): the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the generalization claim. We agree that quantitative support is needed to strengthen the assertion and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract (generalization claim): the assertion that 'assessments made in our synthetic environments generalize to real apps' is load-bearing for the central contribution, yet the manuscript provides no quantitative fidelity metrics (e.g., action-equivalence rates, visual similarity scores, or statistical correlation between synthetic and real success rates) to substantiate transfer; without these, the reported agent success rates cannot be confidently interpreted as evidence of real-world headroom.

    Authors: We acknowledge the validity of this observation. The current manuscript supports the generalization claim via a qualitative comparison on real-world sample tasks (detailed in the experiments section), which shows consistent agent behavior patterns. However, to make the claim more rigorous and address the lack of quantitative metrics, we will add action-equivalence rates, visual similarity scores, and statistical correlations between synthetic and real success rates in the revised version. These additions will be incorporated into the relevant experimental analysis and referenced in the abstract. revision: yes

Circularity Check

0 steps flagged

No circularity; purely empirical synthesis and evaluation framework

Full rationale

The paper describes a pipeline for synthesizing backend-free webpage environments from coding agents, then reports measured success rates of GUI agents on 120 tasks and a separate real-app comparison. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described claims. The central assertions rest on experimental outcomes (agent success rates, human baselines, generalization checks) rather than any reduction of outputs to inputs by construction. This is the expected non-finding for an applied systems paper whose load-bearing content is the synthesis method and the measured transfer gap.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that web-based simulations can faithfully replicate GUI interactions and reward structures; no free parameters or invented entities are identifiable from the abstract.

axioms (1)
  • domain assumption: Web-based simulations can provide high-fidelity replicas of GUI interactions and reward structures of real apps.
    This underpins the claims of high-fidelity synthesis, verifiable rewards, and generalization to real apps.

pith-pipeline@v0.9.1-grok · 5839 in / 1525 out tokens · 68886 ms · 2026-06-30T10:54:37.023539+00:00 · methodology

