arxiv: 2509.07553 · v3 · submitted 2025-09-09 · 💻 cs.CL

VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents

Pith reviewed 2026-05-18 18:01 UTC · model grok-4.3

classification 💻 cs.CL
keywords OS agents · human-agent interaction · GUI automation · trustworthy AI · query-driven systems · multimodal agents · policy optimization · untrustworthy scenarios

The pith

VeriOS-Agent is an OS agent that decides when to query humans, so that GUI tasks are completed reliably under untrustworthy conditions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a query-driven framework for operating system agents that enables them to execute tasks autonomously under normal conditions but proactively seek human input when facing untrustworthy scenarios. This is achieved by training VeriOS-Agent with a three-stage learning paradigm that combines supervised fine-tuning and group relative policy optimization to separate and apply different types of meta-knowledge. The result is a measurable improvement in handling real-world uncertainties without degrading performance in straightforward cases. A sympathetic reader would care because everyday automation tools often risk errors in variable environments, and this method offers a practical way to balance independence with safety through targeted human involvement.

Core claim

VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Built on a query-driven human-agent-GUI interaction framework, the agent is trained via a three-stage paradigm that facilitates the decoupling and utilization of meta-knowledge through supervised fine-tuning followed by group relative policy optimization. Experiments demonstrate that this yields an average step-wise success rate improvement of 19.72% over the strongest baselines in untrustworthy scenarios, without compromising normal performance, while analysis confirms the agent's rationality, generalizability, and scalability.

What carries the argument

The query-driven human-agent-GUI interaction framework, enabled by a three-stage learning paradigm of supervised fine-tuning and group relative policy optimization that decouples meta-knowledge for deciding when to query humans.
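
The interaction loop described above can be sketched as follows. This is a minimal illustration, assuming a hypothetical policy that returns either a GUI action or a question for the human at each step. The names `Decision`, `run_episode`, `policy`, `ask_human`, and `execute` are illustrative and do not come from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Decision:
    """One agent output: either a GUI action or a question for the human."""
    kind: str      # "action" or "query" (hypothetical labels)
    content: str   # the action string, or the question text


def run_episode(
    policy: Callable[[str, str, Optional[str]], Decision],
    screens: List[str],
    instruction: str,
    ask_human: Callable[[str], str],
    execute: Callable[[str], None],
) -> List[str]:
    """Step through screens; involve the human only when the policy asks to."""
    log: List[str] = []
    for screen in screens:
        decision = policy(instruction, screen, None)
        if decision.kind == "query":
            # The policy judged this step untrustworthy: pause and ask,
            # then decide once more with the human's answer available.
            answer = ask_human(decision.content)
            log.append(f"query:{decision.content}")
            decision = policy(instruction, screen, answer)
        if decision.kind == "action":
            execute(decision.content)
            log.append(f"action:{decision.content}")
        # If the policy still asks after one answer, this sketch skips the step.
    return log
```

The sketch keeps the query decision inside the policy itself rather than in a separate classifier, which mirrors the summary's description of an agent that decides when to query; how the real agent encodes that decision is not specified here.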

If this is right

  • The agent achieves higher step-wise success rates specifically in untrustworthy scenarios compared to baselines.
  • Performance in trustworthy scenarios remains comparable to existing agents.
  • The training process produces agents with demonstrated rationality in deciding when to involve humans.
  • The method supports generalizability across different untrustworthy conditions and scalability across model sizes (Figure 5 reports 7B and 72B parameter scales).

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar query mechanisms could apply to other GUI-based agents in mobile or web environments.
  • Separating meta-knowledge types may prove useful for building safety checks in broader autonomous systems.
  • Further tests with varied human response times could reveal practical limits on real-time querying.

Load-bearing premise

The three-stage learning paradigm successfully decouples and utilizes meta-knowledge to enable accurate decisions on when to query humans.
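
For orientation, group relative policy optimization is usually described as scoring each sampled response against the other responses drawn for the same prompt. The sketch below shows only that normalization step, assuming the common mean-and-standard-deviation form. The use of the population standard deviation is an assumption (implementations differ), the reward values are placeholders, and the paper's actual reward design and stage structure are not given in this summary.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group (an assumption)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

With this normalization, a group in which every response earns the same reward produces zero advantages, so such a group contributes no learning signal.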

What would settle it

A controlled test in which VeriOS-Agent shows no improvement in step-wise success rates during untrustworthy GUI scenarios or begins querying humans unnecessarily in normal conditions would falsify the central claim.
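
The test above reduces to two measurements: step-wise success rate per scenario type, and how often the agent queries in normal conditions. A minimal sketch, assuming each evaluated step is recorded with a scenario label, whether the agent queried, and whether the step was judged correct. The `Step` record and both function names are hypothetical and are not the paper's evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    scenario: str   # "normal" or "untrustworthy" (hypothetical labels)
    queried: bool   # did the agent ask the human at this step?
    correct: bool   # was the step judged successful?


def stepwise_success_rate(steps: List[Step], scenario: str) -> float:
    """Fraction of steps in the given scenario that were judged correct."""
    subset = [s for s in steps if s.scenario == scenario]
    if not subset:
        raise ValueError(f"no steps for scenario {scenario!r}")
    return sum(s.correct for s in subset) / len(subset)


def unnecessary_query_rate(steps: List[Step]) -> float:
    """Fraction of normal-scenario steps where the agent queried anyway."""
    normal = [s for s in steps if s.scenario == "normal"]
    if not normal:
        raise ValueError("no normal-scenario steps")
    return sum(s.queried for s in normal) / len(normal)
```

Comparing `stepwise_success_rate(steps, "untrustworthy")` between the agent and a baseline, alongside `unnecessary_query_rate(steps)` for the agent, covers both halves of the falsification condition.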

Figures

Figures reproduced from arXiv: 2509.07553 by Heyuan Huang, Jun Wang, Pengzhou Cheng, Weinan Zhang, Weiwen Liu, Xiangmou Qu, Xingyu Lou, Zhaoxiang Wang, Zheng Wu, Zhuosheng Zhang, Zongru Wu.

Figure 1: (A) Interaction paradigm among the OS agent, human, and GUI. Existing work mainly focuses on autonomous OS agents and ...
Figure 2: Distribution of scenarios and platforms in VeriOS-Bench.
Figure 3: Pilot study on the scenario judgment accuracy of normal MLLM-based OS agents. Existing MLLM-based OS agents perform ...
Figure 4: Diagram of the two-stage learning paradigm and query-driven human-agent-GUI interaction. The two-stage learning paradigm ...
Figure 5: OOD experiment with 7B and 72B model parameter scales. Experimental results demonstrate that VeriOS-Agent exhibits ...
Original abstract

With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a three-stage learning paradigm that falicitate the decoupling and utilization of meta-knowledge by supervised fine-tuning and group relative policy optimization. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 19.72% in over the strongest baselines, without compromising normal performance. VeriOS-Agent significantly improves performance in untrustworthy scenarios while maintaining comparable performance in trustworthy scenarios. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper proposes VeriOS-Agent, a query-driven proactive human-agent-GUI interaction framework for trustworthy OS agents. Built on a three-stage learning paradigm (supervised fine-tuning followed by group relative policy optimization) that aims to decouple and utilize meta-knowledge, the agent executes actions autonomously under normal conditions but proactively queries humans in untrustworthy scenarios. The central empirical result is a 19.72% improvement in average step-wise success rate over the strongest baselines in untrustworthy scenarios, with no compromise to normal-scenario performance. The work also reports analysis of rationality, generalizability, and scalability, and releases code, datasets, and models.

Significance. If the reported performance gains are robustly supported, the work would make a meaningful contribution to reliable GUI-based OS agents by addressing over-execution risks in real-world untrustworthy conditions. The open release of code, datasets, and models at the provided GitHub repository is a clear strength that supports reproducibility and follow-on research.

major comments (2)
  1. [Experiments] Experiments section: The central claim of a 19.72% improvement in average step-wise success rate (and comparable normal performance) is presented without details on experimental setup. No information is given on how untrustworthy scenarios were defined or sampled, which baselines were used, the number of trials or runs, error bars, or statistical significance testing. This directly undermines evaluation of the load-bearing empirical result.
  2. [Method] Method section: The three-stage learning paradigm is described at a high level as enabling decoupling of meta-knowledge via supervised fine-tuning and group relative policy optimization, but no equations, algorithm pseudocode, loss formulations, or hyperparameter details are provided. This makes it impossible to verify how the paradigm produces the claimed proactive querying behavior.
minor comments (3)
  1. [Abstract] Abstract: Typo 'falicitate' should be 'facilitates'.
  2. [Abstract] Abstract: The phrasing 'by 19.72% in over the strongest baselines' contains a grammatical error and should read 'by 19.72% over the strongest baselines'.
  3. [Abstract] Abstract: The two sentences reporting results ('Experiments show that...' and 'VeriOS-Agent significantly improves...') are largely redundant; the second repeats the performance claim already stated.
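
Major comment 1 asks for error bars on the reported success rates. One common way to obtain them is a percentile bootstrap over per-step outcomes. The sketch below assumes binary per-step correctness for two systems, treated as independent samples. The function name, resample count, and seed are illustrative choices, and nothing here reflects the authors' actual protocol.

```python
import random
from typing import List, Sequence, Tuple


def bootstrap_diff_ci(
    ours: Sequence[int],
    baseline: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap CI for the difference in mean success rate."""
    rng = random.Random(seed)
    diffs: List[float] = []
    for _ in range(n_resamples):
        # Resample each system's step outcomes with replacement.
        a = rng.choices(ours, k=len(ours))
        b = rng.choices(baseline, k=len(baseline))
        diffs.append(sum(a) / len(a) - sum(b) / len(b))
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

An interval that excludes zero would support a real difference in step-wise success rate. If both systems are scored on the same steps, a paired resampling scheme would be the more appropriate variant.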

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to enhance the clarity and completeness of the experimental and methodological sections.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: The central claim of a 19.72% improvement in average step-wise success rate (and comparable normal performance) is presented without details on experimental setup. No information is given on how untrustworthy scenarios were defined or sampled, which baselines were used, the number of trials or runs, error bars, or statistical significance testing. This directly undermines evaluation of the load-bearing empirical result.

    Authors: We thank the referee for highlighting this issue. Upon review, we recognize that the experimental setup details could be presented more comprehensively. In the revised version of the manuscript, we will add explicit information on the definition and sampling of untrustworthy scenarios, the complete list of baselines used, the number of trials and runs performed, error bars on the reported metrics, and the results of statistical significance testing.
    revision: yes

  2. Referee: [Method] Method section: The three-stage learning paradigm is described at a high level as enabling decoupling of meta-knowledge via supervised fine-tuning and group relative policy optimization, but no equations, algorithm pseudocode, loss formulations, or hyperparameter details are provided. This makes it impossible to verify how the paradigm produces the claimed proactive querying behavior.

    Authors: We agree with the referee that additional details on the three-stage learning paradigm would aid in understanding and reproducibility. We will revise the Method section to include the relevant equations, algorithm pseudocode, loss formulations, and hyperparameter details for the supervised fine-tuning and group relative policy optimization stages.
    revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical systems contribution describing a query-driven human-agent-GUI framework and a three-stage training process (supervised fine-tuning followed by group relative policy optimization) for VeriOS-Agent. All performance claims, including the 19.72% step-wise success rate improvement in untrustworthy scenarios, are presented as direct experimental measurements on benchmarks rather than quantities derived from equations or fitted parameters within the paper. No mathematical derivations, uniqueness theorems, or ansatzes appear; the method description remains at the level of a standard training pipeline without reducing to self-definition or self-citation chains. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the three-stage training successfully teaches the agent to distinguish trustworthy from untrustworthy scenarios; no explicit free parameters, axioms, or invented entities are named in the abstract.

pith-pipeline@v0.9.0 · 5784 in / 999 out tokens · 26883 ms · 2026-05-18T18:01:58.433001+00:00 · methodology


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.

  2. Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

    cs.AI 2026-02 unverdicted novelty 7.0

    The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.
