VeriOS: Query-Driven Proactive Human-Agent-GUI Interaction for Trustworthy OS Agents
Pith reviewed 2026-05-18 18:01 UTC · model grok-4.3
The pith
VeriOS-Agent is an OS agent that decides when to query a human, so that GUI tasks complete reliably under untrustworthy conditions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Built on a query-driven human-agent-GUI interaction framework, the agent is trained via a three-stage paradigm that facilitates the decoupling and utilization of meta-knowledge through supervised fine-tuning followed by group relative policy optimization. Experiments demonstrate that this yields an average step-wise success rate improvement of 19.72% over the strongest baselines in untrustworthy scenarios, without compromising normal performance, while analysis confirms the agent's rationality, generalizability, and scalability.
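The act-or-query behavior described above can be illustrated with a minimal sketch. This is not the paper's implementation: `Decision`, `run_step`, `toy_policy`, and the string-based screen representation are all hypothetical names chosen for illustration, and the rule of deferring after a second consecutive query is an assumption.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Decision:
    kind: str     # "act" or "query" (hypothetical labels)
    payload: str  # GUI action to execute, or the question for the human


def run_step(policy: Callable[[str, str], Decision],
             ask_human: Callable[[str], str],
             execute: Callable[[str], None],
             instruction: str,
             screen: str) -> str:
    """One interaction step: act directly, or query the human first."""
    decision = policy(instruction, screen)
    if decision.kind == "query":
        answer = ask_human(decision.payload)
        # Re-plan with the human's answer appended to the instruction.
        decision = policy(f"{instruction}\nHuman answer: {answer}", screen)
        if decision.kind == "query":
            # Assumption: never loop on queries; hand control back instead.
            return "deferred"
    execute(decision.payload)
    return "executed"


def toy_policy(instruction: str, screen: str) -> Decision:
    """Stand-in for a trained model: queries once when a pop-up is visible."""
    if "popup" in screen and "Human answer" not in instruction:
        return Decision("query", "An unexpected pop-up appeared. Dismiss it?")
    return Decision("act", "click(close)" if "popup" in screen else "click(ok)")


if __name__ == "__main__":
    executed = []
    status = run_step(toy_policy, lambda q: "yes", executed.append,
                      "Open settings", "popup: allow tracking?")
    print(status, executed)
```

In a real agent the policy would be a multimodal model consuming screenshots; the point of the sketch is only that the query is one more output of the policy rather than a separate safety filter.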
What carries the argument
The query-driven human-agent-GUI interaction framework, enabled by a three-stage learning paradigm of supervised fine-tuning and group relative policy optimization that decouples meta-knowledge for deciding when to query humans.
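For readers unfamiliar with group relative policy optimization, its core step is to score each sampled response against the other responses drawn for the same prompt, rather than against a learned value function. The sketch below shows only that normalization. The function name, the use of the population standard deviation, and the example rewards are assumptions for illustration; the paper's actual reward design is not described on this page.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population standard deviation; some implementations
    use the sample standard deviation instead.
    """
    if not rewards:
        raise ValueError("rewards must be non-empty")
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical rewards for four sampled responses to one GUI step,
    # e.g. 1.0 when the act-or-query decision matched the label.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient signal, which is why reward design matters for learning when to query.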
If this is right
- The agent achieves higher step-wise success rates specifically in untrustworthy scenarios compared to baselines.
- Performance in trustworthy scenarios remains comparable to existing agents.
- The training process produces agents with demonstrated rationality in deciding when to involve humans.
- The method supports generalizability across different untrustworthy conditions and scalability to larger tasks.
Where Pith is reading between the lines
- Similar query mechanisms could apply to other GUI-based agents in mobile or web environments.
- Separating meta-knowledge types may prove useful for building safety checks in broader autonomous systems.
- Further tests with varied human response times could reveal practical limits on real-time querying.
Load-bearing premise
The three-stage learning paradigm successfully decouples and utilizes meta-knowledge to enable accurate decisions on when to query humans.
What would settle it
A controlled test in which VeriOS-Agent shows no improvement in step-wise success rates during untrustworthy GUI scenarios or begins querying humans unnecessarily in normal conditions would falsify the central claim.
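The quantities such a test would compare can be sketched as follows. The `Step` record and its field names are hypothetical, and the paper's exact definition of step-wise success may differ; this only shows how the two failure modes named above (no gain in untrustworthy scenarios, unnecessary queries in normal ones) map onto separate metrics.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Step:
    untrustworthy: bool  # ground-truth scenario label (hypothetical field)
    queried: bool        # did the agent query the human at this step?
    correct: bool        # did the step count as a success?


def _rate(hits: int, total: int) -> float:
    return hits / total if total else float("nan")


def summarize(steps: Iterable[Step]) -> Dict[str, float]:
    """Split step-wise success by scenario and count needless queries."""
    steps = list(steps)
    untrusted = [s for s in steps if s.untrustworthy]
    normal = [s for s in steps if not s.untrustworthy]
    return {
        "untrustworthy_success": _rate(sum(s.correct for s in untrusted), len(untrusted)),
        "normal_success": _rate(sum(s.correct for s in normal), len(normal)),
        # Queries issued when the scenario was actually normal.
        "unnecessary_query_rate": _rate(sum(s.queried for s in normal), len(normal)),
    }


if __name__ == "__main__":
    demo = [Step(True, True, True), Step(True, False, False),
            Step(False, False, True), Step(False, True, True)]
    print(summarize(demo))
```

Running the same summary for VeriOS-Agent and for each baseline on identical steps would make the falsification condition concrete: the first metric should rise while the other two stay roughly flat.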
Original abstract
With the rapid progress of multimodal large language models, operating system (OS) agents become increasingly capable of automating tasks through on-device graphical user interfaces (GUIs). However, most existing OS agents are designed for idealized settings, whereas real-world environments often present untrustworthy conditions. To mitigate risks of over-execution in such scenarios, we propose a query-driven human-agent-GUI interaction framework that enables OS agents to decide when to query humans for more reliable task completion. Built upon this framework, we introduce VeriOS-Agent, a trustworthy OS agent trained with a three-stage learning paradigm that falicitate the decoupling and utilization of meta-knowledge by supervised fine-tuning and group relative policy optimization. Concretely, VeriOS-Agent autonomously executes actions in normal conditions while proactively querying humans in untrustworthy scenarios. Experiments show that VeriOS-Agent improves the average step-wise success rate by 19.72% in over the strongest baselines, without compromising normal performance. VeriOS-Agent significantly improves performance in untrustworthy scenarios while maintaining comparable performance in trustworthy scenarios. Analysis highlights VeriOS-Agent's rationality, generalizability, and scalability. The codes, datasets and models are available at https://github.com/Wuzheng02/VeriOS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes VeriOS-Agent, a query-driven proactive human-agent-GUI interaction framework for trustworthy OS agents. Built on a three-stage learning paradigm (supervised fine-tuning followed by group relative policy optimization) that aims to decouple and utilize meta-knowledge, the agent executes actions autonomously under normal conditions but proactively queries humans in untrustworthy scenarios. The central empirical result is a 19.72% improvement in average step-wise success rate over the strongest baselines in untrustworthy scenarios, with no compromise to normal-scenario performance. The work also reports analysis of rationality, generalizability, and scalability, and releases code, datasets, and models.
Significance. If the reported performance gains are robustly supported, the work would make a meaningful contribution to reliable GUI-based OS agents by addressing over-execution risks in real-world untrustworthy conditions. The open release of code, datasets, and models at the provided GitHub repository is a clear strength that supports reproducibility and follow-on research.
major comments (2)
- [Experiments] Experiments section: The central claim of a 19.72% improvement in average step-wise success rate (and comparable normal performance) is presented without details on experimental setup. No information is given on how untrustworthy scenarios were defined or sampled, which baselines were used, the number of trials or runs, error bars, or statistical significance testing. This directly undermines evaluation of the load-bearing empirical result.
- [Method] Method section: The three-stage learning paradigm is described at a high level as enabling decoupling of meta-knowledge via supervised fine-tuning and group relative policy optimization, but no equations, algorithm pseudocode, loss formulations, or hyperparameter details are provided. This makes it impossible to verify how the paradigm produces the claimed proactive querying behavior.
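As an illustration of the error bars and significance testing requested in the Experiments comment, one common option is a paired bootstrap over per-step outcomes. The sketch below is an assumption about how such an analysis could be run, not something the paper reports; `bootstrap_diff_ci` and its parameters are hypothetical names, and the inputs are 0/1 success indicators for the agent and a baseline on the same steps.

```python
import random
from typing import List, Tuple


def bootstrap_diff_ci(agent: List[int], baseline: List[int],
                      n_resamples: int = 2000, alpha: float = 0.05,
                      seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for the mean paired difference in success.

    Returns (point_estimate, lower, upper). Steps are resampled with
    replacement, keeping each agent/baseline pair together.
    """
    if len(agent) != len(baseline) or not agent:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(agent)
    diffs = [a - b for a, b in zip(agent, baseline)]
    point = sum(diffs) / n
    samples = []
    for _ in range(n_resamples):
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        samples.append(sum(resample) / n)
    samples.sort()
    lower = samples[int((alpha / 2) * n_resamples)]
    upper = samples[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return point, lower, upper


if __name__ == "__main__":
    # Hypothetical per-step outcomes, not data from the paper.
    agent = [1, 1, 0, 1, 1, 0, 1, 1]
    baseline = [1, 0, 0, 1, 0, 0, 1, 0]
    print(bootstrap_diff_ci(agent, baseline))
```

An interval that excludes zero would support a claimed improvement at roughly the chosen level; steps within one episode are correlated, so resampling whole episodes rather than single steps may be the more defensible choice.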
minor comments (3)
- [Abstract] Abstract: Typo 'falicitate' should be 'facilitate'.
- [Abstract] Abstract: The phrasing 'by 19.72% in over the strongest baselines' contains a grammatical error and should read 'by 19.72% over the strongest baselines'.
- [Abstract] Abstract: The two performance sentences ('Experiments show that...' and 'VeriOS-Agent significantly improves...') are largely redundant; the second repeats the claim already stated in the first.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to enhance the clarity and completeness of the experimental and methodological sections.
- Referee: [Experiments] Experiments section: The central claim of a 19.72% improvement in average step-wise success rate (and comparable normal performance) is presented without details on experimental setup. No information is given on how untrustworthy scenarios were defined or sampled, which baselines were used, the number of trials or runs, error bars, or statistical significance testing. This directly undermines evaluation of the load-bearing empirical result.
Authors: We thank the referee for highlighting this issue. Upon review, we recognize that the experimental setup details could be presented more comprehensively. In the revised version of the manuscript, we will add explicit information on the definition and sampling of untrustworthy scenarios, the complete list of baselines used, the number of trials and runs performed, error bars on the reported metrics, and the results of statistical significance testing.
revision: yes
- Referee: [Method] Method section: The three-stage learning paradigm is described at a high level as enabling decoupling of meta-knowledge via supervised fine-tuning and group relative policy optimization, but no equations, algorithm pseudocode, loss formulations, or hyperparameter details are provided. This makes it impossible to verify how the paradigm produces the claimed proactive querying behavior.
Authors: We agree with the referee that additional details on the three-stage learning paradigm would aid in understanding and reproducibility. We will revise the Method section to include the relevant equations, algorithm pseudocode, loss formulations, and hyperparameter details for the supervised fine-tuning and group relative policy optimization stages.
revision: yes
Circularity Check
No significant circularity
The paper is an empirical systems contribution describing a query-driven human-agent-GUI framework and a three-stage training process (supervised fine-tuning followed by group relative policy optimization) for VeriOS-Agent. All performance claims, including the 19.72% step-wise success rate improvement in untrustworthy scenarios, are presented as direct experimental measurements on benchmarks rather than quantities derived from equations or fitted parameters within the paper. No mathematical derivations, uniqueness theorems, or ansatzes appear; the method description remains at the level of a standard training pipeline without reducing to self-definition or self-citation chains. The work is therefore self-contained against external benchmarks.
Forward citations
Cited by 2 Pith papers
- OS-SPEAR: A Toolkit for the Safety, Performance, Efficiency, and Robustness Analysis of OS Agents
OS-SPEAR is a new evaluation toolkit that tests 22 OS agents and identifies trade-offs between efficiency and safety or robustness.
- Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization
The work creates a new benchmark for humanizing GUI agent touch dynamics via a MinMax detector-agent model, a mobile touch dataset, and methods showing agents can match human behavior without losing task performance.