Benchmarking and Improving GUI Agents in High-Dynamic Environments
Pith reviewed 2026-05-11 00:49 UTC · model grok-4.3
The pith
GUI agents handle changing interfaces better when they process screen-recording videos rather than a single screenshot after each action.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DynamicUI takes screen-recording videos as input. A dynamic perceiver clusters the frames, captions the cluster centroids, and iteratively selects the most informative frames as salient dynamic context. An action-conditioned refinement strategy then filters the agent's thoughts to reduce inconsistency and redundancy with its textual context, and a reflection module draws guidance from the cleaned trajectories to improve subsequent actions. Together, these components address the partial observability that single-screenshot agents face in environments where important GUI state changes between steps.
What carries the argument
The dynamic perceiver that clusters video frames, captions centroids, and selects salient frames, paired with action-conditioned filtering and trajectory reflection to supply changing context.
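The paper gives no pseudocode for the perceiver, so the cluster-then-select idea can only be sketched. A minimal sketch, assuming frames have already been embedded as feature vectors; the toy k-means, the number of clusters `k`, and Euclidean distance are all illustrative assumptions, not the paper's implementation:

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two frame feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def kmeans(frames, k, iters=10, seed=0):
    """Toy k-means over frame feature vectors (lists of floats)."""
    rng = random.Random(seed)
    centroids = rng.sample(frames, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for f in frames:
            j = min(range(k), key=lambda c: dist(f, centroids[c]))
            clusters[j].append(f)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [mean(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return centroids

def select_salient(frames, k):
    """Pick the frame nearest each cluster centroid as the salient context."""
    centroids = kmeans(frames, k)
    return [min(frames, key=lambda f: dist(f, c)) for c in centroids]
```

In the paper's pipeline the selected frames would then be captioned and iteratively re-ranked; this sketch stops at centroid selection.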
If this is right
- Agents gain awareness of interface transitions and animations that occur after an action but before the next screenshot.
- Decision quality rises in tasks where information needed for the correct action disappears or appears only in intermediate states.
- The same video-based pipeline can be added to existing supervised or reinforcement-trained agents without retraining the core policy.
- Reflection over cleaned trajectories produces more accurate long-term guidance than immediate single-frame reflection.
Where Pith is reading between the lines
- Similar video-perception modules could help agents in other partially observable settings such as mobile apps with pop-ups or web pages with dynamic content loading.
- Benchmarks focused on state change frequency may expose that many reported agent successes rely on static interfaces and would drop in real-world use.
- Longer interaction videos might require additional compression or summarization steps to keep computation feasible while preserving change detection.
- Integrating the perceiver directly into the agent's vision encoder could reduce the separate clustering step and make the method more end-to-end.
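The compression point above could be prototyped by dropping near-duplicate consecutive frames before any clustering. A minimal sketch, assuming frames are feature vectors and using an illustrative L1-difference threshold (neither is from the paper):

```python
def compress_video(frames, threshold=0.5):
    """Keep the first frame and any frame that differs enough from the
    last kept frame; near-static stretches collapse to one representative."""
    if not frames:
        return []
    kept = [frames[0]]
    for f in frames[1:]:
        delta = sum(abs(x - y) for x, y in zip(f, kept[-1]))
        if delta > threshold:  # a state change worth preserving
            kept.append(f)
    return kept
```

This keeps change detection cheap and linear in video length, at the cost of possibly missing slow drifts that stay under the threshold.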
Load-bearing premise
Clustering and captioning video frames plus filtering will reliably surface the key changing states that single frames miss, without introducing new noise or inconsistencies that hurt decisions.
What would settle it
A controlled test on a DynamicGUIBench task engineered so that the selected video frames omit a critical, success-determining interface change; if DynamicUI then shows no gain, or performs worse than a single-frame baseline, the frame-selection premise fails.
Original abstract
Recent advancements in Graphical User Interface (GUI) agents have predominantly focused on training paradigms like supervised fine-tuning (SFT) and reinforcement learning (RL). However, the challenge of high-dynamic GUI environments remains largely underexplored. Existing agents typically rely on a single screenshot after each action for decision-making, leading to a partially observable (or even unobservable) Markov decision process, where the key GUI state including important information for actions is often inadequately captured. To systematically explore this challenge, we introduce DynamicGUIBench, a comprehensive online GUI benchmark spanning ten applications and diverse interaction scenarios characterized by important interface changes between actions. Furthermore, we present DynamicUI, an agent designed for dynamic interfaces, which takes screen-recording videos of the interaction process as input and consists of three components: a dynamic perceiver, a refinement strategy, and a reflection. Specifically, the dynamic perceiver clusters frames of the GUI video, generates captions for the centroids, and iteratively selects the most informative frames as the salient dynamic context. Considering that there may be inconsistencies and noise between the selected frames and the textual context of the agent, the refinement strategy employs an action-conditioned filtering to refine thoughts to mitigate thought-action inconsistency and redundancy. Based on the refined agent trajectories, the reflection module provides effective and accurate guidance for further actions. Experiments on DynamicGUIBench demonstrate that DynamicUI significantly improves the performance in dynamic GUI environments, while maintaining competitive performance on other public benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces DynamicGUIBench, a new online benchmark spanning ten GUI applications with diverse dynamic interaction scenarios where interface states change significantly between actions. It proposes DynamicUI, an agent that ingests screen-recording videos rather than single screenshots and comprises a dynamic perceiver (frame clustering, centroid captioning, and iterative selection of salient frames), an action-conditioned refinement strategy to reduce thought-action inconsistencies, and a reflection module that generates guidance from refined trajectories. The central claim is that this video-based approach yields significant performance gains on DynamicGUIBench while remaining competitive on existing public benchmarks.
Significance. If the experimental results hold, the work would be significant for the GUI-agent community by directly targeting the partial-observability problem that single-screenshot agents face in high-dynamic settings. The new benchmark itself provides a concrete, reproducible testbed that could drive future research on video-aware or state-tracking agents.
major comments (2)
- [Abstract] The claim that DynamicUI 'significantly improves the performance' is asserted without quantitative metrics, baselines, success rates, or statistical detail; this is load-bearing for the central claim because the entire contribution rests on demonstrating that the video-based components outperform single-frame baselines.
- [Methods (Dynamic Perceiver and Refinement)] The dynamic perceiver description (clustering frames, captioning centroids, action-conditioned filtering) is presented at a high level; without an ablation that isolates the contribution of each sub-component or a concrete metric showing how well the selected frames capture state changes missed by single screenshots, it is difficult to verify that the method avoids introducing new inconsistencies or excessive computation.
minor comments (2)
- [Abstract / Methods] The phrase 'salient dynamic context' is used without a precise definition or pseudocode; adding a short formal description or algorithm box would improve reproducibility.
- [Benchmark Description] The paper should explicitly list the ten applications in DynamicGUIBench and the exact interaction scenarios used for evaluation so readers can assess coverage of real-world dynamic GUIs.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. The comments highlight important areas for improving clarity and empirical support, and we have revised the paper to address them directly.
Point-by-point responses
Referee: [Abstract] The claim that DynamicUI 'significantly improves the performance' is asserted without quantitative metrics, baselines, success rates, or statistical detail; this is load-bearing for the central claim because the entire contribution rests on demonstrating that the video-based components outperform single-frame baselines.
Authors: We agree that the original abstract would be strengthened by including quantitative support for the performance claim. In the revised version, we have updated the abstract to report key metrics from our DynamicGUIBench experiments, including success rates for DynamicUI versus single-frame baselines and prior methods, along with references to the full results tables and statistical details in Section 4. This change ensures the abstract accurately conveys the empirical evidence without altering the manuscript's core claims. Revision: yes.
Referee: [Methods (Dynamic Perceiver and Refinement)] The dynamic perceiver description (clustering frames, captioning centroids, action-conditioned filtering) is presented at a high level; without an ablation that isolates the contribution of each sub-component or a concrete metric showing how well the selected frames capture state changes missed by single screenshots, it is difficult to verify that the method avoids introducing new inconsistencies or excessive computation.
Authors: We acknowledge that the main-text description of the dynamic perceiver and refinement components is high-level. In the revision, we have expanded the Methods section with additional algorithmic details on frame clustering, centroid captioning, iterative selection, and action-conditioned filtering. We have also added a dedicated ablation study (a new subsection in Experiments) that isolates each sub-component's contribution, along with concrete metrics such as the percentage reduction in missed state transitions (computed by comparing selected frames against full interaction videos) and measured computational overhead. Qualitative examples and consistency scores between selected frames and agent thoughts are included to demonstrate that the approach does not introduce new inconsistencies. These additions are placed in the main paper to facilitate verification. Revision: yes.
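The rebuttal's "percentage reduction in missed state transitions" metric is not formally specified. One plausible formalization, with the tolerance window `tol` as an assumed free parameter: count a ground-truth transition as covered if some selected frame index falls within `tol` frames of it.

```python
def missed_transition_rate(transitions, selected, tol=2):
    """Fraction of ground-truth transition frame indices with no selected
    frame within `tol` frames (lower is better)."""
    if not transitions:
        return 0.0
    missed = sum(
        1 for t in transitions
        if not any(abs(t - s) <= tol for s in selected)
    )
    return missed / len(transitions)
```

Comparing this rate for the perceiver's selected frames against a single-screenshot baseline would give the concrete evidence the referee asks for.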
Circularity Check
No significant circularity identified
Rationale
The paper's central contribution is the introduction of DynamicGUIBench (a new online benchmark across ten applications) and DynamicUI (an agent that ingests screen-recording videos, clusters frames, captions centroids, applies action-conditioned filtering, and uses reflection). These elements are presented as direct responses to the partial-observability issue stated in the abstract, with performance gains demonstrated empirically on the new benchmark and maintained competitiveness on public ones. No equations, fitted parameters, or self-citations reduce the claimed improvements to their inputs by construction; the derivation chain remains self-contained and externally falsifiable via the reported experiments.