
arxiv: 2605.16883 · v1 · pith:RWN7GKM5new · submitted 2026-05-16 · 💻 cs.LG

SE-GA: Memory-Augmented Self-Evolution for GUI Agents

Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3

classification 💻 cs.LG
keywords GUI agents · memory augmentation · self-evolution · test-time extension · autonomous agents · multi-step tasks · benchmark evaluation

The pith

The SE-GA framework lets GUI agents self-evolve by retrieving memories at test time and retraining on the resulting data to reach higher success rates on multi-step tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that current GUI agents are limited by short context windows and fixed policies that prevent effective handling of complex, changing interfaces. It introduces hierarchical memory retrieval during operation plus a training loop that turns those experiences into policy updates. A sympathetic reader would care because successful agents could automate routine computer interactions more reliably without needing constant retraining or human guidance.

Core claim

The central claim is that two components together produce state-of-the-art results: Test-Time Memory Extension (TTME), which dynamically retrieves episodic, semantic, and experiential memories during inference, and Memory-Augmented Self-Evolution (MASE), which trains on the data gathered during inference. The reported results are 89.0% success on ScreenSpot and 75.8% on AndroidControl-High, plus clear gains on AndroidWorld.

What carries the argument

Test-Time Memory Extension (TTME) for retrieving hierarchical memories to support long-term planning during inference, paired with Memory-Augmented Self-Evolution (MASE) as the training pipeline that uses TTME data to improve the base policy.
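
To make the retrieval idea concrete, here is a minimal sketch of pulling one item from each of the three memory types and prepending them to the task prompt. It is an illustration only, not the paper's implementation: the `MemoryItem` class, the `retrieve` and `build_context` functions, the example memories, and the token-overlap (Jaccard) scoring are all assumptions made for this sketch, standing in for whatever store and retriever SE-GA actually uses.

```python
from dataclasses import dataclass

# The three memory types named in the paper; everything else here is hypothetical.
KINDS = ("episodic", "semantic", "experiential")


@dataclass(frozen=True)
class MemoryItem:
    kind: str  # one of KINDS
    text: str  # free-text memory content


def _tokens(s: str) -> set[str]:
    return set(s.lower().split())


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a stand-in for a learned retriever."""
    ta, tb = _tokens(a), _tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query: str, store: list[MemoryItem], k_per_kind: int = 1) -> dict[str, list[MemoryItem]]:
    """Return the top-k most similar items for each memory kind."""
    hits = {}
    for kind in KINDS:
        pool = [m for m in store if m.kind == kind]
        ranked = sorted(pool, key=lambda m: jaccard(query, m.text), reverse=True)
        hits[kind] = ranked[:k_per_kind]
    return hits


def build_context(query: str, store: list[MemoryItem], k_per_kind: int = 1) -> str:
    """Concatenate retrieved memories, then the task, into one prompt string."""
    hits = retrieve(query, store, k_per_kind)
    lines = [f"[{kind}] {m.text}" for kind in KINDS for m in hits[kind]]
    return "\n".join(lines + [f"[task] {query}"])


# Toy memories invented for this example.
store = [
    MemoryItem("episodic", "opened clock app and tapped alarm tab"),
    MemoryItem("episodic", "searched for sports movie in browser"),
    MemoryItem("semantic", "the clock app has alarm timer and stopwatch tabs"),
    MemoryItem("experiential", "set a reminder by creating an alarm in the clock app"),
]
query = "set a reminder in the clock app"
context = build_context(query, store)
print(context)
```

Retrieving per kind rather than from one pooled ranking keeps each memory type represented in the context even when one type dominates the similarity scores; whether SE-GA makes the same choice is not stated in the material above.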

Load-bearing premise

The data collected by TTME during inference is of sufficient quality and diversity to stabilize and enhance the foundational policy through the MASE training pipeline without introducing harmful biases or catastrophic forgetting.

What would settle it

If running MASE training on TTME-collected data produces no improvement or causes performance drops on the original benchmarks, that would show the self-evolution mechanism does not deliver the claimed benefits.
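
The check described above amounts to a before/after comparison per benchmark. The sketch below shows one way to express it, assuming per-task boolean outcomes are available for the policy before and after a MASE round. The function names, the `tolerance` parameter, and the benchmark labels and outcomes are hypothetical and are not results from the paper.

```python
def success_rate(outcomes: list[bool]) -> float:
    """Fraction of tasks completed successfully."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def self_evolution_verdict(
    before: dict[str, list[bool]],
    after: dict[str, list[bool]],
    tolerance: float = 0.0,
) -> dict[str, str]:
    """Label each benchmark as improved, regressed, or unchanged.

    `tolerance` is an assumed dead band around zero so that tiny
    fluctuations are not read as real changes.
    """
    verdict = {}
    for bench in before:
        delta = success_rate(after[bench]) - success_rate(before[bench])
        if delta > tolerance:
            verdict[bench] = "improved"
        elif delta < -tolerance:
            verdict[bench] = "regressed"
        else:
            verdict[bench] = "unchanged"
    return verdict


# Toy outcomes invented for illustration only.
before = {"bench_a": [True, False, False, True], "bench_b": [True, True, False, False]}
after = {"bench_a": [True, True, False, True], "bench_b": [True, False, False, False]}
print(self_evolution_verdict(before, after))
```

A "regressed" or "unchanged" label on the original benchmarks after training on TTME-collected data is the outcome that would count against the self-evolution claim.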

Figures

Figures reproduced from arXiv: 2605.16883 by Lanjun Wang, Shilong Jin, Zhuosheng Zhang.

Figure 1.
Figure 2. A failure trajectory example. The task instruction is “Using BBC Sports, find out when the next MLB game is scheduled and then create a reminder in Microsoft To Do.”
Figure 3. A successful trajectory example. By using Hindsight Goal-Shifting, the agent successfully discovers and executes a sequence of actions that complete the assigned task. The new task instruction is “Using BBC Sports, find out the next MLB game in the search bar.”
Figure 4. A failure trajectory example of UI-TARS. The task instruction is “Plan an evening of sports-themed entertainment by selecting a sports movie using DuckDuckgo and adding some snacks to your Amazon shopping cart. Invite Victor James through Facebook Messenger, and set a reminder on your Clock app so you don’t forget.”
Figure 5. A successful trajectory example of SE-GA. The task instruction is “Plan an evening of sports-themed entertainment by selecting a sports movie using DuckDuckgo and adding some snacks to your Amazon shopping cart. Invite Victor James through Facebook Messenger, and set a reminder on your Clock app so you don’t forget.”
Figure 6. SE-GA Performance on Different Task Steps.
Original abstract

Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: https://github.com/jinshilong-dev/SE-GA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces SE-GA, a GUI agent framework combining Test-Time Memory Extension (TTME) for hierarchical (episodic/semantic/experiential) memory retrieval during inference with Memory-Augmented Self-Evolution (MASE), a training pipeline that uses TTME-collected trajectories to iteratively refine the base policy. It reports state-of-the-art empirical results: 89.0% success on ScreenSpot, 75.8% on AndroidControl-High, and gains on AndroidWorld, attributing improvements to better long-term planning and continuous adaptation in dynamic environments.

Significance. If the empirical claims hold after proper validation, the work could meaningfully advance autonomous GUI agents by addressing context-window limits and static-policy brittleness through memory-augmented self-improvement. The open-sourced code is a positive factor for reproducibility.

major comments (2)
  1. [MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.
  2. [Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.
minor comments (1)
  1. [Abstract] The abstract mentions 'extensive evaluations across both offline and online benchmarks' but does not name the full set of benchmarks or datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and describe the revisions we intend to incorporate in the updated version.

Point-by-point responses
  1. Referee: [MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.

    Authors: We agree with the referee that the MASE pipeline description in the Methods section would be strengthened by including explicit details on quality control mechanisms. Although the current manuscript focuses on the overall framework, we will revise the paper to add a description of how TTME-collected trajectories are filtered for quality (retaining only those that successfully complete the task), the implicit reward signal derived from task success rates, verification through execution outcome logs, and monitoring for policy drift by periodically evaluating the evolved policy on a separate validation set. These additions will help demonstrate that the self-evolution process reliably improves the policy without degradation.

    revision: yes

  2. Referee: [Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.

    Authors: We acknowledge that the Experiments and results section can be improved for better rigor. In the revised manuscript, we will include statistical significance tests, such as reporting p-values from appropriate tests comparing SE-GA to baselines. We will also provide more precise details on how baselines were re-implemented, including any specific adaptations to match our evaluation setup. Furthermore, we will add ablation experiments that isolate the contributions of TTME and MASE by comparing variants with and without each component. These revisions will better support the attribution of performance gains to the proposed mechanisms.

    revision: yes
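
Neither the report nor the rebuttal names a specific significance test. As one hedged illustration, when a baseline and SE-GA are run on the same tasks and each task yields a pass/fail outcome, an exact McNemar test on the discordant pairs is a common choice. The function name and the toy outcomes below are assumptions for this sketch, not the authors' procedure or data.

```python
from math import comb


def mcnemar_exact(baseline: list[bool], candidate: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired pass/fail outcomes.

    Only discordant pairs matter: tasks where exactly one of the two
    systems succeeds. Under the null hypothesis each discordant pair is
    equally likely to favor either system (a fair coin flip).
    """
    if len(baseline) != len(candidate):
        raise ValueError("outcome lists must cover the same tasks")
    b = sum(1 for x, y in zip(baseline, candidate) if not x and y)  # candidate-only wins
    c = sum(1 for x, y in zip(baseline, candidate) if x and not y)  # baseline-only wins
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


# Toy paired outcomes invented for illustration: 10 tasks, same order in both lists.
baseline = [False] * 8 + [True, True]
candidate = [True] * 8 + [True, False]
print(mcnemar_exact(baseline, candidate))
```

A paired test is used here because both systems face identical tasks, so per-task difficulty cancels out; an unpaired comparison of two success rates would discard that information.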

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an empirical framework (SE-GA) that combines Test-Time Memory Extension (TTME) for dynamic memory retrieval during inference with Memory-Augmented Self-Evolution (MASE) to iteratively improve the base policy using TTME-collected trajectories. Central claims consist of measured success rates on external public benchmarks (89.0% on ScreenSpot, 75.8% on AndroidControl-High, gains on AndroidWorld) rather than any internally defined quantities that reduce to fitted parameters or self-referential definitions by construction. No equations, uniqueness theorems, or self-citation chains are invoked to force the reported outcomes; the approach follows standard practices of training on generated data without the prediction step being statistically forced by the input fit itself. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the framework relies on standard assumptions about memory retrieval quality and the stability of iterative self-training.

pith-pipeline@v0.9.0 · 5728 in / 1074 out tokens · 42103 ms · 2026-05-19T20:34:46.917590+00:00 · methodology


