SE-GA: Memory-Augmented Self-Evolution for GUI Agents

arXiv: 2605.16883 · v1 · pith:RWN7GKM5 · submitted 2026-05-16 · 💻 cs.LG

Shilong Jin, Lanjun Wang, Zhuosheng Zhang

Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3

classification 💻 cs.LG

keywords GUI agents · memory augmentation · self-evolution · test-time extension · autonomous agents · multi-step tasks · benchmark evaluation

The pith

The SE-GA framework lets GUI agents self-evolve by retrieving memories at test time and retraining on the resulting data to reach higher success rates on multi-step tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tries to establish that current GUI agents are limited by short context windows and fixed policies that prevent effective handling of complex, changing interfaces. It introduces hierarchical memory retrieval during operation plus a training loop that turns those experiences into policy updates. A sympathetic reader would care because successful agents could automate routine computer interactions more reliably without needing constant retraining or human guidance.

Core claim

The central claim is that integrating Test-Time Memory Extension (TTME) for dynamic retrieval of episodic, semantic, and experiential memories with Memory-Augmented Self-Evolution (MASE) training on the data gathered during inference produces state-of-the-art results, including 89.0% success on ScreenSpot and 75.8% on AndroidControl-High plus clear gains on AndroidWorld.

What carries the argument

Test-Time Memory Extension (TTME) for retrieving hierarchical memories to support long-term planning during inference, paired with Memory-Augmented Self-Evolution (MASE) as the training pipeline that uses TTME data to improve the base policy.
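To make the retrieval half of this concrete, here is a minimal sketch of a three-tier memory store queried at inference time. It is an illustration under stated assumptions, not the paper's implementation: the class and function names (`HierarchicalMemory`, `retrieve`, `build_prompt`) are hypothetical, and token-overlap (Jaccard) similarity stands in for whatever retriever SE-GA actually uses.

```python
from dataclasses import dataclass, field

# Tier names follow the paper's terminology; everything else is illustrative.
TIERS = ("episodic", "semantic", "experiential")


@dataclass
class MemoryItem:
    tier: str
    text: str


def tokenize(text: str) -> set[str]:
    """Lowercase whitespace tokenization (assumption: stands in for an embedding model)."""
    return set(text.lower().split())


def jaccard(a: set[str], b: set[str]) -> float:
    """Token-overlap similarity in [0, 1]."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


@dataclass
class HierarchicalMemory:
    items: list[MemoryItem] = field(default_factory=list)

    def add(self, tier: str, text: str) -> None:
        if tier not in TIERS:
            raise ValueError(f"unknown tier: {tier}")
        self.items.append(MemoryItem(tier, text))

    def retrieve(self, query: str, k_per_tier: int = 1) -> dict[str, list[str]]:
        """Return up to k entries per tier, most similar first, dropping zero-overlap entries."""
        q = tokenize(query)
        result: dict[str, list[str]] = {}
        for tier in TIERS:
            scored = [
                (jaccard(q, tokenize(m.text)), m.text)
                for m in self.items
                if m.tier == tier
            ]
            scored.sort(key=lambda pair: pair[0], reverse=True)
            result[tier] = [text for score, text in scored[:k_per_tier] if score > 0]
        return result


def build_prompt(task: str, memory: HierarchicalMemory) -> str:
    """Prepend the retrieved memories to the task so the policy sees them as context."""
    retrieved = memory.retrieve(task)
    lines = [f"[{tier}] {text}" for tier, texts in retrieved.items() for text in texts]
    return "\n".join(lines + [f"Task: {task}"])
```

Retrieving a fixed number of entries per tier, rather than a global top-k, keeps one tier from crowding the others out of a short context window. Whether SE-GA balances tiers this way is not stated in the material above.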

Load-bearing premise

The data collected by TTME during inference is of sufficient quality and diversity to stabilize and enhance the foundational policy through the MASE training pipeline without introducing harmful biases or catastrophic forgetting.

What would settle it

If running MASE training on TTME-collected data produces no improvement or causes performance drops on the original benchmarks, that would show the self-evolution mechanism does not deliver the claimed benefits.
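The settling test above can be phrased as a gate around each self-evolution round. The sketch below is hypothetical: `collect`, `train`, and `evaluate` are caller-supplied placeholders rather than SE-GA APIs, and the success-only filter follows what the simulated rebuttal further down proposes, not a confirmed detail of the paper.

```python
from dataclasses import dataclass
from typing import Callable, TypeVar

P = TypeVar("P")  # whatever object represents a policy


@dataclass
class Trajectory:
    task: str
    actions: list[str]
    success: bool


def filter_successful(trajectories: list[Trajectory]) -> list[Trajectory]:
    """Keep only trajectories that completed their task."""
    return [t for t in trajectories if t.success]


def self_evolution_round(
    policy: P,
    collect: Callable[[P], list[Trajectory]],
    train: Callable[[P, list[Trajectory]], P],
    evaluate: Callable[[P], float],
    tolerance: float = 0.0,
) -> tuple[P, float, bool]:
    """Run one collect/filter/train round and accept the result only if it does not regress.

    Returns (policy, validation score, accepted). `evaluate` is assumed to score
    a policy on a held-out validation set that is never used for training.
    """
    baseline = evaluate(policy)
    data = filter_successful(collect(policy))
    if not data:
        # Nothing trustworthy to learn from: keep the current policy.
        return policy, baseline, False
    candidate = train(policy, data)
    score = evaluate(candidate)
    if score + tolerance < baseline:
        # Validation score dropped by more than the tolerance: reject the update.
        return policy, baseline, False
    return candidate, score, True
```

Logging the `accepted` flag and both scores for every round gives exactly the evidence the settling test asks for: a run of rejected or flat rounds would indicate that training on TTME-collected data is not helping.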

Figures

Figures reproduced from arXiv: 2605.16883 by Lanjun Wang, Shilong Jin, Zhuosheng Zhang.

**Figure 2.** A failure trajectory example. The task instruction is “Using BBC Sports, find out when the next MLB game is scheduled and then create a reminder in Microsoft To Do.”

**Figure 3.** A successful trajectory example. By using Hindsight Goal-Shifting, the agent successfully discovers and executes a sequence of actions that complete the assigned task. The new task instruction is “Using BBC Sports, find out the next MLB game in the search bar.”

**Figure 4.** A failure trajectory example of UI-TARS. The task instruction is “Plan an evening of sports-themed entertainment by selecting a sports movie using DuckDuckgo and adding some snacks to your Amazon shopping cart. Invite Victor James through Facebook Messenger, and set a reminder on your Clock app so you don’t forget.”

**Figure 5.** A successful trajectory example of SE-GA. The task instruction is “Plan an evening of sports-themed entertainment by selecting a sports movie using DuckDuckgo and adding some snacks to your Amazon shopping cart. Invite Victor James through Facebook Messenger, and set a reminder on your Clock app so you don’t forget.”

**Figure 6.** SE-GA Performance on Different Task Steps.
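The Hindsight Goal-Shifting step named in the Figure 3 caption appears to relabel a failed run with a goal the agent did reach, in the spirit of hindsight experience replay. The sketch below shows one way such relabeling could work. It assumes each step carries an optional `achieved` annotation from some verifier; that annotation, and the names `Step` and `shift_goal`, are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Step:
    action: str
    # Description of the sub-goal this step completed, if any (hypothetical annotation).
    achieved: Optional[str] = None


def shift_goal(steps: list[Step]) -> Optional[tuple[str, list[Step]]]:
    """Relabel a failed trajectory with the last sub-goal it actually completed.

    Returns (new_instruction, truncated_steps), or None when no sub-goal was
    reached and the trajectory has nothing to salvage.
    """
    last = max((i for i, s in enumerate(steps) if s.achieved), default=None)
    if last is None:
        return None
    # Drop everything after the last verified sub-goal so the relabeled
    # trajectory ends in success under its new instruction.
    return steps[last].achieved, steps[: last + 1]
```

Applied to the Figure 2 failure, a run that located the MLB schedule but never created the reminder would be truncated at the search step and stored under the shorter instruction quoted in Figure 3.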

Original abstract

Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: https://github.com/jinshilong-dev/SE-GA

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

SE-GA pairs test-time hierarchical memory retrieval with a self-evolution loop on GUI trajectories and reports clear benchmark lifts, but the loop's data quality controls are not shown in enough detail. The concrete addition is TTME, which pulls episodic, semantic, and experiential memories at inference to extend context for multi-step GUI tasks, plus MASE, which feeds the resulting runs back into policy updates. This is a direct engineering extension of prior memory-augmented and self-improvement agent work, applied specifically to GUI settings where static policies and short contexts are common limits.

The evaluations cover both offline and online benchmarks and show 89% success on ScreenSpot, 75.8% on AndroidControl-High, and gains on AndroidWorld for dynamic environments. Releasing the code helps others inspect the implementation and try the pattern themselves. The results look usable for automation or accessibility tools that need longer-horizon planning.

The softer spot is the self-evolution step. The description does not detail quality filters on the collected trajectories, reward signals used in MASE, or checks for error accumulation and policy drift. In GUI environments, agent-generated paths can contain compounding mistakes, so without those safeguards it is not yet clear whether the gains come from better retrieval or from simply amplifying successful runs.

This paper is aimed at researchers and engineers already building or extending GUI agents. Someone looking for a concrete pattern to add memory retrieval and iterative improvement to an existing base model would find the methods and numbers worth examining. I would send it to peer review. The claims are empirical and rest on public benchmarks, so referees can check the ablations, any hidden controls in the methods, and whether the data pipeline holds up under closer inspection.

Referee Report

2 major / 1 minor

Summary. The paper introduces SE-GA, a GUI agent framework combining Test-Time Memory Extension (TTME) for hierarchical (episodic/semantic/experiential) memory retrieval during inference with Memory-Augmented Self-Evolution (MASE), a training pipeline that uses TTME-collected trajectories to iteratively refine the base policy. It reports state-of-the-art empirical results: 89.0% success on ScreenSpot, 75.8% on AndroidControl-High, and gains on AndroidWorld, attributing improvements to better long-term planning and continuous adaptation in dynamic environments.

Significance. If the empirical claims hold after proper validation, the work could meaningfully advance autonomous GUI agents by addressing context-window limits and static-policy brittleness through memory-augmented self-improvement. The open-sourced code is a positive factor for reproducibility.

major comments (2)

[MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.
[Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.

minor comments (1)

[Abstract] The abstract mentions 'extensive evaluations across both offline and online benchmarks' but does not name the full set of benchmarks or datasets used.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and describe the revisions we intend to incorporate in the updated version.

Referee: [MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.

Authors: We agree with the referee that the MASE pipeline description in the Methods section would be strengthened by including explicit details on quality control mechanisms. Although the current manuscript focuses on the overall framework, we will revise the paper to add a description of how TTME-collected trajectories are filtered for quality (retaining only those that successfully complete the task), the implicit reward signal derived from task success rates, verification through execution outcome logs, and monitoring for policy drift by periodically evaluating the evolved policy on a separate validation set. These additions will help demonstrate that the self-evolution process reliably improves the policy without degradation.

Revision: yes

Referee: [Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.

Authors: We acknowledge that the Experiments and results section can be improved for better rigor. In the revised manuscript, we will include statistical significance tests, such as reporting p-values from appropriate tests comparing SE-GA to baselines. We will also provide more precise details on how baselines were re-implemented, including any specific adaptations to match our evaluation setup. Furthermore, we will add ablation experiments that isolate the contributions of TTME and MASE by comparing variants with and without each component. These revisions will better support the attribution of performance gains to the proposed mechanisms.

Revision: yes
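As one example of the kind of significance test the rebuttal promises, the sketch below compares two success rates with a pooled two-proportion z-test using only the standard library. The choice of test is an assumption made for illustration; the paper does not say which test it will use, and the function name is hypothetical. Success counts and benchmark sizes must come from the actual evaluation logs.

```python
import math


def two_proportion_z_test(
    success_a: int, n_a: int, success_b: int, n_b: int
) -> tuple[float, float]:
    """Two-sided pooled z-test for a difference between two success rates.

    Returns (z, p_value). Assumes the two runs are independent samples and
    that both are large enough for the normal approximation to hold.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("pooled success rate is 0 or 1; the test is undefined")
    z = (p_a - p_b) / se
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value
```

When both systems are evaluated on the same task instances, the samples are paired rather than independent, and a paired test such as McNemar's would be the more appropriate choice.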

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper describes an empirical framework (SE-GA) that combines Test-Time Memory Extension (TTME) for dynamic memory retrieval during inference with Memory-Augmented Self-Evolution (MASE) to iteratively improve the base policy using TTME-collected trajectories. Central claims consist of measured success rates on external public benchmarks (89.0% on ScreenSpot, 75.8% on AndroidControl-High, gains on AndroidWorld) rather than any internally defined quantities that reduce to fitted parameters or self-referential definitions by construction. No equations, uniqueness theorems, or self-citation chains are invoked to force the reported outcomes; the approach follows standard practices of training on generated data without the prediction step being statistically forced by the input fit itself. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No explicit free parameters, axioms, or invented entities are described in the abstract; the framework relies on standard assumptions about memory retrieval quality and the stability of iterative self-training.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories... Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 24 internal anchors

[1]

Hindsight Experience Replay

Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay, 2018. URL https://arxiv.org/abs/1707.01495

work page internal anchor Pith review Pith/arXiv arXiv 2018
[2]

Our 3.5 models and computer use, 2024

Anthropic. Our 3.5 models and computer use, 2024. URL https://www.anthropic.com/news/3-5-models-and-computer-use

work page 2024
[3]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., Zhong, H., Zhu, Y., Yang, M., Li, Z., Wan, J., Wang, P., Ding, W., Fu, Z., Xu, Y., Ye, J., Zhang, X., Xie, T., Cheng, Z., Zhang, H., Yang, Z., Xu, H., and Lin, J. Qwen2.5-vl technical report, 2025. URL https://arxiv.org/abs/2502.13923

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Amex: Android multi-annotation expo dataset for mobile gui agents

Chai, Y., Huang, S., Niu, Y., Xiao, H., Liu, L., Wang, G., Zhang, D., Ren, S., and Li, H. Amex: Android multi-annotation expo dataset for mobile gui agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 2138–2156. Association for Computational Linguistics, 2025. doi:10.18653/v1/2025.findings-acl.110. URL http://dx.doi.org/10...

work page doi:10.18653/v1/2025.findings-acl.110 2025
[5]

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

Chen, Z., Wang, W., Cao, Y., Liu, Y., Gao, Z., Cui, E., Zhu, J., Ye, S., Tian, H., Liu, Z., et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Maskplan: Masked generative layout planning from partial input

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., Li, B., Luo, P., Lu, T., Qiao, Y., and Dai, J. Intern vl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 24185--24198, 2024 b . doi:10.1109...

work page doi:10.1109/cvpr52733.2024.02283 2024
[7]

SeeClick: Harnessing GUI grounding for advanced visual GUI agents

Cheng, K., Sun, Q., Chu, Y., Xu, F., YanTao, L., Zhang, J., and Wu, Z. S ee C lick: Harnessing GUI grounding for advanced visual GUI agents. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 9313--9332, Bangkok, Thailand, August 2024. As...

work page doi:10.18653/v1/2024.acl-long.505 2024
[8]

OS -kairos: Adaptive interaction for MLLM -powered GUI agents

Cheng, P., Wu, Z., Wu, Z., Ju, T., Zhang, A., Zhang, Z., and Liu, G. OS -kairos: Adaptive interaction for MLLM -powered GUI agents. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 6701--6725, Vienna, Austria, July 2025. Association for Computational Linguistics. IS...

work page doi:10.18653/v1/2025.findings-acl.348 2025
[9]

Mind2web: Towards a generalist agent for the web

Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., and Su, Y. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36: 0 28091--28114, 2023

work page 2023
[10]

Token-hungry, yet precise: Deepseek r1 highlights the need for multi-step reasoning over speed in math, 2025

Evstafev, E. Token-hungry, yet precise: Deepseek r1 highlights the need for multi-step reasoning over speed in math, 2025. URL https://arxiv.org/abs/2501.18576

work page arXiv 2025
[11]

A Comprehensive Survey of Self-Evolving AI Agents: A New Paradigm Bridging Foundation Models and Lifelong Agentic Systems

Fang, J., Peng, Y., Zhang, X., Wang, Y., Yi, X., Zhang, G., Xu, Y., Wu, B., Liu, S., Li, Z., Ren, Z., Aletras, N., Wang, X., Zhou, H., and Meng, Z. A comprehensive survey of self-evolving ai agents: A new paradigm bridging foundation models and lifelong agentic systems, 2025. URL https://arxiv.org/abs/2508.07407

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., and Su, Y. Navigating the digital world as humans do: Universal visual grounding for gui agents. arXiv preprint arXiv:2410.05243, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

Ui-venus technical report: Building high-performance ui agents with rft

Gu, Z., Zeng, Z., Xu, Z., Zhou, X., Shen, S., Liu, Y., Zhou, B., Meng, C., Xia, T., Chen, W., et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025

work page arXiv 2025
[14]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., Zhu, Q., Ma, S., Wang, P., Bi, X., et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[15]

E., Zhou, W., Wang, G., Yin, K., Zhao, Z., Yang, H., Wu, F., Zhang, S., and Wu, F

Hu, X., Xiong, T., Yi, B., Wei, Z., Xiao, R., Chen, Y., Ye, J., Tao, M., Zhou, X., Zhao, Z., Li, Y., Xu, S., Wang, S., Xu, X., Qiao, S., Wang, Z., Kuang, K., Zeng, T., Wang, L., Li, J., Jiang, Y. E., Zhou, W., Wang, G., Yin, K., Zhao, Z., Yang, H., Wu, F., Zhang, S., and Wu, F. OS agents: A survey on MLLM -based agents for computer, phone and browser use....

work page doi:10.18653/v1/2025.acl-long.369 2025
[16]

Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Huang, W., Jia, B., Zhai, Z., Cao, S., Ye, Z., Zhao, F., Xu, Z., Hu, Y., and Lin, S. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[17]

GPT-4o System Card

Hurst, A., Lerer, A., Goucher, A. P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive, and mcp-augmented environments

Kong, Q., Zhang, X., Yang, Z., Gao, N., Liu, C., Tong, P., Cai, C., Zhou, H., Zhang, J., Chen, L., et al. Mobileworld: Benchmarking autonomous mobile agents in agent-user interactive, and mcp-augmented environments. arXiv preprint arXiv:2512.19432, 2025

work page arXiv 2025
[19]

On the effects of data scale on computer control agents

Li, W., Bishop, W., Li, A., Rawles, C., Campbell-Ajala, F., Tyamagundlu, D., and Riva, O. On the effects of data scale on computer control agents. arXiv preprint arXiv:2406.03679, 2024

work page arXiv 2024
[20]

Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S

Lin, K. Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S. W., Wang, L., and Shou, M. Z. Showui: One vision-language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp.\ 19498--19508, 2025

work page 2025
[21]

Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025 a

Liu, G., Zhao, P., Liu, L., Chen, Z., Chai, Y., Ren, S., Wang, H., He, S., and Meng, W. Learnact: Few-shot mobile gui agent with a unified demonstration benchmark, 2025 a . URL https://arxiv.org/abs/2504.13805

work page arXiv 2025
[22]

InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

Liu, Y., Li, P., Xie, C., Hu, X., Han, X., Zhang, S., Yang, H., and Wu, F. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

2025 , publisher =

Lu, Q., Shao, W., Liu, Z., Du, L., Meng, F., Li, B., Chen, B., Huang, S., Zhang, K., and Luo, P. Guiodyssey: A comprehensive dataset for cross-app gui navigation on mobile devices, 2025 a . URL https://arxiv.org/abs/2406.08451

work page arXiv 2025
[24]

Omniparser for pure vision based gui agent.arXiv preprint arXiv:2408.00203, 2024

Lu, Y., Yang, J., Shen, Y., and Awadallah, A. Omniparser for pure vision based gui agent. arXiv preprint arXiv:2408.00203, 2024

work page arXiv 2024
[25]

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Lu, Z., Chai, Y., Guo, Y., Yin, X., Liu, L., Wang, H., Xiao, H., Ren, S., Xiong, G., and Li, H. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025 b

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

Luo, R., Wang, L., He, W., Chen, L., Li, J., and Xia, X. Gui-r1 : A generalist r1-style vision-language action model for gui agents, 2025. URL https://arxiv.org/abs/2504.10458

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Playing Atari with Deep Reinforcement Learning

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. Playing atari with deep reinforcement learning, 2013. URL https://arxiv.org/abs/1312.5602

work page internal anchor Pith review Pith/arXiv arXiv 2013
[28]

Ahmed, Puneet Mathur, Seunghyun Yoon, Lina Yao, Branislav Kveton, Jihyung Kil, Thien Huu Nguyen, Trung Bui, Tianyi Zhou, Ryan A

Nguyen, D., Chen, J., Wang, Y., Wu, G., Park, N., Hu, Z., Lyu, H., Wu, J., Aponte, R., Xia, Y., Li, X., Shi, J., Chen, H., Lai, V. D., Xie, Z., Kim, S., Zhang, R., Yu, T., Tanjim, M., Ahmed, N. K., Mathur, P., Yoon, S., Yao, L., Kveton, B., Kil, J., Nguyen, T. H., Bui, T., Zhou, T., Rossi, R. A., and Dernoncourt, F. GUI agents: A survey. In Che, W., Naben...

work page doi:10.18653/v1/2025.findings-acl.1158 2025
[29]

Screenagent: a vision language model-driven computer control agent

Niu, R., Li, J., Wang, S., Fu, Y., Hu, X., Leng, X., Kong, H., Chang, Y., and Wang, Q. Screenagent: a vision language model-driven computer control agent. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI '24, 2024. ISBN 978-1-956792-04-1. doi:10.24963/ijcai.2024/711. URL https://doi.org/10.24963/ijcai.2024/711

work page doi:10.24963/ijcai.2024/711 2024
[30]

UGround: Towards Unified Visual Grounding with Unrolled Transformers

Qian, R., Yin, X., Deng, C., Peng, Z., Xiong, J., Zhai, W., and Dou, D. Uground: Towards unified visual grounding with unrolled transformers, 2025. URL https://arxiv.org/abs/2510.03853

work page internal anchor Pith review Pith/arXiv arXiv 2025
[31]

Android in the wild: A large-scale dataset for android device control, 2023

Rawles, C., Li, A., Rodriguez, D., Riva, O., and Lillicrap, T. Android in the wild: A large-scale dataset for android device control, 2023. URL https://arxiv.org/abs/2307.10088

work page arXiv 2023
[32]

AndroidWorld: A Dynamic Benchmarking Environment for Autonomous Agents

Rawles, C., Clinckemaillie, S., Chang, Y., Waltz, J., Lau, G., Fair, M., Li, A., Bishop, W., Li, W., Campbell-Ajala, F., Toyama, D., Berry, R., Tyamagundlu, D., Lillicrap, T., and Riva, O. Androidworld: A dynamic benchmarking environment for autonomous agents, 2024. URL https://arxiv.org/abs/2405.14573

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Shen, H., Liu, P., Li, J., Fang, C., Ma, Y., Liao, J., Shen, Q., Zhang, Z., Zhao, K., Zhang, Q., Xu, R., and Zhao, T. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/abs/2504.07615

work page internal anchor Pith review Pith/arXiv arXiv 2025
[34]

OS -genesis: Automating GUI agent trajectory construction via reverse task synthesis

Sun, Q., Cheng, K., Ding, Z., Jin, C., Wang, Y., Xu, F., Wu, Z., Jia, C., Chen, L., Liu, Z., Kao, B., Li, G., He, J., Qiao, Y., and Wu, Z. OS -genesis: Automating GUI agent trajectory construction via reverse task synthesis. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Proceedings of the 63rd Annual Meeting of the Association for Comp...

work page doi:10.18653/v1/2025.acl-long.277 2025
[35]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V. I., Burnell, R., and et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

Wang, H., Zou, H., Song, H., Feng, J., Fang, J., Lu, J., Liu, L., Luo, Q., Liang, S., Huang, S., et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025 a

work page internal anchor Pith review Pith/arXiv arXiv 2025
[37]

Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception

Wang, J., Xu, H., Ye, J., Yan, M., Shen, W., Zhang, J., Huang, F., and Sang, J. Mobile-agent: Autonomous multi-modal mobile device agent with visual perception. arXiv preprint arXiv:2401.16158, 2024 a

work page internal anchor Pith review Pith/arXiv arXiv 2024
[38]

and Liu, B

Wang, X. and Liu, B. Oscar: Operating system control via state-aware reasoning and re-planning. arXiv preprint arXiv:2410.18963, 2024

work page arXiv 2024
[39]

Ponder & press: Advancing visual GUI agent towards general computer control

Wang, Y., Zhang, H., Tian, J., and Tang, Y. Ponder & press: Advancing visual GUI agent towards general computer control. In Che, W., Nabende, J., Shutova, E., and Pilehvar, M. T. (eds.), Findings of the Association for Computational Linguistics: ACL 2025, pp.\ 1461--1473, Vienna, Austria, July 2025 b . Association for Computational Linguistics. ISBN 979-8...

work page doi:10.18653/v1/2025.findings-acl.76 2025
[40]

History-aware reasoning for gui agents

Wang, Z., Yang, L., Tang, X., Zhou, S., Chen, D., Jiang, W., and Li, Y. History-aware reasoning for gui agents. arXiv preprint arXiv:2511.09127, 2025 c

work page arXiv 2025
[41]

Agent Workflow Memory

Wang, Z. Z., Mao, J., Fried, D., and Neubig, G. Agent workflow memory, 2024 b . URL https://arxiv.org/abs/2409.07429

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

Look before you leap: A gui-critic-r1 model for pre-operative error diagnosis in gui automation, 2025

Wanyan, Y., Zhang, X., Xu, H., Liu, H., Wang, J., Ye, J., Kou, Y., Yan, M., Huang, F., Yang, X., Dong, W., and Xu, C. Look before you leap: A gui-critic-r1 model for pre-operative error diagnosis in gui automation, 2025. URL https://arxiv.org/abs/2506.04614

work page arXiv 2025
[43]

Auto-scaling continuous memory for gui agent.arXiv preprint arXiv:2510.09038, 2025

Wu, W., Zhou, K., Yuan, R., Yu, V., Wang, S., Hu, Z., and Huang, B. Auto-scaling continuous memory for gui agent. arXiv preprint arXiv:2510.09038, 2025

work page arXiv 2025
[44]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P. P., et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction

Xu, Y., Wang, Z., Wang, J., Lu, D., Xie, T., Saha, A., Sahoo, D., Yu, T., and Xiong, C. Aguvis: Unified pure vision agents for autonomous gui interaction, 2025. URL https://arxiv.org/abs/2412.04454

work page internal anchor Pith review Pith/arXiv arXiv 2025
[46]

Aria-ui: Visual grounding for gui instructions, 2025

Yang, Y., Wang, Y., Li, D., Luo, Z., Chen, B., Huang, C., and Li, J. Aria-ui: Visual grounding for gui instructions, 2025. URL https://arxiv.org/abs/2412.16256

work page arXiv 2025
[47]

Webshop: Towards scalable real-world web interaction with grounded language agents

Yao, S., Chen, H., Yang, J., and Narasimhan, K. Webshop: Towards scalable real-world web interaction with grounded language agents. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A. (eds.), Advances in Neural Information Processing Systems, volume 35, pp.\ 20744--20757. Curran Associates, Inc., 2022

[48]

Ye, J., Zhang, X., Xu, H., Liu, H., Wang, J., Zhu, Z., Zheng, Z., Gao, F., Cao, J., Lu, Z., et al. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025

[49]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Yu, Q., Zhang, Z., Zhu, R., Yuan, Y., Zuo, X., Yue, Y., Dai, W., Fan, T., Liu, G., Liu, L., Liu, X., Lin, H., Lin, Z., Ma, B., Sheng, G., Tong, Y., Zhang, C., Zhang, M., Zhang, W., Zhu, H., Zhu, J., Chen, J., Chen, J., Wang, C., Yu, H., Song, Y., Wei, X., Zhou, H., Liu, J., Ma, W.-Y., Zhang, Y.-Q., Yan, L., Qiao, M., Wu, Y., and Wang, M. Dapo: An open-sou...

[50]

Yuan, X., Zhang, J., Li, K., Cai, Z., Yao, L., Chen, J., Wang, E., Hou, Q., Chen, J., Jiang, P.-T., et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370, 2025

[51]

Zhang, C., Feng, E., Zhao, X., Zhao, Y., Gong, W., Sun, J., Du, D., Hua, Z., Xia, Y., and Chen, H. Mobiagent: A systematic framework for customizable mobile agents, 2025a. URL https://arxiv.org/abs/2509.00531

[52]

Zhang, J., Yu, Y.-Q., Liao, M., Li, W., Wu, J., and Wei, Z. UI-hawk: Unleashing the screen stream understanding for mobile GUI agents. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 18217–18236, Suzhou, China, November 2025b. Associ...

doi:10.18653/v1/2025.emnlp-main.920
[53]

Zhou, H., Li, X., Wang, R., Cheng, M., Zhou, T., and Hsieh, C.-J. R1-zero's "aha moment" in visual reasoning on a 2b non-sft model, 2025a. URL https://arxiv.org/abs/2503.05132

[54]

Zhou, H., Zhang, X., Tong, P., Zhang, J., Chen, L., Kong, Q., Cai, C., Liu, C., Wang, Y., Zhou, J., et al. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047, 2025b


