SE-GA: Memory-Augmented Self-Evolution for GUI Agents
Pith reviewed 2026-05-19 20:34 UTC · model grok-4.3
The pith
The SE-GA framework lets GUI agents self-evolve by retrieving memories at test time and retraining on the resulting data to reach higher success rates on multi-step tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that combining Test-Time Memory Extension (TTME), which dynamically retrieves episodic, semantic, and experiential memories, with Memory-Augmented Self-Evolution (MASE), which trains on the data gathered during inference, produces state-of-the-art results: 89.0% success on ScreenSpot, 75.8% on AndroidControl-High, and clear gains on AndroidWorld.
What carries the argument
Test-Time Memory Extension (TTME) for retrieving hierarchical memories to support long-term planning during inference, paired with Memory-Augmented Self-Evolution (MASE) as the training pipeline that uses TTME data to improve the base policy.
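To make the retrieval idea concrete, here is a minimal sketch of pulling one item from each of the three memory types for a given task query. It is an illustration only, not the authors' implementation: the `MemoryItem` type, the `retrieve` function, the bag-of-words cosine similarity, and the toy memory entries are all assumptions introduced here, standing in for whatever encoder and store the paper actually uses.

```python
import math
from collections import Counter
from dataclasses import dataclass


@dataclass
class MemoryItem:
    kind: str  # "episodic", "semantic", or "experiential"
    text: str


def _bow(text):
    # Bag-of-words vector; a stand-in for a learned embedding (assumption).
    return Counter(text.lower().split())


def _cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query, store, k_per_kind=1):
    """Return the k most similar items of each memory kind for the query."""
    q = _bow(query)
    out = {}
    for kind in ("episodic", "semantic", "experiential"):
        items = [m for m in store if m.kind == kind]
        ranked = sorted(items, key=lambda m: _cosine(q, _bow(m.text)), reverse=True)
        out[kind] = ranked[:k_per_kind]
    return out


# Toy memory store (hypothetical entries, for illustration only).
store = [
    MemoryItem("episodic", "opened settings app and tapped wifi toggle"),
    MemoryItem("episodic", "searched contacts for alice"),
    MemoryItem("semantic", "the wifi toggle lives under network settings"),
    MemoryItem("experiential", "scrolling too fast skips the wifi entry in settings"),
]
context = retrieve("turn on wifi in settings", store)
```

The retrieved items would then be placed in the agent's prompt as extra context. Retrieving per memory kind, rather than from one pooled store, keeps any single kind from crowding out the others.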
Load-bearing premise
The data collected by TTME during inference is of sufficient quality and diversity to stabilize and enhance the foundational policy through the MASE training pipeline without introducing harmful biases or catastrophic forgetting.
What would settle it
If running MASE training on TTME-collected data produces no improvement or causes performance drops on the original benchmarks, that would show the self-evolution mechanism does not deliver the claimed benefits.
Original abstract
Autonomous Graphical User Interface (GUI) agents often struggle with multi-step tasks due to constrained context windows and static policies that fail to adapt to dynamic environments. To address these limitations, this work proposes the Self-Evolving GUI Agent (SE-GA), a novel framework that integrates hierarchical memory structures with an iterative self-improvement mechanism. At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories to provide salient contexts during inference. To ensure continuous learning, we introduce Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy. Extensive evaluations across both offline and online benchmarks demonstrate SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset. Furthermore, significant improvements on the AndroidWorld benchmark highlight the superior generalization to dynamic environments. Open source code: https://github.com/jinshilong-dev/SE-GA
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SE-GA, a GUI agent framework combining Test-Time Memory Extension (TTME) for hierarchical (episodic/semantic/experiential) memory retrieval during inference with Memory-Augmented Self-Evolution (MASE), a training pipeline that uses TTME-collected trajectories to iteratively refine the base policy. It reports state-of-the-art empirical results: 89.0% success on ScreenSpot, 75.8% on AndroidControl-High, and gains on AndroidWorld, attributing improvements to better long-term planning and continuous adaptation in dynamic environments.
Significance. If the empirical claims hold after proper validation, the work could meaningfully advance autonomous GUI agents by addressing context-window limits and static-policy brittleness through memory-augmented self-improvement. The open-sourced code is a positive factor for reproducibility.
major comments (2)
- [MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.
- [Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.
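As an illustration of the kind of significance test the second major comment asks for, the sketch below runs an exact McNemar test on paired per-task outcomes, which suits the case where two agents are evaluated on the same task set. The function name `mcnemar_exact` and the outcome vectors are assumptions made up for this example; they are synthetic and are not results from the paper.

```python
from math import comb


def mcnemar_exact(baseline, candidate):
    """Two-sided exact McNemar test on paired per-task success flags."""
    if len(baseline) != len(candidate):
        raise ValueError("both agents must be scored on the same tasks")
    # Discordant pairs: tasks where exactly one of the two agents succeeded.
    b = sum(1 for x, y in zip(baseline, candidate) if x and not y)
    c = sum(1 for x, y in zip(baseline, candidate) if y and not x)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null, discordant outcomes split 50/50 (binomial with p = 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Synthetic per-task outcomes (1 = success), for illustration only.
baseline_runs = [1, 0, 0, 1, 0, 0, 1, 0, 0, 0]
candidate_runs = [1, 1, 1, 1, 1, 1, 1, 0, 1, 1]
p_value = mcnemar_exact(baseline_runs, candidate_runs)
```

A paired test is used here because both agents see identical tasks, so only the tasks on which they disagree carry information about which one is better.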
minor comments (1)
- [Abstract] The abstract mentions 'extensive evaluations across both offline and online benchmarks' but does not name the full set of benchmarks or datasets used.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each of the major comments below and describe the revisions we intend to incorporate in the updated version.
-
Referee: [MASE training pipeline (Methods section)] The MASE pipeline description provides no quality filters, reward signals, trajectory verification, or monitoring for policy drift/forgetting. This is load-bearing for the central claim because the reported SOTA numbers (89.0% ScreenSpot, 75.8% AndroidControl-High) presuppose that TTME-generated data reliably improves rather than degrades the policy; without such safeguards, error compounding in GUI trajectories could undermine the self-evolution results.
Authors: We agree with the referee that the MASE pipeline description in the Methods section would be strengthened by including explicit details on quality control mechanisms. Although the current manuscript focuses on the overall framework, we will revise the paper to add a description of how TTME-collected trajectories are filtered for quality (retaining only those that successfully complete the task), the implicit reward signal derived from task success rates, verification through execution outcome logs, and monitoring for policy drift by periodically evaluating the evolved policy on a separate validation set. These additions will help demonstrate that the self-evolution process reliably improves the policy without degradation.
Revision: yes
-
Referee: [Experiments and results section] The experimental results state benchmark numbers but supply no statistical significance tests, exact baseline re-implementation details, or ablations isolating TTME/MASE contributions. This weakens attribution of gains to the proposed mechanisms.
Authors: We acknowledge that the Experiments and results section can be improved for better rigor. In the revised manuscript, we will include statistical significance tests, such as reporting p-values from appropriate tests comparing SE-GA to baselines. We will also provide more precise details on how baselines were re-implemented, including any specific adaptations to match our evaluation setup. Furthermore, we will add ablation experiments that isolate the contributions of TTME and MASE by comparing variants with and without each component. These revisions will better support the attribution of performance gains to the proposed mechanisms.
Revision: yes
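The safeguards named in the first response (keeping only successful trajectories, and checking the evolved policy against a separate validation set) can be sketched roughly as follows. This is a hedged illustration under stated assumptions, not the MASE pipeline itself: `Trajectory`, `filter_successful`, `drift_detected`, `self_evolve_round`, the `tolerance` value, and the stub `train` and `evaluate` callables are all hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    task_id: str
    steps: list
    success: bool  # taken from the execution outcome log (assumption)


def filter_successful(trajectories):
    """Keep only trajectories that completed their task."""
    return [t for t in trajectories if t.success]


def drift_detected(baseline_acc, evolved_acc, tolerance=0.01):
    """Flag a round whose validation accuracy falls more than `tolerance` below baseline."""
    return evolved_acc < baseline_acc - tolerance


def self_evolve_round(trajectories, baseline_acc, evaluate, train):
    """Run one filtered training round; reject the candidate policy if it drifts."""
    data = filter_successful(trajectories)
    candidate = train(data)
    acc = evaluate(candidate)
    accepted = not drift_detected(baseline_acc, acc)
    return (candidate if accepted else None), acc, len(data)


# Stub usage: `train` and `evaluate` are placeholders for real training and
# held-out evaluation; the numbers are synthetic.
trajs = [
    Trajectory("a", ["tap wifi"], True),
    Trajectory("b", ["tap back"], False),
]
policy, acc, n_kept = self_evolve_round(
    trajs,
    baseline_acc=0.70,
    evaluate=lambda p: 0.72,
    train=lambda d: {"n_examples": len(d)},
)
```

Rejecting a candidate outright is the simplest response to drift; a real pipeline might instead roll back to an earlier checkpoint or mix in original training data, choices the source does not specify.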
Circularity Check
No significant circularity in derivation chain
The paper describes an empirical framework (SE-GA) that combines Test-Time Memory Extension (TTME) for dynamic memory retrieval during inference with Memory-Augmented Self-Evolution (MASE) to iteratively improve the base policy using TTME-collected trajectories. Central claims consist of measured success rates on external public benchmarks (89.0% on ScreenSpot, 75.8% on AndroidControl-High, gains on AndroidWorld) rather than any internally defined quantities that reduce to fitted parameters or self-referential definitions by construction. No equations, uniqueness theorems, or self-citation chains are invoked to force the reported outcomes; the approach follows standard practices of training on generated data without the prediction step being statistically forced by the input fit itself. The derivation remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "At the core of our approach is Test-Time Memory Extension (TTME), which facilitates long-term planning by dynamically retrieving episodic, semantic, and experiential memories... Memory-Augmented Self-Evolution (MASE), which is a training pipeline that adopts the data collected by TTME to stabilize and enhance the agent's foundational policy."
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "SE-GA achieves state-of-the-art performance, reaching success rates of 89.0% on ScreenSpot and 75.8% on the challenging AndroidControl-High dataset"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.