pith. machine review for the scientific record.

arxiv: 2604.27955 · v1 · submitted 2026-04-30 · 💻 cs.AI · cs.CV

Recognition: unknown

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants


Pith reviewed 2026-05-07 05:43 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords: GUI agents · reinforcement learning · taxonomy · reward engineering · world models · digital inhabitants · offline RL · hybrid strategies

The pith

The survey organizes reinforcement-learning methods for GUI agents into offline, online, and hybrid categories on the path to digital inhabitants.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

GUI agents perceive and act on computer screens through visual interfaces, yet supervised training alone cannot manage extended sequences of actions or adapt safely to changing conditions. This paper reviews the role of reinforcement learning in overcoming those limits by surveying current methods and grouping them under a new taxonomy. The taxonomy places approaches into offline learning from stored data, online learning through live interaction, and hybrid combinations of the two. It further examines how rewards are crafted, how training data can be used more efficiently, and which technical choices are gaining traction. If the patterns hold, GUI agents could progress from narrow automation tools to more autonomous systems capable of sustained operation in digital environments.

Core claim

The authors establish that reinforcement learning is central to GUI agent development because supervised fine-tuning cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments. They introduce a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and they complement the taxonomy with examinations of reward engineering, data efficiency, and key technical innovations. Their analysis identifies emerging trends: composite multi-tier reward architectures that address the reliability-scalability tension, world-model-based training that overcomes GUI I/O latency, and the spontaneous emergence of System-2-style deliberation, which suggests that explicit reasoning supervision may be unnecessary when sufficiently rich reward signals are available.

What carries the argument

A three-category taxonomy of Offline RL, Online RL, and Hybrid Strategies that classifies reinforcement learning methods for GUI agents and supports analysis of reward architectures and world models.

If this is right

  • Composite multi-tier reward architectures can balance reliability and scalability in agent training.
  • World-model-based training can deliver substantial performance gains by reducing reliance on slow GUI input-output operations.
  • Rich reward signals can produce emergent System-2 deliberation without the need for explicit reasoning supervision.
  • Future progress depends on advancing process rewards, continual reinforcement learning, cognitive architectures, and safe deployment practices.
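The composite-reward trend above is easiest to see in code. A minimal sketch of a two-tier reward, with invented function names, weights, and fields (nothing here comes from the paper; it only illustrates blending a reliable rule check with a scalable learned score):

```python
# Hypothetical two-tier composite reward for a GUI agent episode.
# Tier 1 is exact but narrow; tier 2 stands in for a learned judge.

def rule_reward(final_state: dict, goal: dict) -> float:
    """Tier 1: reliable rule check -- did the target field reach the goal value?"""
    return 1.0 if final_state.get(goal["field"]) == goal["value"] else 0.0

def model_reward(trajectory: list) -> float:
    """Tier 2: toy proxy for a learned trajectory scorer; here it simply
    prefers shorter episodes. A real system would call a reward model."""
    return max(0.0, 1.0 - 0.05 * len(trajectory))

def composite_reward(final_state: dict, goal: dict, trajectory: list,
                     w_rule: float = 0.7, w_model: float = 0.3) -> float:
    """Blend the tiers: reliability from rules, scalability from the model."""
    return (w_rule * rule_reward(final_state, goal)
            + w_model * model_reward(trajectory))

# Example: the agent set the right field in a 4-step episode.
r = composite_reward({"username": "alice"},
                     {"field": "username", "value": "alice"},
                     ["click", "type", "type", "submit"])
```

The weights encode the reliability-scalability trade: pushing `w_model` up generalizes further but inherits the judge's noise.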

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agents built under this taxonomy could eventually transfer skills across different software applications without full retraining.
  • The three-way split may highlight gaps where new hybrid designs are needed to combine data efficiency with real-time adaptability.
  • Emergent deliberation implies that complex internal reasoning could arise naturally once reward signals are sufficiently detailed.
  • Safe deployment methods will be required once agents begin handling real user data and irreversible actions on personal devices.

Load-bearing premise

The existing literature on reinforcement learning for GUI agents can be comprehensively and accurately captured by the proposed offline, online, and hybrid taxonomy without significant omissions or selection bias.

What would settle it

A search that identifies multiple prominent RL-based GUI agent papers that cannot be placed in any of the three taxonomy categories, or experiments showing that world-model training produces no performance gains over direct GUI interaction.
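The world-model half of that test is, in principle, a timing comparison. A toy sketch, with everything invented for illustration: the "world model" is a free transition function, and the "real GUI" step carries a simulated screenshot-and-input delay.

```python
# Illustrative timing of rollouts against a real GUI vs. a learned world model.
import time

def real_gui_step(state: list, action: str) -> list:
    time.sleep(0.01)  # stand-in for screenshot capture + input latency
    return state + [action]

def world_model_step(state: list, action: str) -> list:
    return state + [action]  # learned transition: no I/O at all

def rollout(step_fn, n_steps: int = 10):
    state, t0 = [], time.perf_counter()
    for i in range(n_steps):
        state = step_fn(state, f"a{i}")
    return state, time.perf_counter() - t0

traj_real, t_real = rollout(real_gui_step)
traj_model, t_model = rollout(world_model_step)
```

The gap is where the claimed performance gains would come from; the settling experiment asks whether they survive the world model's prediction error.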

Figures

Figures reproduced from arXiv: 2604.27955 by Dazhao Du, Jian Li, Jian Liu, Jiarui Hu, Jingxiang Lai, Junan Hu, Shuang Chen, Song Guo, Yiwei Sheng.

Figure 1: Overview of the survey structure. Sections 1 and 2 introduce the background of GUI agent; Sec… view at source ↗
Figure 2: Overview of the RL training pipeline for GUI agents. The agent perceives the GUI environment… view at source ↗
Figure 3: A taxonomy of representative GUI agent papers organised along five dimensions… view at source ↗
Figure 4: Timeline of GUI Agent Development. Grounding-specialized models. Visual grounding—precise mapping from natural language to screen coordinates—has emerged as a specialized focus for RL optimization, building on foundational work in universal visual grounding (Gou et al., 2024), unified pure vision agents (Xu et al., 2024d; Chen et al., 2026c), Aria-UI (Yang et al., 2025c), and Phi-Ground (Zhang et al., 2025… view at source ↗
Figure 5: The Reward Engineering Pyramid balances accuracy and generality for GUI Agents: rule-based… view at source ↗
Figure 6: This pyramid depicts a four-stage data-training pipeline for agent capability, progressing from… view at source ↗
Figure 7: An asynchronous distributed architecture for GUI RL agent training, addressing slow environment… view at source ↗
Original abstract

Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to offer the first comprehensive overview of reinforcement learning applied to GUI agents. It introduces a principled taxonomy categorizing methods into Offline RL, Online RL, and Hybrid Strategies. Complementary analyses cover reward engineering, data efficiency, and technical innovations. Emerging trends identified include the use of composite multi-tier reward architectures to balance reliability and scalability, a shift to world-model-based training to mitigate GUI I/O latency issues, and the spontaneous emergence of System-2-style deliberation from rich reward signals. The work concludes with a roadmap for future directions such as process rewards, continual RL, cognitive architectures, and safe deployment toward digital inhabitants.

Significance. If the proposed taxonomy proves to be comprehensive and the trends are based on a representative sample of the literature, this survey would be highly significant. It would provide a structured framework for understanding the current state of RL in GUI agents, highlight key challenges and innovations, and offer a forward-looking roadmap that could guide research in developing more advanced, autonomous GUI agents. This could accelerate progress in the field by standardizing approaches and identifying promising research avenues. The focus on evolving toward 'digital inhabitants' adds a visionary aspect that may inspire new lines of inquiry.

major comments (2)
  1. [Taxonomy (Section 3)] The central claim rests on a 'principled taxonomy' that partitions existing RL-for-GUI methods into Offline RL, Online RL, and Hybrid Strategies. However, the manuscript does not specify the categorization criteria, provide a mapping table of all cited methods to these categories, or discuss potential overlaps or methods that fall outside these bins (e.g., pure imitation learning approaches). This is load-bearing for the comprehensiveness claim and the validity of the derived trends.
  2. [Literature Review Methodology (Introduction or Section 2)] No search protocol, databases, keywords, date range, or inclusion criteria are described. This omission is critical because the trends (composite rewards, world-model shift, emergent deliberation) depend on the selected literature being representative without selection bias. The paper should either detail the review process or acknowledge limitations in coverage.
minor comments (2)
  1. [Abstract] The abstract mentions 'analyses of reward engineering, data efficiency, and key technical innovations' but the corresponding sections could benefit from more quantitative comparisons or summary tables to support the qualitative trends.
  2. [Roadmap section] The roadmap is presented at a high level; including specific open problems or example research questions would enhance its utility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their constructive and insightful review. We appreciate the positive assessment of the paper's potential significance in providing a structured framework for RL applied to GUI agents and the forward-looking roadmap toward digital inhabitants. We agree that the taxonomy requires explicit criteria and a mapping to support the comprehensiveness claim, and that the literature review process should be documented to address potential selection bias concerns. We will undertake major revisions to incorporate these elements, as detailed in our point-by-point responses below.

Point-by-point responses
  1. Referee: [Taxonomy (Section 3)] The central claim rests on a 'principled taxonomy' that partitions existing RL-for-GUI methods into Offline RL, Online RL, and Hybrid Strategies. However, the manuscript does not specify the categorization criteria, provide a mapping table of all cited methods to these categories, or discuss potential overlaps or methods that fall outside these bins (e.g., pure imitation learning approaches). This is load-bearing for the comprehensiveness claim and the validity of the derived trends.

    Authors: We agree that explicitly defining the categorization criteria is necessary to substantiate the taxonomy and derived trends. In the revised manuscript, we will add a dedicated subsection in Section 3 ('Taxonomy Criteria and Scope') that specifies the partitioning rules: Offline RL covers methods that train exclusively on pre-collected, static GUI interaction datasets using offline algorithms without further live interaction (to avoid unsafe exploration in irreversible environments); Online RL includes methods that perform iterative, direct interaction with GUI environments (real or simulated) during training; and Hybrid Strategies combine both phases, such as offline pre-training on large-scale data followed by online fine-tuning or adaptation. We will insert a new mapping table (Table 1) that enumerates all cited methods, assigns each to its primary category with a one-sentence justification, and flags any boundary cases. Regarding overlaps and out-of-scope methods, we will add explicit discussion noting that pure imitation learning approaches are classified under Offline RL when they rely on demonstration data for behavioral cloning without reward-driven optimization, but hybrid cases (e.g., imitation-initialized RL) are placed in Hybrid; methods falling outside (e.g., non-RL supervised fine-tuning or non-GUI agents) are excluded per the survey scope focused on RL for GUI agents. These additions will directly support the validity of trends such as composite rewards and world-model shifts. revision: yes
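The partitioning rules promised in that response reduce to a small decision procedure. A hedged sketch, with attribute names invented for illustration (the authors' Table 1 would be the authoritative mapping):

```python
# Toy classifier for the Offline / Online / Hybrid taxonomy described above.
# A method is characterized by whether it trains on pre-collected static data
# and whether it interacts with a live GUI environment during training.

def classify(method: dict) -> str:
    offline = method.get("trains_on_static_data", False)
    online = method.get("interacts_during_training", False)
    if offline and online:
        return "Hybrid"          # e.g. offline pre-training + online fine-tuning
    if online:
        return "Online RL"       # iterative live interaction (real or simulated)
    if offline:
        return "Offline RL"      # includes imitation on demonstrations alone
    return "out of scope"        # e.g. non-RL supervised fine-tuning

label = classify({"trains_on_static_data": True,
                  "interacts_during_training": True})
```

Boundary cases (imitation-initialized RL, say) land in Hybrid under these rules, which is exactly the kind of assignment the proposed mapping table would have to justify per method.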

  2. Referee: [Literature Review Methodology (Introduction or Section 2)] No search protocol, databases, keywords, date range, or inclusion criteria are described. This omission is critical because the trends (composite rewards, world-model shift, emergent deliberation) depend on the selected literature being representative without selection bias. The paper should either detail the review process or acknowledge limitations in coverage.

    Authors: We acknowledge that documenting the review methodology is essential for a survey claiming comprehensiveness and to mitigate concerns about selection bias in the identified trends. In the revised manuscript, we will insert a new subsection titled 'Literature Review Methodology' (placed in Section 2 or immediately following the introduction) that details: databases searched (arXiv, Google Scholar, IEEE Xplore, ACL Anthology, and major conference proceedings from 2020 onward); search keywords and Boolean queries (e.g., ('GUI agent' OR 'graphical user interface agent') AND ('reinforcement learning' OR 'RL') AND ('visual' OR 'screenshot')); date range (primarily January 2018–December 2024, capturing the emergence of modern GUI agents while including foundational works); and inclusion criteria (empirical papers applying RL to GUI agents with visual perception-action loops, reporting metrics on task completion or efficiency; exclusion of pure prompting/LLM-only methods without learning, non-visual interfaces, or non-agent UI design papers). We will also add a limitations paragraph noting that while the sampled literature represents prominent and highly-cited works in the field, the rapidly evolving nature of the area may omit the most recent preprints, and trends are derived from this representative but not exhaustive set. This will strengthen confidence in the analyses of reward engineering, data efficiency, and emerging patterns. revision: yes
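The inclusion criteria in that response likewise reduce to a predicate over paper metadata. A sketch with illustrative keys; a real protocol would run this over database query results rather than hand-built dicts:

```python
# Hypothetical encoding of the survey's proposed inclusion criteria.
# Keys are invented; the date range follows the rebuttal (2018-2024).

def include(paper: dict) -> bool:
    in_range = 2018 <= paper.get("year", 0) <= 2024
    uses_rl = paper.get("uses_rl", False)                  # excludes prompting-only
    visual_gui = paper.get("visual_gui_agent", False)      # excludes non-visual UIs
    has_metrics = paper.get("reports_task_metrics", False) # task completion/efficiency
    return in_range and uses_rl and visual_gui and has_metrics

kept = include({"year": 2023, "uses_rl": True,
                "visual_gui_agent": True, "reports_task_metrics": True})
dropped = include({"year": 2023, "uses_rl": False,  # LLM prompting only
                   "visual_gui_agent": True, "reports_task_metrics": True})
```

Making the filter explicit is what lets a reader audit the representativeness claim the trends depend on.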

Circularity Check

0 steps flagged

No circularity: survey taxonomy organizes external literature without self-referential derivation

Full rationale

This paper is a literature survey that proposes a taxonomy (Offline RL, Online RL, Hybrid Strategies) to organize existing published methods and identifies interpretive trends in reward engineering and world models. No new mathematical derivations, fitted parameters, or predictions are generated from the paper's own data or definitions. The central claims rest on citations to external prior work rather than reducing to self-citations, ansatzes, or fitted inputs presented as novel results. No equations or self-definitional loops appear in the provided abstract or framing. The absence of an explicit search protocol affects completeness but does not create circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claims rest on the assumption that supervised fine-tuning is insufficient for GUI agent challenges and that a three-category taxonomy plus trend analysis can comprehensively capture the field; these are domain assumptions and author-proposed structures without independent verification from the abstract.

axioms (2)
  • domain assumption Supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments
    Invoked in the abstract to position RL as central for GUI agents.
  • ad hoc to paper Existing methods can be organized into a principled taxonomy of Offline RL, Online RL, and Hybrid Strategies
    Proposed by the authors as the organizing framework for the overview.
invented entities (1)
  • digital inhabitants (no independent evidence)
    purpose: To frame the long-term goal of advanced, autonomous GUI agents
    New conceptual term introduced to describe the evolutionary direction of the research.

pith-pipeline@v0.9.0 · 5533 in / 1496 out tokens · 38123 ms · 2026-05-07T05:43:46.207186+00:00 · methodology


Reference graph

Works this paper leans on

88 extracted references · 82 canonical work pages · 25 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,

  2. [2]

    Automated reinforcement learning: An overview.arXiv preprint arXiv:2201.05000,

    Reza Refaei Afshar, Yingqian Zhang, Joaquin Vanschoren, and Uzay Kaymak. Automated reinforcement learning: An overview.arXiv preprint arXiv:2201.05000,

  3. [3]

    Agent S: An Open Agentic Framework that Uses Computers Like a Human

    Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent S: An open agentic framework that uses computers like a human. arXiv preprint arXiv:2410.08164,

  4. [4]

    Agent S2: A compositional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906, 2025

    Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A composi- tional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906,

  5. [5]

    Agent-X: Evaluating Deep Multimodal Reasoning in Vision-Centric Agentic Tasks

    Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, et al. Agent-X: Evaluating deep multimodal reasoning in vision-centric agentic tasks. arXiv preprint arXiv:2505.24876,

  6. [6]

    Digi-Q: Learning Q-Value Functions for Training Device-Control Agents

    Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, and Aviral Kumar. Digi-Q: Learning Q-value functions for training device-control agents. arXiv preprint arXiv:2502.15760, 2025a.

  7. [7]

    Terminal Agents Suffice for Enterprise Automation

    Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, and Sai Rajeswar. Terminal agents suffice for enterprise automation.arXiv preprint arXiv:2604.00073,

  8. [8]

    Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

    Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows Agent Arena: Evaluating multi-modal OS agents at scale. arXiv preprint arXiv:2409.08264,

  9. [9]

    SmartSnap: Proactive Evidence Seeking for Self-Verifying Agents

    S. Cai, Y. Qin, H. Lin, Z. Xu, G. Li, Y. Shi, Z. Li, Y. Mao, S. Cai, X. Tan, Y. Liang, K. Li, and X. Sun. SmartSnap: Proactive evidence seeking for self-verifying agents. arXiv preprint arXiv:2512.22322,

  10. [10]

    GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

    C. Chen, J. Shao, D. Lu, H. Hu, X. Liu, H. Yao, and W. Liu. GUI-Eyes: Tool-augmented perception for visual grounding in GUI agents. arXiv preprint arXiv:2601.09770, 2026a.

  11. [11]

    Unify-Agent: A Unified Multimodal Agent for World-Grounded Image Synthesis

    Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-Agent: A unified multimodal agent for world-grounded image synthesis. arXiv preprint arXiv:2603.29620, 2026c.

  12. [12]

    Elmur: External layer memory with up- date/rewrite for long-horizon rl.arXiv preprint arXiv:2510.07151,

    Egor Cherepanov, Alexey K Kovalev, and Aleksandr I Panov. Elmur: External layer memory with up- date/rewrite for long-horizon rl.arXiv preprint arXiv:2510.07151,

  13. [13]

    The browsergym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,

    De Chezelles, Thibault Le Sellier, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The browsergym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,

  14. [14]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

  15. [15]

    Agentic reward modeling: Verifying gui agent via online proactive interaction.arXiv preprint arXiv:2602.00575,

    Chaoqun Cui, Jing Huang, Shijing Wang, Liming Zheng, Qingchao Kong, and Zhixiong Zeng. Agentic reward modeling: Verifying gui agent via online proactive interaction.arXiv preprint arXiv:2602.00575,

  16. [16]

    Process Reinforcement through Implicit Rewards

    Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456,

  17. [17]

    G. Dai, S. Jiang, T. Cao, Y. Yang, Y. Li, R. Tan, M. Li, and L. Qiu. Prore: A proactive reward system for gui agents via reasoner–actor collaboration.arXiv preprint arXiv:2509.21823,

  18. [18]

    Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar

    [Accessed 09-02-2026]. Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. InProceedings of the 30th annual ACM symposium on user interface software and technology, pp. 845–854,

  19. [19]

    Simura: A world-model-driven simulative reasoning architecture for general goal-oriented agents.arXiv preprint arXiv:2507.23773,

    Mingkai Deng, Jinyu Hou, Zhiting Hu, and Eric Xing. Simura: A world-model-driven simulative reasoning architecture for general goal-oriented agents.arXiv preprint arXiv:2507.23773,

  20. [20]

    DynaWeb: Model-Based Reinforcement Learning of Web Agents

    Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, and Lei Yu. Dynaweb: Model-based reinforcement learning of web agents.arXiv preprint arXiv:2601.22149,

  21. [21]

    Agentic Entropy-Balanced Policy Optimization

    Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, et al. Agentic entropy-balanced policy optimization. arXiv preprint arXiv:2510.14545,

  22. [22]

    Memr3: Memory retrieval via reflective reasoning for llm agents.arXiv preprint arXiv:2512.20237,

    Xingbo Du, Loka Li, Duzhen Zhang, and Le Song. Memr3: Memory retrieval via reflective reasoning for llm agents.arXiv preprint arXiv:2512.20237,

  23. [23]

    Plan-and-Act: Improving Planning of Agents for Long-Horizon Tasks

    Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-Act: Improving planning of agents for long-horizon tasks. arXiv preprint arXiv:2503.09572,

  24. [24]

    Gui-bee: Align gui action grounding to novel environments via autonomous exploration.arXiv preprint arXiv:2501.13896,

    Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, and Gang Wu. Gui-bee: Align gui action grounding to novel environments via autonomous exploration.arXiv preprint arXiv:2501.13896,

  25. [25]

    Group-in-Group Policy Optimization for LLM Agent Training

    Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978,

  26. [26]

    Mano Technical Report

    Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, et al. Mano technical report. arXiv preprint arXiv:2509.17336, 2025a.

  27. [27]

    Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis.arXiv preprint arXiv:2507.04370,

    Yifei Gao, Junhong Ye, Jiaqi Wang, and Jitao Sang. Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis.arXiv preprint arXiv:2507.04370,

  28. [28]

    End-to-end navigation with vision language models: Transforming spatial reasoning into question-answering,

    Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vision lan- guage models: Transforming spatial reasoning into question-answering.arXiv preprint arXiv:2411.05755,

  29. [29]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243,

  30. [30]

    Is Your LLM Secretly a World Model of the Internet? Model-Based Planning for Web Agents

    Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your LLM secretly a world model of the internet? Model-based planning for web agents. arXiv preprint arXiv:2411.06559,

  31. [31]

    Seed1.5-VL Technical Report

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. Nature, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report…

  32. [32]

    Hierarchy-of-groups policy optimization for long-horizon agentic tasks.arXiv preprint arXiv:2602.22817,

    Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, and Bo An. Hierarchy-of-groups policy optimization for long-horizon agentic tasks.arXiv preprint arXiv:2602.22817,

  33. [33]

    A data-driven approach for learning to control computers.arXiv preprint arXiv:2202.08137,

    Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers.arXiv preprint arXiv:2202.08137,

  34. [34]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,

  35. [35]

    Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563, 2025

    Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563,

  36. [36]

    Enhancing Cooperative Multi-Agent Reinforcement Learning with State Modelling and Adversarial Exploration

    Andreas Kontogiannis, Konstantinos Papathanasiou, Yi Shen, Giorgos Stamou, Michael M Zavlanos, and George Vouros. Enhancing cooperative multi-agent reinforcement learning with state modelling and adversarial exploration. arXiv preprint arXiv:2505.05262,

  37. [37]

    Os-harm: A benchmark for measuring safety of computer use agents

    Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents.arXiv preprint arXiv:2506.14866,

  38. [38]

    Computerrl: Scaling end-to-end online reinforcement learning for computer use agents.arXiv preprint arXiv:2508.14040,

    Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents.arXiv preprint arXiv:2508.14040,

  39. [39]

    Nested Browser-Use Learning for Agentic Information Seeking

    Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, et al. Nested browser-use learning for agentic information seeking. arXiv preprint arXiv:2512.23647, 2025a.

  40. [40]

    UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

    Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moor- thy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui 2: Mastering universal user interface understanding across platforms. InInternational Conference on Learning Representations (ICLR), 2024b. Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bing...

  41. [41]

    ShowUI: One Vision-Language-Action Model for GUI Visual Agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. ShowUI: One vision-language-action model for GUI visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19498–19508,

  42. [42]

    InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection

    Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. InfiGUIAgent: A multimodal generalist GUI agent with native reasoning and reflection. arXiv preprint arXiv:2501.04575, 2025b.

  43. [43]

    Webchorearena: Evaluating web browsing agents on realistic tedious web tasks.arXiv preprint arXiv:2506.01952,

    Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, et al. Webchorearena: Evaluating web browsing agents on realistic tedious web tasks.arXiv preprint arXiv:2506.01952,

  44. [44]

    Playing Atari with Deep Reinforcement Learning

    Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.

  45. [45]

    WebGPT: Browser-assisted question-answering with human feedback

    Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.

  46. [46]

    Ui-vision: A desktop-centric gui benchmark for visual perception and interaction

    Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui-vision: A desktop-centric gui benchmark for visual perception and interaction. arXiv preprint arXiv:2503.15661, 2025.

  47. [47]

    Gui agents: A survey

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22522–22538, 2025.

  48. [48]

    OpenAI. Computer-using agent. https://openai.com/index/computer-using-agent/, 2025a. [Accessed 09-02-2026].

    OpenAI. Computer-using agent, 2025b. URL https://openai.com/index/computer-using-agent/.

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina ...

  49. [49]

    Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents

    Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 6300–6323, 2025.

  50. [50]

    Zehan Qi, Xiao Liu, Iat Long Iong, Hanyu Lai, Xueqiao Sun, Wenyi Zhao, Yu Yang, Xinyue Yang, Jiadai Sun, Shuntian Yao, et al

    Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments. arXiv preprint arXiv:2406.12373, 2024.

  51. [51]

    Ica: Information-aware credit assignment for visually grounded long-horizon information-seeking agents

    Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, and Xin Lou. Ica: Information-aware credit assignment for visually grounded long-horizon information-seeking agents. arXiv preprint arXiv:2602.10863, 2026.

  52. [52]

    Mapping natural language commands to web elements

    Panupong Pasupat, Tian-Shun Jiang, Evan Liu, Kelvin Guu, and Percy Liang. Mapping natural language commands to web elements. In Proceedings of the 2018 conference on empirical methods in natural language processing, pp. 4970–4976, 2018.

  53. [53]

    HiPER: Hierarchical Reinforcement Learning with Explicit Credit Assignment for Large Language Model Agents

    Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, and Mingyi Hong. Hiper: Hierarchical reinforcement learning with explicit credit assignment for large language model agents. arXiv preprint arXiv:2602.16165, 2026.

  54. [54]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019.

  55. [55]

    Agent q: Advanced reasoning and learning for autonomous ai agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents. arXiv preprint arXiv:2408.07199, 2024.

  56. [56]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025.

  57. [57]

    Scaling synthetic task generation for agents via exploration

    Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration. arXiv preprint arXiv:2509.25047, 2025.

  58. [58]

    A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions

    Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadelmann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions. arXiv preprint arXiv:2501.16150, 2025.

  59. [59]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.

  60. [60]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

  61. [61]

    Falcon-ui: Understanding gui before following user instructions

    Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions. arXiv preprint arXiv:2412.09362, 2024.

  62. [62]

    Experiential reinforcement learning

    Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao. Experiential reinforcement learning. arXiv preprint arXiv:2602.13949, 2026.

  63. [63]

    Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment

    Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment. arXiv preprint arXiv:2507.05720, 2025.

  64. [64]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.

  65. [65]

    Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, and Tomas Pfister. Watch and learn: Learning to use computers from online videos. arXiv preprint arXiv:2510.04673, 2025a.

  66. [66]

    Magnet: Towards adaptive gui agents with memory-driven knowledge evolution

    Libo Sun, Jiwen Zhang, Siyuan Wang, and Zhongyu Wei. Magnet: Towards adaptive gui agents with memory-driven knowledge evolution. arXiv preprint arXiv:2601.19199, 2026.

  67. [67]

    F. Tang, Z. Gu, Z. Lu, X. Liu, S. Shen, C. Meng, W. Wang, W. Zhang, Y. Shen, W. Lu, J. Xiao, and Y. Zhuang. Gui-g2: Gaussian reward modeling for gui grounding. arXiv preprint arXiv:2507.15846, 2025a.

  68. [68]

    LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

    Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, et al. Lpo: Towards accurate gui agent interaction via location preference optimization. arXiv preprint arXiv:2506.09373, 2025d.

    Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhiheng Xi, Zhihui Cao, Ha...

  69. [69]

    AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

    Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. Agentprog: Empowering long-horizon gui agents with program-guided context management. arXiv preprint arXiv:2512.10371, 2025.

  70. [70]

    Sagar Gubbi Venkatesh, Partha Talukdar, and Srini Narayanan

    Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. Androidenv: A reinforcement learning platform for android. arXiv preprint arXiv:2105.13231, 2021.

  71. [71]

    UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

    Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. Ui-tars-2 technical report: Advancing gui agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025a.

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing ...

  72. [72]

    Probabilistic subgoal representations for hierarchical reinforcement learning

    Vivienne Huiling Wang, Tinghuai Wang, Wenyan Yang, Joni-Kristian Kämäräinen, and Joni Pajarinen. Probabilistic subgoal representations for hierarchical reinforcement learning. arXiv preprint arXiv:2406.16707, 2024e.

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internv...

  73. [73]

    Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, and Guojie Luo. Agent.xpu: Efficient scheduling of agentic llm workloads on heterogeneous soc. arXiv preprint arXiv:2506.24045, 2025a.

    Z. Wei, W. Yao, Y. Liu, W. Zhang, Q. Lu, L. Qiu, C. Yu, P. Xu, C. Zhang, B. Yin, H. Yun, and L. Li. Webagent-r1: Training web agents via en...

  74. [74]

    Uisim: An interactive image-based ui simulator for dynamic mobile environments

    Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, and Jindong Chen. Uisim: An interactive image-based ui simulator for dynamic mobile environments. arXiv preprint arXiv:2509.21733, 2025.

  75. [75]

    Webworld: A large-scale world model for web agent training

    Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. Webworld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026.

  76. [76]

    TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks

    Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. Theagentcompany: benchmarking llm agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024a.

    Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui,...

  77. [77]

    Gta1: Gui test-time scaling agent

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025b.

    Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. In Findings of the Associa...

  78. [78]

    Mobile-agent-v3: Fundamental agents for gui automation

    Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-agent-v3: Fundamental agents for gui automation. arXiv preprint arXiv:2508.15144, 2025.

  79. [79]

    Ovm, outcome-supervised value models for planning in mathematical reasoning

    Fei Yu, Anningzhe Gao, and Benyou Wang. Ovm, outcome-supervised value models for planning in mathematical reasoning. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 858–875, 2024.

  80. [80]

    MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

    Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. Memsearcher: Training llms to reason, search and manage memory via end-to-end reinforcement learning. arXiv preprint arXiv:2511.02805, 2025.

Showing first 80 references.