GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
Pith reviewed 2026-05-07 05:43 UTC · model grok-4.3
The pith
A survey organizes reinforcement learning methods for GUI agents into offline, online, and hybrid categories on the path to digital inhabitants.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors argue that reinforcement learning is central to GUI agent development because supervised fine-tuning cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments. They introduce a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and they complement the taxonomy with examinations of reward engineering, data efficiency, and key technical innovations. Their analysis identifies emerging trends: composite multi-tier reward architectures that address the reliability-scalability tension, world-model-based training that overcomes GUI I/O latency, and the spontaneous emergence of System-2-style deliberation from sufficiently rich reward signals, without explicit reasoning supervision.
What carries the argument
A three-category taxonomy of Offline RL, Online RL, and Hybrid Strategies that classifies reinforcement learning methods for GUI agents and supports analysis of reward architectures and world models.
If this is right
- Composite multi-tier reward architectures can balance reliability and scalability in agent training.
- World-model-based training can deliver substantial performance gains by reducing reliance on slow GUI input-output operations.
- Rich reward signals can produce emergent System-2 deliberation without the need for explicit reasoning supervision.
- Future progress depends on advancing process rewards, continual reinforcement learning, cognitive architectures, and safe deployment practices.
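As a purely illustrative reading of the first bullet, a composite multi-tier reward might blend a sparse but reliable outcome signal with denser, noisier process and format signals. The tiers, field names, and weights below are assumptions for illustration, not the paper's actual formulation:

```python
# Hypothetical composite, multi-tier reward for a GUI agent episode.
# Tier choices and weights are illustrative assumptions.

def composite_reward(task_completed: bool,
                     subgoals_hit: int,
                     subgoals_total: int,
                     action_valid: bool) -> float:
    """Blend a sparse outcome reward with denser process-level signals."""
    outcome = 1.0 if task_completed else 0.0           # reliable but sparse
    process = subgoals_hit / max(subgoals_total, 1)    # scalable but noisier
    penalty = 0.0 if action_valid else -0.5            # discourage malformed actions
    # Weights trade reliability (outcome tier) against scalability (process tier).
    return 0.6 * outcome + 0.4 * process + penalty
```

The weighting makes the reliability-scalability tension concrete: shifting weight toward the outcome tier yields trustworthy but sparse feedback, while shifting it toward the process tier densifies the signal at the cost of noise.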
Where Pith is reading between the lines
- Agents built under this taxonomy could eventually transfer skills across different software applications without full retraining.
- The three-way split may highlight gaps where new hybrid designs are needed to combine data efficiency with real-time adaptability.
- Emergent deliberation implies that complex internal reasoning could arise naturally once reward signals are sufficiently detailed.
- Safe deployment methods will be required once agents begin handling real user data and irreversible actions on personal devices.
Load-bearing premise
The existing literature on reinforcement learning for GUI agents can be comprehensively and accurately captured by the proposed offline, online, and hybrid taxonomy without significant omissions or selection bias.
What would settle it
A search that identifies multiple prominent RL-based GUI agent papers that cannot be placed in any of the three taxonomy categories, or experiments showing that world-model training produces no performance gains over direct GUI interaction.
Original abstract
Graphical User Interface (GUI) agents have emerged as a promising paradigm for intelligent systems that perceive and interact with graphical interfaces visually. Yet supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments, making Reinforcement Learning (RL) a central methodology for advancing automation. In this work, we present the first comprehensive overview of the intersection between RL and GUI agents, and examine how this research direction may evolve toward digital inhabitants. We propose a principled taxonomy that organizes existing methods into Offline RL, Online RL, and Hybrid Strategies, and complement it with analyses of reward engineering, data efficiency, and key technical innovations. Our analysis reveals several emerging trends: the tension between reliability and scalability is motivating the adoption of composite, multi-tier reward architectures; GUI I/O latency bottlenecks are accelerating the shift toward world-model-based training, which can yield substantial performance gains; and the spontaneous emergence of System-2-style deliberation suggests that explicit reasoning supervision may not be necessary when sufficiently rich reward signals are available. We distill these findings into a roadmap covering process rewards, continual RL, cognitive architectures, and safe deployment, aiming to guide the next generation of robust GUI automation and its agent-native infrastructure.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to offer the first comprehensive overview of reinforcement learning applied to GUI agents. It introduces a principled taxonomy categorizing methods into Offline RL, Online RL, and Hybrid Strategies. Complementary analyses cover reward engineering, data efficiency, and technical innovations. Emerging trends identified include the use of composite multi-tier reward architectures to balance reliability and scalability, a shift to world-model-based training to mitigate GUI I/O latency issues, and the spontaneous emergence of System-2-style deliberation from rich reward signals. The work concludes with a roadmap for future directions such as process rewards, continual RL, cognitive architectures, and safe deployment toward digital inhabitants.
Significance. If the proposed taxonomy proves to be comprehensive and the trends are based on a representative sample of the literature, this survey would be highly significant. It would provide a structured framework for understanding the current state of RL in GUI agents, highlight key challenges and innovations, and offer a forward-looking roadmap that could guide research in developing more advanced, autonomous GUI agents. This could accelerate progress in the field by standardizing approaches and identifying promising research avenues. The focus on evolving toward 'digital inhabitants' adds a visionary aspect that may inspire new lines of inquiry.
major comments (2)
- [Taxonomy (Section 3)] The central claim rests on a 'principled taxonomy' that partitions existing RL-for-GUI methods into Offline RL, Online RL, and Hybrid Strategies. However, the manuscript does not specify the categorization criteria, provide a mapping table of all cited methods to these categories, or discuss potential overlaps or methods that fall outside these bins (e.g., pure imitation learning approaches). This is load-bearing for the comprehensiveness claim and the validity of the derived trends.
- [Literature Review Methodology (Introduction or Section 2)] No search protocol, databases, keywords, date range, or inclusion criteria are described. This omission is critical because the trends (composite rewards, world-model shift, emergent deliberation) depend on the selected literature being representative without selection bias. The paper should either detail the review process or acknowledge limitations in coverage.
minor comments (2)
- [Abstract] The abstract mentions 'analyses of reward engineering, data efficiency, and key technical innovations' but the corresponding sections could benefit from more quantitative comparisons or summary tables to support the qualitative trends.
- [Roadmap section] The roadmap is presented at a high level; including specific open problems or example research questions would enhance its utility.
Simulated Author's Rebuttal
We sincerely thank the referee for their constructive and insightful review. We appreciate the positive assessment of the paper's potential significance in providing a structured framework for RL applied to GUI agents and the forward-looking roadmap toward digital inhabitants. We agree that the taxonomy requires explicit criteria and a mapping to support the comprehensiveness claim, and that the literature review process should be documented to address potential selection bias concerns. We will undertake major revisions to incorporate these elements, as detailed in our point-by-point responses below.
Point-by-point responses
Referee: [Taxonomy (Section 3)] The central claim rests on a 'principled taxonomy' that partitions existing RL-for-GUI methods into Offline RL, Online RL, and Hybrid Strategies. However, the manuscript does not specify the categorization criteria, provide a mapping table of all cited methods to these categories, or discuss potential overlaps or methods that fall outside these bins (e.g., pure imitation learning approaches). This is load-bearing for the comprehensiveness claim and the validity of the derived trends.
Authors: We agree that explicitly defining the categorization criteria is necessary to substantiate the taxonomy and derived trends. In the revised manuscript, we will add a dedicated subsection in Section 3 ('Taxonomy Criteria and Scope') that specifies the partitioning rules: Offline RL covers methods that train exclusively on pre-collected, static GUI interaction datasets using offline algorithms without further live interaction (to avoid unsafe exploration in irreversible environments); Online RL includes methods that perform iterative, direct interaction with GUI environments (real or simulated) during training; and Hybrid Strategies combine both phases, such as offline pre-training on large-scale data followed by online fine-tuning or adaptation. We will insert a new mapping table (Table 1) that enumerates all cited methods, assigns each to its primary category with a one-sentence justification, and flags any boundary cases. Regarding overlaps and out-of-scope methods, we will add explicit discussion noting that pure imitation learning approaches are classified under Offline RL when they rely on demonstration data for behavioral cloning without reward-driven optimization, but hybrid cases (e.g., imitation-initialized RL) are placed in Hybrid; methods falling outside (e.g., non-RL supervised fine-tuning or non-GUI agents) are excluded per the survey scope focused on RL for GUI agents. These additions will directly support the validity of trends such as composite rewards and world-model shifts.
revision: yes
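The partitioning rules stated in this response can be sketched as a small classifier. The `Method` fields and the decision order are a hedged reading of the stated criteria, not the paper's actual implementation:

```python
# Illustrative sketch of the taxonomy criteria described for Section 3
# ("Taxonomy Criteria and Scope"). Field names are assumptions.

from dataclasses import dataclass

@dataclass
class Method:
    uses_offline_data: bool   # trains on pre-collected GUI trajectories
    interacts_online: bool    # live environment interaction during training

def categorize(m: Method) -> str:
    """Assign a method to its primary taxonomy category."""
    if m.uses_offline_data and m.interacts_online:
        return "Hybrid"        # e.g. offline pre-training + online fine-tuning
    if m.interacts_online:
        return "Online RL"     # iterative, direct GUI interaction
    if m.uses_offline_data:
        return "Offline RL"    # static datasets only; no unsafe live exploration
    return "out of scope"      # e.g. non-RL supervised fine-tuning
```

Writing the rules this way also surfaces the boundary cases the referee asked about: imitation-initialized RL sets both flags and lands in Hybrid, while demonstration-only behavioral cloning sets only the offline flag.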
Referee: [Literature Review Methodology (Introduction or Section 2)] No search protocol, databases, keywords, date range, or inclusion criteria are described. This omission is critical because the trends (composite rewards, world-model shift, emergent deliberation) depend on the selected literature being representative without selection bias. The paper should either detail the review process or acknowledge limitations in coverage.
Authors: We acknowledge that documenting the review methodology is essential for a survey claiming comprehensiveness and to mitigate concerns about selection bias in the identified trends. In the revised manuscript, we will insert a new subsection titled 'Literature Review Methodology' (placed in Section 2 or immediately following the introduction) that details: databases searched (arXiv, Google Scholar, IEEE Xplore, ACL Anthology, and major conference proceedings from 2020 onward); search keywords and Boolean queries (e.g., ('GUI agent' OR 'graphical user interface agent') AND ('reinforcement learning' OR 'RL') AND ('visual' OR 'screenshot')); date range (primarily January 2018–December 2024, capturing the emergence of modern GUI agents while including foundational works); and inclusion criteria (empirical papers applying RL to GUI agents with visual perception-action loops, reporting metrics on task completion or efficiency; exclusion of pure prompting/LLM-only methods without learning, non-visual interfaces, or non-agent UI design papers). We will also add a limitations paragraph noting that while the sampled literature represents prominent and highly cited works in the field, the rapidly evolving nature of the area may omit the most recent preprints, and trends are derived from this representative but not exhaustive set. This will strengthen confidence in the analyses of reward engineering, data efficiency, and emerging patterns.
revision: yes
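The Boolean query described in this response can be sketched as a simple inclusion filter. The keyword groups come from the authors' example query; the word-boundary matching logic and function names are illustrative assumptions:

```python
# Minimal sketch of the inclusion filter implied by the proposed review
# protocol. Keyword groups mirror the example Boolean query.
import re

GUI_TERMS = ("gui agent", "graphical user interface agent")
RL_TERMS = ("reinforcement learning", "rl")
VISUAL_TERMS = ("visual", "screenshot")

def _any_term(text: str, terms: tuple) -> bool:
    """True if any term appears as a whole word or phrase (case-insensitive)."""
    return any(re.search(r"\b" + re.escape(t) + r"\b", text, re.IGNORECASE)
               for t in terms)

def matches_query(title_abstract: str) -> bool:
    """(GUI terms) AND (RL terms) AND (visual terms), per the stated query."""
    return (_any_term(title_abstract, GUI_TERMS)
            and _any_term(title_abstract, RL_TERMS)
            and _any_term(title_abstract, VISUAL_TERMS))
```

Word-boundary matching avoids the obvious substring pitfall that a bare `"rl"` check would hit (e.g. matching inside "world"), which is exactly the kind of detail a documented protocol would need to pin down.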
Circularity Check
No circularity: survey taxonomy organizes external literature without self-referential derivation
full rationale
This paper is a literature survey that proposes a taxonomy (Offline RL, Online RL, Hybrid Strategies) to organize existing published methods and identifies interpretive trends in reward engineering and world models. No new mathematical derivations, fitted parameters, or predictions are generated from the paper's own data or definitions. The central claims rest on citations to external prior work rather than reducing to self-citations, ansatzes, or fitted inputs presented as novel results. No equations or self-definitional loops appear in the provided abstract or framing. The absence of an explicit search protocol affects completeness but does not create circularity by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Supervised fine-tuning alone cannot handle long-horizon credit assignment, distribution shifts, and safe exploration in irreversible environments
- ad hoc to paper Existing methods can be organized into a principled taxonomy of Offline RL, Online RL, and Hybrid Strategies
invented entities (1)
- digital inhabitants (no independent evidence)
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review arXiv
-
[2]
Automated reinforcement learning: An overview.arXiv preprint arXiv:2201.05000,
Reza Refaei Afshar, Yingqian Zhang, Joaquin Vanschoren, and Uzay Kaymak. Automated reinforcement learning: An overview.arXiv preprint arXiv:2201.05000,
-
[3]
arXiv preprint arXiv:2410.08164 , year =
Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s: An open agentic framework that uses computers like a human.arXiv preprint arXiv:2410.08164,
-
[4]
Saaket Agashe, Kyle Wong, Vincent Tu, Jiachen Yang, Ang Li, and Xin Eric Wang. Agent s2: A composi- tional generalist-specialist framework for computer use agents.arXiv preprint arXiv:2504.00906,
-
[5]
[Accessed 09-02-2026]. 34 GUI Agents with Reinforcement Learning: Toward Digital Inhabitants Tajamul Ashraf, Amal Saqib, Hanan Ghani, Muhra AlMahri, Yuhao Li, Noor Ahsan, Umair Nawaz, Jean Lahoud, Hisham Cholakkal, Mubarak Shah, et al. Agent-x: Evaluating deep multimodal reasoning in vision-centric agentic tasks.arXiv preprint arXiv:2505.24876,
-
[6]
Hao Bai, Yifei Zhou, Li Erran Li, Sergey Levine, and Aviral Kumar. Digi-q: Learning q-value functions for training device-control agents.arXiv preprint arXiv:2502.15760, 2025a. Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei H...
-
[7]
Terminal Agents Suffice for Enterprise Automation
Patrice Bechard, Orlando Marquez Ayala, Emily Chen, Jordan Skelton, Sagar Davasam, Srinivas Sunkara, Vikas Yadav, and Sai Rajeswar. Terminal agents suffice for enterprise automation.arXiv preprint arXiv:2604.00073,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Yadong Lu, Justin Wagle, Kazuhito Koishida, Arthur Bucker, et al. Windows agent arena: Evaluating multi-modal os agents at scale.arXiv preprint arXiv:2409.08264,
- [9]
-
[10]
C. Chen, J. Shao, D. Lu, H. Hu, X. Liu, H. Yao, and W. Liu. Gui-eyes: Tool-augmented perception for visual grounding in gui agents.arXiv preprint arXiv:2601.09770, 2026a. Haotian Chen, Xin Cong, Shengda Fan, Yuyang Fu, Ziqin Gong, Yaxi Lu, Yishan Li, Boye Niu, Chengjun Pan, Zijun Song, et al. Agentcpm-explore: Realizing long-horizon deep exploration for e...
-
[11]
35 GUI Agents with Reinforcement Learning: Toward Digital Inhabitants Shuang Chen, Quanxin Shou, Hangting Chen, Yucheng Zhou, Kaituo Feng, Wenbo Hu, Yi-Fan Zhang, Yunlong Lin, Wenxuan Huang, Mingyang Song, et al. Unify-agent: A unified multimodal agent for world- grounded image synthesis.arXiv preprint arXiv:2603.29620, 2026c. Z. Chen, Z. Zhao, K. Zhang, ...
-
[12]
Egor Cherepanov, Alexey K Kovalev, and Aleksandr I Panov. Elmur: External layer memory with up- date/rewrite for long-horizon rl.arXiv preprint arXiv:2510.07151,
-
[13]
The browsergym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,
De Chezelles, Thibault Le Sellier, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F Xu, Siva Reddy, Quentin Cappart, et al. The browsergym ecosystem for web agent research.arXiv preprint arXiv:2412.05467,
-
[14]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with ad- vanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,
work page internal anchor Pith review arXiv
-
[15]
Chaoqun Cui, Jing Huang, Shijing Wang, Liming Zheng, Qingchao Kong, and Zhixiong Zeng. Agentic reward modeling: Verifying gui agent via online proactive interaction.arXiv preprint arXiv:2602.00575,
-
[16]
Process Reinforcement through Implicit Rewards
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Yuchen Zhang, Jiacheng Chen, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, et al. Process reinforcement through implicit rewards.arXiv preprint arXiv:2502.01456,
work page internal anchor Pith review arXiv
-
[17]
G. Dai, S. Jiang, T. Cao, Y. Yang, Y. Li, R. Tan, M. Li, and L. Qiu. Prore: A proactive reward system for gui agents via reasoner–actor collaboration.arXiv preprint arXiv:2509.21823,
work page internal anchor Pith review Pith/arXiv arXiv
-
[18]
Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar
[Accessed 09-02-2026]. Biplab Deka, Zifeng Huang, Chad Franzen, Joshua Hibschman, Daniel Afergan, Yang Li, Jeffrey Nichols, and Ranjitha Kumar. Rico: A mobile app dataset for building data-driven design applications. InProceedings of the 30th annual ACM symposium on user interface software and technology, pp. 845–854,
2026
-
[19]
Mingkai Deng, Jinyu Hou, Zhiting Hu, and Eric Xing. Simura: A world-model-driven simulative reasoning architecture for general goal-oriented agents.arXiv preprint arXiv:2507.23773,
-
[20]
DynaWeb: Model-Based Reinforcement Learning of Web Agents
Hang Ding, Peidong Liu, Junqiao Wang, Ziwei Ji, Meng Cao, Rongzhao Zhang, Lynn Ai, Eric Yang, Tianyu Shi, and Lei Yu. Dynaweb: Model-based reinforcement learning of web agents.arXiv preprint arXiv:2601.22149,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545,
36 GUI Agents with Reinforcement Learning: Toward Digital Inhabitants Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, et al. Agentic entropy-balanced policy optimization.arXiv preprint arXiv:2510.14545,
-
[22]
Memr3: Memory retrieval via reflective reasoning for llm agents.arXiv preprint arXiv:2512.20237,
Xingbo Du, Loka Li, Duzhen Zhang, and Le Song. Memr3: Memory retrieval via reflective reasoning for llm agents.arXiv preprint arXiv:2512.20237,
-
[23]
arXiv preprint arXiv:2503.09572 (2025)
Lutfi Eren Erdogan, Nicholas Lee, Sehoon Kim, Suhong Moon, Hiroki Furuta, Gopala Anumanchipalli, Kurt Keutzer, and Amir Gholami. Plan-and-act: Improving planning of agents for long-horizon tasks.arXiv preprint arXiv:2503.09572,
-
[24]
Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, and Gang Wu. Gui-bee: Align gui action grounding to novel environments via autonomous exploration.arXiv preprint arXiv:2501.13896,
-
[25]
Group-in-Group Policy Optimization for LLM Agent Training
Lang Feng, Zhenghai Xue, Tingcong Liu, and Bo An. Group-in-group policy optimization for llm agent training.arXiv preprint arXiv:2505.10978,
work page internal anchor Pith review arXiv
-
[26]
Tianyu Fu, Anyang Su, Chenxu Zhao, Hanning Wang, Minghui Wu, Zhe Yu, Fei Hu, Mingjia Shi, Wei Dong, Jiayao Wang, et al. Mano technical report.arXiv preprint arXiv:2509.17336, 2025a. Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, et al. Areal: A large-scale asynchronous reinforcement learning sys...
-
[27]
Yifei Gao, Junhong Ye, Jiaqi Wang, and Jitao Sang. Websynthesis: World-model-guided mcts for efficient webui-trajectory synthesis.arXiv preprint arXiv:2507.04370,
-
[28]
Dylan Goetting, Himanshu Gaurav Singh, and Antonio Loquercio. End-to-end navigation with vision lan- guage models: Transforming spatial reasoning into question-answering.arXiv preprint arXiv:2411.05755,
-
[29]
Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243,
-
[30]
arXiv preprint arXiv:2411.06559 , year=
Yu Gu, Kai Zhang, Yuting Ning, Boyuan Zheng, Boyu Gou, Tianci Xue, Cheng Chang, Sanjari Srivastava, Yanan Xie, Peng Qi, et al. Is your llm secretly a world model of the internet? model-based planning for web agents.arXiv preprint arXiv:2411.06559,
-
[31]
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.Nature, 2025a. Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1. 5...
work page internal anchor Pith review arXiv 2023
-
[32]
Shuo He, Lang Feng, Qi Wei, Xin Cheng, Lei Feng, and Bo An. Hierarchy-of-groups policy optimization for long-horizon agentic tasks.arXiv preprint arXiv:2602.22817,
-
[33]
A data-driven approach for learning to control computers.arXiv preprint arXiv:2202.08137,
Peter C Humphreys, David Raposo, Toby Pohlen, Gregory Thornton, Rachita Chhaparia, Alistair Muldal, Josh Abramson, Petko Georgiev, Alex Goldin, Adam Santoro, and Timothy Lillicrap. A data-driven approach for learning to control computers.arXiv preprint arXiv:2202.08137,
-
[34]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720,
work page internal anchor Pith review Pith/arXiv arXiv
-
[35]
Hongrui Jia, Jitong Liao, Xi Zhang, Haiyang Xu, Tianbao Xie, Chaoya Jiang, Ming Yan, Si Liu, Wei Ye, and Fei Huang. Osworld-mcp: Benchmarking mcp tool invocation in computer-use agents.arXiv preprint arXiv:2510.24563,
-
[36]
arXiv preprint arXiv:2505.05262 , year=
Andreas Kontogiannis, Konstantinos Papathanasiou, Yi Shen, Giorgos Stamou, Michael M Zavlanos, and George Vouros. Enhancing cooperative multi-agent reinforcement learning with state modelling and ad- versarial exploration.arXiv preprint arXiv:2505.05262,
-
[37]
Os-harm: A benchmark for measuring safety of computer use agents
Thomas Kuntz, Agatha Duzan, Hao Zhao, Francesco Croce, Zico Kolter, Nicolas Flammarion, and Maksym Andriushchenko. Os-harm: A benchmark for measuring safety of computer use agents.arXiv preprint arXiv:2506.14866,
-
[38]
Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents.arXiv preprint arXiv:2508.14040,
-
[39]
Nested browser-use learning for agentic information seeking
Baixuan Li, Jialong Wu, Wenbiao Yin, Kuan Li, Zhongwang Zhang, Huifeng Yin, Zhengwei Tao, Liwen Zhang, Pengjun Xie, Jingren Zhou, et al. Nested browser-use learning for agentic information seeking. arXiv preprint arXiv:2512.23647, 2025a. Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pr...
-
[40]
Zhangheng Li, Keen You, Haotian Zhang, Di Feng, Harsh Agrawal, Xiujun Li, Mohana Prasad Sathya Moor- thy, Jeff Nichols, Yinfei Yang, and Zhe Gan. Ferret-ui 2: Mastering universal user interface understanding across platforms. InInternational Conference on Learning Representations (ICLR), 2024b. Shuquan Lian, Yuhang Wu, Jia Ma, Yifan Ding, Zihan Song, Bing...
work page internal anchor Pith review Pith/arXiv arXiv
-
[41]
Showui: One vision-language-action model for gui visual agent
39 GUI Agents with Reinforcement Learning: Toward Digital Inhabitants Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Stan Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. In Proceedings of the Computer Vision and Pattern Recognition Conference, pp. 19498–19508, ...
-
[42]
arXiv preprint arXiv:2508.05731. Yuhang Liu, Pengxiang Li, Zishu Wei, Congkai Xie, Xueyu Hu, Xinchen Xu, Shengyu Zhang, Xiaotian Han, Hongxia Yang, and Fei Wu. Infiguiagent: A multimodal generalist gui agent with native reasoning and reflection.arXiv preprint arXiv:2501.04575, 2025b. Yuhang Liu, Zeyu Liu, Shuanghe Zhu, Pengxiang Li, Congkai Xie, Jiasheng ...
-
[43]
Atsuyuki Miyai, Zaiying Zhao, Kazuki Egashira, Atsuki Sato, Tatsumi Sunada, Shota Onohara, Hiromasa Yamanishi, Mashiro Toyooka, Kunato Nishina, Ryoma Maeda, et al. Webchorearena: Evaluating web browsing agents on realistic tedious web tasks.arXiv preprint arXiv:2506.01952,
-
[44]
Playing Atari with Deep Reinforcement Learning
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. Playing atari with deep reinforcement learning.arXiv preprint arXiv:1312.5602,
work page internal anchor Pith review arXiv
-
[45]
WebGPT: Browser-assisted question-answering with human feedback
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, et al. Webgpt: Browser-assisted question-answering with human feedback.arXiv preprint arXiv:2112.09332,
work page internal anchor Pith review arXiv
-
[46]
Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui-vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661,
-
[47]
Gui agents: A survey
Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 22522–22538,
2025
-
[48]
[Accessed 09-02-2026]. OpenAI. Computer-using agent — openai.com.https://openai.com/index/computer-using-agent/, 2025a. [Accessed 09-02-2026]. OpenAI. Computer-using agent, 2025b. URLhttps://openai.com/index/computer-using-agent/. Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina ...
2026
-
[49]
Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents
Vardaan Pahuja, Yadong Lu, Corby Rosset, Boyu Gou, Arindam Mitra, Spencer Whitehead, Yu Su, and Ahmed Hassan. Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. InFindings of the Association for Computational Linguistics: ACL 2025, pp. 6300–6323,
2025
-
[50]
Yichen Pan, Dehan Kong, Sida Zhou, Cheng Cui, Yifei Leng, Bing Jiang, Hangyu Liu, Yanyi Shang, Shuyan Zhou, Tongshuang Wu, et al. Webcanvas: Benchmarking web agents in online environments.arXiv preprint arXiv:2406.12373,
-
[51]
Cong Pang, Xuyu Feng, Yujie Yi, Zixuan Chen, Jiawei Hong, Tiankuo Yao, Nang Yuan, Jiapeng Luo, Lewei Lu, and Xin Lou. Ica: Information-aware credit assignment for visually grounded long-horizon information-seeking agents.arXiv preprint arXiv:2602.10863,
-
[52]
Mapping natural language commandstowebelements
41 GUI Agents with Reinforcement Learning: Toward Digital Inhabitants Panupong Pasupat, Tian-Shun Jiang, Evan Liu, Kelvin Guu, and Percy Liang. Mapping natural language commandstowebelements. InProceedings of the 2018 conference on empirical methods in natural language processing, pp. 4970–4976,
2018
-
[53]
Jiangweizhi Peng, Yuanxin Liu, Ruida Zhou, Charles Fleming, Zhaoran Wang, Alfredo Garcia, and Mingyi Hong. Hiper: Hierarchical reinforcement learning with explicit credit assignment for large language model agents.arXiv preprint arXiv:2602.16165,
work page internal anchor Pith review arXiv
-
[54]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning.arXiv preprint arXiv:1910.00177,
work page internal anchor Pith review arXiv 1910
-
[55]
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents.arXiv preprint arXiv:2408.07199,
-
[56]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326,
work page internal anchor Pith review arXiv
-
[57]
Scaling synthetic task generation for agents via exploration.arXiv preprint arXiv:2509.25047,
Ram Ramrakhya, Andrew Szot, Omar Attia, Yuhao Yang, Anh Nguyen, Bogdan Mazoure, Zhe Gan, Harsh Agrawal, and Alexander Toshev. Scaling synthetic task generation for agents via exploration.arXiv preprint arXiv:2509.25047,
-
[58]
A Comprehensive Survey of Agents for Computer Use: Foundations, Challenges, and Future Directions
Pascal J Sager, Benjamin Meyer, Peng Yan, Rebekka von Wartburg-Kottler, Layan Etaiwi, Aref Enayati, Gabriel Nobel, Ahmed Abdulkadir, Benjamin F Grewe, and Thilo Stadelmann. A comprehensive survey of agents for computer use: Foundations, challenges, and future directions.arXiv preprint arXiv:2501.16150,
work page internal anchor Pith review Pith/arXiv arXiv
-
[59]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms.arXiv preprint arXiv:1707.06347,
work page internal anchor Pith review arXiv
-
[60]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
work page internal anchor Pith review arXiv
-
[61]
Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362,
Huawen Shen, Chang Liu, Gengluo Li, Xinlong Wang, Yu Zhou, Can Ma, and Xiangyang Ji. Falcon-ui: Understanding gui before following user instructions.arXiv preprint arXiv:2412.09362,
-
[62]
Experiential reinforcement learning.arXiv preprint arXiv:2602.13949, 2026
Taiwei Shi, Sihao Chen, Bowen Jiang, Linxin Song, Longqi Yang, and Jieyu Zhao. Experiential reinforcement learning.arXiv preprint arXiv:2602.13949,
-
[63]
Yucheng Shi, Wenhao Yu, Zaitang Li, Yonglin Wang, Hongming Zhang, Ninghao Liu, Haitao Mi, and Dong Yu. Mobilegui-rl: Advancing mobile gui agent through reinforcement learning in online environment.arXiv preprint arXiv:2507.05720,
-
[64]
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-LM: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019.
[65]
Chan Hee Song, Yiwen Song, Palash Goyal, Yu Su, Oriana Riva, Hamid Palangi, and Tomas Pfister. Watch and learn: Learning to use computers from online videos. arXiv preprint arXiv:2510.04673, 2025a.
[66]
Libo Sun, Jiwen Zhang, Siyuan Wang, and Zhongyu Wei. Magnet: Towards adaptive GUI agents with memory-driven knowledge evolution. arXiv preprint arXiv:2601.19199, 2026.
[67]
F. Tang, Z. Gu, Z. Lu, X. Liu, S. Shen, C. Meng, W. Wang, W. Zhang, Y. Shen, W. Lu, J. Xiao, and Y. Zhuang. GUI-G2: Gaussian reward modeling for GUI grounding. arXiv preprint arXiv:2507.15846, 2025a.
[68]
Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, et al. LPO: Towards accurate GUI agent interaction via location preference optimization. arXiv preprint arXiv:2506.09373, 2025d.
Liujian Tang, Shaokang Dong, Yijia Huang, Minqi Xiang, Hongtao Ruan, Bin Wang, Shuo Li, Zhiheng Xi, Zhihui Cao, Ha...
[69]
Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, and Yuanchun Li. AgentProg: Empowering long-horizon GUI agents with program-guided context management. arXiv preprint arXiv:2512.10371, 2025.
[70]
Daniel Toyama, Philippe Hamel, Anita Gergely, Gheorghe Comanici, Amelia Glaese, Zafarali Ahmed, Tyler Jackson, Shibl Mourad, and Doina Precup. AndroidEnv: A reinforcement learning platform for Android. arXiv preprint arXiv:2105.13231, 2021.
[71]
Haoming Wang, Haoyang Zou, Huatong Song, Jiazhan Feng, Junjie Fang, Junting Lu, Longxiang Liu, Qinyu Luo, Shihao Liang, Shijue Huang, et al. UI-TARS-2 technical report: Advancing GUI agent with multi-turn reinforcement learning. arXiv preprint arXiv:2509.02544, 2025a.
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing ...
[72]
Vivienne Huiling Wang, Tinghuai Wang, Wenyan Yang, Joni-Kristian Kämäräinen, and Joni Pajarinen. Probabilistic subgoal representations for hierarchical reinforcement learning. arXiv preprint arXiv:2406.16707, 2024e.
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internv...
[73]
Xinming Wei, Jiahao Zhang, Haoran Li, Jiayu Chen, Rui Qu, Maoliang Li, Xiang Chen, and Guojie Luo. Agent.xpu: Efficient scheduling of agentic LLM workloads on heterogeneous SoC. arXiv preprint arXiv:2506.24045, 2025a.
Z. Wei, W. Yao, Y. Liu, W. Zhang, Q. Lu, L. Qiu, C. Yu, P. Xu, C. Zhang, B. Yin, H. Yun, and L. Li. WebAgent-R1: Training web agents via en...
[74]
Jiannan Xiang, Yun Zhu, Lei Shu, Maria Wang, Lijun Yu, Gabriel Barcik, James Lyon, Srinivas Sunkara, and Jindong Chen. UIsim: An interactive image-based UI simulator for dynamic mobile environments. arXiv preprint arXiv:2509.21733, 2025.
[75]
Zikai Xiao, Jianhong Tu, Chuhang Zou, Yuxin Zuo, Zhi Li, Peng Wang, Bowen Yu, Fei Huang, Junyang Lin, and Zuozhu Liu. WebWorld: A large-scale world model for web agent training. arXiv preprint arXiv:2602.14721, 2026.
[76]
Frank F Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, et al. TheAgentCompany: Benchmarking LLM agents on consequential real world tasks. arXiv preprint arXiv:2412.14161, 2024a.
Si Xu, Zixiao Huang, Yan Zeng, Shengen Yan, Xuefei Ning, Quanlu Zhang, Haolin Ye, Sipei Gu, Chunsheng Shui,...
[77]
Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791, 2025b.
Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-UI: Visual grounding for GUI instructions. In Findings of the Associa...
[78]
Jiabo Ye, Xi Zhang, Haiyang Xu, Haowei Liu, Junyang Wang, Zhaoqing Zhu, Ziwei Zheng, Feiyu Gao, Junjie Cao, Zhengxi Lu, et al. Mobile-Agent-v3: Fundamental agents for GUI automation. arXiv preprint arXiv:2508.15144, 2025.
[79]
Fei Yu, Anningzhe Gao, and Benyou Wang. OVM, outcome-supervised value models for planning in mathematical reasoning. In Findings of the Association for Computational Linguistics: NAACL 2024, pp. 858–875, 2024.
[80]
Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. MemSearcher: Training LLMs to reason, search and manage memory via end-to-end reinforcement learning. arXiv preprint arXiv:2511.02805, 2025.