LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, Shiyin Lu, Qifeng Chen

arXiv: 2506.09373 · v3 · submitted 2025-06-11 · cs.LG · cs.AI · cs.CV

Pith reviewed 2026-05-19 09:43 UTC · model grok-4.3

classification: cs.LG · cs.AI · cs.CV

keywords: GUI agents, location preference optimization, spatial localization, information entropy, preference optimization, reinforcement learning, autonomous agents

The pith

Location Preference Optimization improves GUI agent accuracy by rewarding positions based on physical distance and information entropy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Location Preference Optimization as a way to make autonomous GUI agents more precise when they interpret natural language commands to click or tap on screen elements. Standard supervised fine-tuning falls short on learning exact positions, while typical reinforcement learning lacks a good way to judge how close a predicted spot is to the right one. LPO fixes this by selecting interaction zones through information entropy and applying a reward that scales with the actual physical distance to the target. It pairs this with Group Relative Preference Optimization to let the agent explore interfaces more thoroughly. If the method works, agents could handle complex software tasks with fewer mistakes in both controlled tests and live use.
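To make the group-based training step more concrete, the sketch below shows the group-relative advantage normalization used by GRPO, which DeepSeekMath [19] introduced under the name Group Relative Policy Optimization. Several candidate actions are sampled for the same instruction, and each reward is standardized against the group. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made here for illustration; this is not the authors' training code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other samples in its group."""
    mu = mean(rewards)
    # Population standard deviation is an assumption of this sketch.
    sigma = pstdev(rewards)
    # eps keeps the division defined when every sample gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example inputs only: four sampled clicks with different location rewards.
advantages = group_relative_advantages([0.955, 0.782, 0.302, 0.048])
```

Samples that score above the group mean receive positive advantages and are reinforced; those below the mean are discouraged. No separate value network is needed for this step.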

Core claim

LPO optimizes interaction preferences by using locational data, with information entropy to focus on zones rich in information and a dynamic location reward function based on physical distance that reflects varying importance of positions, all supported by Group Relative Preference Optimization to enhance precision across GUI environments.

What carries the argument

Location Preference Optimization (LPO), a method that selects zones via information entropy and scores actions with a physical-distance reward, then trains via Group Relative Preference Optimization.
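As a rough illustration of what a physical-distance reward could look like, the sketch below scores a predicted click by a Gaussian-style decay of its normalized Euclidean distance to the target. The function name `distance_reward`, the Gaussian form, and the `sigma` constant are all assumptions made for this example; the abstract does not give the paper's actual reward formula.

```python
import math

def distance_reward(pred, target, screen_size, sigma=0.1):
    """Hypothetical distance-based reward in (0, 1]; not the paper's formula."""
    width, height = screen_size
    # Normalize by screen dimensions so the score does not depend on resolution.
    dx = (pred[0] - target[0]) / width
    dy = (pred[1] - target[1]) / height
    dist = math.hypot(dx, dy)
    # Assumed Gaussian decay: 1.0 at the target, falling smoothly with distance.
    return math.exp(-(dist ** 2) / (2 * sigma ** 2))

near = distance_reward((510, 300), (500, 300), (1000, 600))
far = distance_reward((900, 300), (500, 300), (1000, 600))
```

Unlike a binary inside-the-box check, a score of this kind still separates a near miss from a far miss, which is the property a preference signal needs when ranking sampled actions.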

If this is right

Higher success rates on offline GUI agent benchmarks compared with prior supervised and reinforcement methods.
State-of-the-art results on real-world online evaluations of live interface interactions.
More thorough exploration of GUI states during training, leading to better positional choices.
Reduced need for manual tuning when moving the agent across different applications.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distance-plus-entropy reward structure could be tested on non-screen interfaces that still require spatial actions, such as robotic arms or AR overlays.
If the method generalizes, it might cut the volume of labeled demonstrations needed to train new GUI agents.
Running the approach on mobile versus desktop layouts would test whether entropy selection stays unbiased across screen densities.
Pairing LPO with additional visual features could further tighten the distance-based reward signal.

Load-bearing premise

That a reward function based on physical distance between predicted and target locations, combined with entropy-based zone selection, provides a reliable and generalizable signal for positional accuracy without introducing bias toward particular GUI layouts or requiring extensive per-app tuning.

What would settle it

Direct comparison of positional error rates or task success rates on the paper's offline benchmarks when LPO is replaced by plain supervised fine-tuning or standard reinforcement learning; if the gap disappears, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2506.09373 by Hao Lu, Jiaqi Tang, Qifeng Chen, Qing-Guo Chen, Shiyin Lu, Xiangyu Wu, Xiaogang Xu, Yanqing Ma, Yi-Feng Wu, Yuhui Chen, Yuwei Hu, Yu Xia.

**Figure 1.** Motivation of dynamic location reward. (a) UI-TARS [18] uses direct text-level matching; (b) UI-R1 [16], InfiGUI-R1 [13] and RUIG [28] employ bounding boxes for interaction preferences; (c) GUI-R1 [23] relies on fixed positional boundaries; (d) our dynamic location reward offers a more precise positional representation, addressing the limitations of previous methods.

**Figure 2.** Example of rw. Green zones indicate high interaction likelihood due to rich information, earning greater rewards. In contrast, red zones, like blank areas, have lower interaction probability and rewards. Key interactive areas, such as login, search, and editing zones, align with user interaction tendencies. Labels in the figure: Interaction Point; Reward = 0.955, 0.782, 0.302, 0.048.

Original abstract

The advent of autonomous agents is transforming interactions with Graphical User Interfaces (GUIs) by employing natural language as a powerful intermediary. Despite the predominance of Supervised Fine-Tuning (SFT) methods in current GUI agents for achieving spatial localization, these methods face substantial challenges due to their limited capacity to accurately perceive positional data. Existing strategies, such as reinforcement learning, often fail to assess positional accuracy effectively, thereby restricting their utility. In response, we introduce Location Preference Optimization (LPO), a novel approach that leverages locational data to optimize interaction preferences. LPO uses information entropy to predict interaction positions by focusing on zones rich in information. Besides, it further introduces a dynamic location reward function based on physical distance, reflecting the varying importance of interaction positions. Supported by Group Relative Preference Optimization (GRPO), LPO facilitates an extensive exploration of GUI environments and significantly enhances interaction precision. Comprehensive experiments demonstrate LPO's superior performance, achieving SOTA results across both offline benchmarks and real-world online evaluations. Our code will be made publicly available soon, at https://github.com/jqtangust/LPO.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LPO adds entropy-based zone focus and a physical-distance reward to preference optimization for GUI agents, but the abstract supplies no metrics or ablations to show whether it actually improves accuracy or just fits the benchmarks.

Here's the quick take on this one: LPO combines information entropy to pick informative zones with a dynamic physical distance reward inside a preference optimization framework to get better location accuracy for GUI agents. The paper does a solid job identifying the weakness in SFT for positional perception and in standard RL for evaluating accuracy. Adding a reward that reflects varying importance based on distance is a sensible idea, and tying it to GRPO for exploration makes sense on paper. The soft spots are that the abstract gives no numbers, no ablations, and no specifics on how the reward is calculated or normalized. That makes it tough to assess if the approach really delivers or if it risks biasing toward certain GUI layouts as the stress test suggests. The relationship to other preference optimization work also needs clearer positioning. This is aimed at folks working on GUI agents and reinforcement learning for interfaces. Readers interested in practical improvements to agent reliability would get some value if the experiments check out. It deserves a serious referee to verify the claims and check for hidden biases in the reward design. I'd recommend sending it to peer review rather than desk rejecting, since the core idea addresses a genuine pain point even if the current writeup is light on evidence.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Location Preference Optimization (LPO) for improving spatial localization in GUI agents. It combines information entropy to identify high-information zones for position prediction with a dynamic location reward based on physical distance, optimized under Group Relative Preference Optimization (GRPO). The central claim is that this yields superior performance, achieving SOTA results on both offline benchmarks and real-world online evaluations.

Significance. If the results hold after detailed validation, LPO could meaningfully advance GUI agent reliability by supplying a more targeted preference signal for positional accuracy than standard SFT or generic RL approaches. The entropy-driven zone selection and distance-based reward constitute a concrete attempt to address a known weakness in current methods; the planned public code release would further strengthen the contribution.

major comments (2)

§3 (Method): The dynamic location reward is described as a function of physical distance, yet its exact mathematical form, normalization procedure, and any scaling constants are not specified. These constants are identified as free parameters in the supporting analysis; without their explicit definition the reward signal cannot be reproduced or checked for layout-specific bias.
§4 (Experiments): No quantitative metrics, ablation results isolating the entropy zone selector versus the distance reward, or error analysis stratified by element size, density, or screen resolution are referenced. The SOTA claim on both offline and online settings rests on these missing controls; the skeptic concern that Euclidean distance may be a poor proxy for large tappable regions therefore remains unaddressed.
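The concern about large tappable regions can be made concrete with a small sketch. It compares the Euclidean distance from a click to an element's centre with the distance from the click to the nearest point of the element's bounding box. The helpers `center_distance` and `box_distance` and the example coordinates are hypothetical and are not taken from the paper.

```python
import math

def center_distance(point, box):
    """Euclidean distance from a click to the centre of box = (x0, y0, x1, y1)."""
    cx = (box[0] + box[2]) / 2
    cy = (box[1] + box[3]) / 2
    return math.hypot(point[0] - cx, point[1] - cy)

def box_distance(point, box):
    """Distance from a click to the nearest point of the box; 0.0 if inside."""
    dx = max(box[0] - point[0], 0, point[0] - box[2])
    dy = max(box[1] - point[1], 0, point[1] - box[3])
    return math.hypot(dx, dy)

wide_button = (0, 0, 400, 40)   # a 400 x 40 px tappable bar
click = (390, 20)               # valid tap near the right edge

print(center_distance(click, wide_button))  # 190.0
print(box_distance(click, wide_button))     # 0.0
```

A reward driven only by centre distance would penalize this tap even though it lands on the element. An error analysis stratified by element size would show whether that effect matters in practice.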

minor comments (2)

Abstract: The relationship between GRPO and prior preference-optimization algorithms (DPO, PPO, etc.) should be stated with citations so readers can assess novelty.
Notation (throughout): Define all acronyms (SFT, GRPO, LPO) at first use and ensure consistent use of “location” versus “positional” terminology throughout.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for the thorough review and valuable feedback on our manuscript introducing Location Preference Optimization (LPO). The comments have helped us identify areas for improvement in clarity and completeness. We address each major comment below and have updated the manuscript to incorporate the suggested changes where appropriate.

Referee: [§3 (Method)] The dynamic location reward is described as a function of physical distance, yet its exact mathematical form, normalization procedure, and any scaling constants are not specified. These constants are identified as free parameters in the supporting analysis; without their explicit definition the reward signal cannot be reproduced or checked for layout-specific bias.

Authors: We agree that providing the exact mathematical form is necessary for reproducibility. In the revised manuscript, we have explicitly specified the dynamic location reward function in Section 3, including its dependence on physical distance, the normalization procedure to ensure scale-invariance across different screen sizes, and the values of scaling constants used. This allows readers to reproduce the reward signal and assess any potential layout-specific biases. We have also added a brief analysis of the reward's sensitivity to these parameters.
Revision: yes

Referee: [§4 (Experiments)] No quantitative metrics, ablation results isolating the entropy zone selector versus the distance reward, or error analysis stratified by element size, density, or screen resolution are referenced. The SOTA claim on both offline and online settings rests on these missing controls; the skeptic concern that Euclidean distance may be a poor proxy for large tappable regions therefore remains unaddressed.

Authors: We thank the referee for this important suggestion. Although the main experimental results show SOTA performance, we recognize that more detailed ablations and analyses would better isolate the contributions of each component and address potential limitations of the distance-based reward. In the revised version, we have included quantitative ablation studies comparing variants with and without the entropy zone selector and the distance reward. We have also added error analyses stratified by element size, UI density, and screen resolution. To address the concern about Euclidean distance for large tappable regions, we discuss this limitation and show through additional metrics that our method maintains advantages even in such cases. These revisions provide stronger support for our claims.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The abstract presents LPO as a new method that combines entropy-based zone selection with a dynamic location reward based on physical distance, then applies GRPO for optimization. No equations, fitted parameters renamed as predictions, or self-referential definitions appear in the provided text. GRPO is invoked as supporting framework without any load-bearing self-citation chain or uniqueness theorem that reduces the central claim to prior author work by construction. The performance claims rest on experimental results rather than tautological re-labeling of inputs. This is the normal case of an independent proposal whose validity can be checked externally via the promised code and benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim depends on the assumption that entropy reliably identifies interaction-rich zones and that a distance-based reward meaningfully captures positional importance; these are treated as domain assumptions rather than derived results. No new physical entities are postulated.

free parameters (1)

scaling constants in dynamic location reward
The reward function is described as dynamic and based on physical distance; any weighting or normalization constants required to combine entropy and distance signals would constitute free parameters fitted or chosen during training.

axioms (2)

Domain assumption: Information entropy computed over screen regions identifies zones that are most informative for interaction decisions.
Invoked to guide position prediction before reward application.
Domain assumption: Physical distance between predicted and target locations provides a monotonic and generalizable measure of interaction quality.
Basis for the dynamic reward function.
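The first assumption can be illustrated with a minimal sketch that computes Shannon entropy over the intensity values of a screen patch. How the paper actually computes entropy over regions is not stated on this page, so the pixel-intensity formulation, the function name `patch_entropy`, and the toy patches below are assumptions for illustration only.

```python
import math
from collections import Counter

def patch_entropy(pixels):
    """Shannon entropy (in bits) of the intensity values in one screen patch."""
    counts = Counter(pixels)
    total = len(pixels)
    # Sum p * log2(1/p) over the distinct intensity values in the patch.
    return sum((c / total) * math.log2(total / c) for c in counts.values())

blank_patch = [255] * 16        # uniform background, e.g. empty margin
busy_patch = [0, 255] * 8       # two intensities, equally frequent

print(patch_entropy(blank_patch))  # 0.0
print(patch_entropy(busy_patch))   # 1.0
```

Under this formulation a blank area scores zero, and a patch containing text or icons scores higher. Whether high entropy actually tracks where users interact is the part that remains an assumption.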

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: LPO uses information entropy to predict interaction positions by focusing on zones rich in information... dynamic location reward function based on physical distance

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat recovery · unclear
Paper passage: Supported by Group Relative Preference Optimization (GRPO)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 unverdicted novelty 7.0

GUI-SD is the first on-policy self-distillation framework for GUI grounding that adds privileged bounding-box context and entropy-guided weighting to outperform GRPO methods on six benchmarks in accuracy and efficiency.

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding
cs.AI 2026-05 accept novelty 7.0

GUI-SD introduces on-policy self-distillation with visually enriched privileged context and entropy-guided weighting, outperforming GRPO and naive OPSD on six GUI grounding benchmarks while improving training efficiency.

GUI Agents with Reinforcement Learning: Toward Digital Inhabitants
cs.AI 2026-04 unverdicted novelty 5.0

The paper delivers the first comprehensive overview of RL for GUI agents, organizing methods into offline, online, and hybrid strategies while analyzing trends in rewards, efficiency, and deliberation to outline a fut...

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 2 Pith papers · 6 internal anchors

[1] Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317, 2024.

[2] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024.

[3] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems, 36:28091–28114, 2023.

[4] Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.

[5] Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, and Izzeddin Gur. Exposing limitations of language model agents in sequential-task compositions on the web. Transactions on Machine Learning Research,

[6] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025.

[7] Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024.

[8] Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023.

[9] Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161–178. Springer, 2024.

[10] Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. Mug: Interactive multimodal grounding on user interfaces. arXiv preprint arXiv:2209.15099, 2022.

[11] Henry Lieberman. Autonomous interface agents. In Proceedings of the ACM SIGCHI Conference on Human factors in computing systems, pages 67–74, 1997.

[12] Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024.

[13] Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners, 2025.

[14] Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv preprint arXiv:2405.20797, 2024.

[15] Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent, 2024.

[16] Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025.

[17] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue, 2024.

[18] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya... Ui-tars: Pioneering automated gui interaction with native agents, 2025.

[19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[20] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723, 2024.

[21] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024.

[22] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024.

[23] Xiaobo Xia and Run Luo. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025.

[24] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.

[25] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... Dapo: An open-source llm reinforcement learning system at scale, 2025.

[26] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279,

[27] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024.

[28] Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. Reinforced ui instruction grounding: Towards a generic ui task automation api. arXiv preprint arXiv:2310.04716, 2023.

[1] [1]

Guicourse: From general vision language models to versatile gui agents

Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guirong Chen, Yupeng Huo, et al. Guicourse: From general vision language models to versatile gui agents. arXiv preprint arXiv:2406.11317, 2024. 6

work page arXiv 2024

[2] [2]

Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents, 2024. 1, 3

work page 2024

[3] [3]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Sam Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems , 36:28091–28114, 2023. 1, 6

work page 2023

[4] [4]

Mind2web: Towards a generalist agent for the web

Xiang Deng, Yu Gu, Boyuan Zheng, Shijie Chen, Samuel Stevens, Boshi Wang, Huan Sun, and Yu Su. Mind2web: Towards a generalist agent for the web. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. 2, 6, 7, 9

work page 2023

[5] [5]

Exposing limitations of language model agents in sequential-task compositions on the web

Hiroki Furuta, Yutaka Matsuo, Aleksandra Faust, and Izzeddin Gur. Exposing limitations of language model agents in sequential-task compositions on the web. Transactions on Machine Learning Research,

work page

[6] [6]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations , 2025. 3

work page 2025

[7] [7]

Webvoyager: Building an end-to-end web agent with large multimodal models, 2024

Hongliang He, Wenlin Yao, Kaixin Ma, Wenhao Yu, Yong Dai, Hongming Zhang, Zhenzhong Lan, and Dong Yu. Webvoyager: Building an end-to-end web agent with large multimodal models, 2024. 1, 2, 8

work page 2024

[8] [8]

Cogagent: A visual language model for gui agents, 2023

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, and Jie Tang. Cogagent: A visual language model for gui agents, 2023. 1, 2, 3

work page 2023

[9] [9]

Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web

Raghav Kapoor, Yash Parag Butala, Melisa Russak, Jing Yu Koh, Kiran Kamble, Waseem AlShikh, and Ruslan Salakhutdinov. Omniact: A dataset and benchmark for enabling multimodal generalist autonomous agents for desktop and web. In European Conference on Computer Vision, pages 161–178. Springer, 2024. 6

work page 2024

[10] [10]

Mug: Interactive multimodal grounding on user interfaces

Tao Li, Gang Li, Jingjie Zheng, Purple Wang, and Yang Li. Mug: Interactive multimodal grounding on user interfaces. arXiv preprint arXiv:2209.15099, 2022. 6

work page arXiv 2022

[11] [11]

Autonomous interface agents

Henry Lieberman. Autonomous interface agents. In Proceedings of the ACM SIGCHI Conference on Human factors in computing systems , pages 67–74, 1997. 1

work page 1997

[12] [12]

Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024

Junpeng Liu, Yifan Song, Bill Yuchen Lin, Wai Lam, Graham Neubig, Yuanzhi Li, and Xiang Yue. Visualwebbench: How far have multimodal llms evolved in web page understanding and grounding?, 2024. 2, 7

work page 2024

[13] [13]

Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners, 2025

Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners, 2025. 1, 2, 3, 4, 6, 7, 8

work page 2025

[14] [14]

Ovis: Structural embedding alignment for multimodal large language model.arXiv preprint arXiv:2405.20797, 2024

Shiyin Lu, Yang Li, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Han-Jia Ye. Ovis: Structural embedding alignment for multimodal large language model. arXiv:2405.20797, 2024. 6

work page arXiv 2024

[15] [15]

Omniparser for pure vision based gui agent, 2024

Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah. Omniparser for pure vision based gui agent, 2024. 2

work page 2024

[16] [16]

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hong- sheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025. 1, 2, 3, 4, 6, 7, 8

work page internal anchor Pith review Pith/arXiv arXiv 2025

[17] Xing Han Lù, Zdeněk Kasner, and Siva Reddy. Weblinx: Real-world website navigation with multi-turn dialogue, 2024.

[18] Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, Wanjun Zhong, Kuanye Li, Jiale Yang, Yu Miao, Woyu Lin, Longxiang Liu, Xu Jiang, Qianli Ma, Jingyu Li, Xiaojun Xiao, Kai Cai, Chuang Li, Yaowei Zheng, Chaolin Jin, Chen Li, Xiao Zhou, Minchao Wang, Haoli Chen, Zhaojian Li, Haihua Ya... Ui-tars: Pioneering automated gui interaction with native agents, 2025.

[19] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[20] Qiushi Sun, Kanzhi Cheng, Zichen Ding, Chuanyang Jin, Yian Wang, Fangzhi Xu, Zhenyu Wu, Chengyou Jia, Liheng Chen, Zhoumianze Liu, et al. Os-genesis: Automating gui agent trajectory construction via reverse task synthesis. arXiv preprint arXiv:2412.19723, 2024.

[21] Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024.

[22] Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024.

[23] Xiaobo Xia and Run Luo. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025.

[24] Jianwei Yang, Hao Zhang, Feng Li, Xueyan Zou, Chunyuan Li, and Jianfeng Gao. Set-of-mark prompting unleashes extraordinary visual grounding in gpt-4v. arXiv preprint arXiv:2310.11441, 2023.

[25] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Weinan Dai, Yuxuan Song, Xiangpeng Wei, Hao Zhou, Jingjing Liu, W... Dapo: An open-source llm reinforcement learning system at scale, 2025.

[26] Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279, 2024.

[27] Jiwen Zhang, Jihao Wu, Yihua Teng, Minghui Liao, Nuo Xu, Xiao Xiao, Zhongyu Wei, and Duyu Tang. Android in the zoo: Chain-of-action-thought for gui agents. arXiv preprint arXiv:2403.02713, 2024.

[28] Zhizheng Zhang, Wenxuan Xie, Xiaoyi Zhang, and Yan Lu. Reinforced ui instruction grounding: Towards a generic ui task automation api. arXiv preprint arXiv:2310.04716, 2023.

This appendix introduces the social impact and future work of this paper.

A Social Impact

The development and deployment of autonomous agents capable of interacting effectiv...