Pith · machine review for the scientific record

arxiv: 2605.00642 · v3 · submitted 2026-05-01 · 💻 cs.AI · cs.CV

Recognition: 2 Lean theorem links

Learn where to Click from Yourself: On-Policy Self-Distillation for GUI Grounding

Authors on Pith · no claims yet

Pith reviewed 2026-05-12 01:58 UTC · model grok-4.3

classification 💻 cs.AI cs.CV
keywords GUI grounding · self-distillation · on-policy learning · vision-language models · reinforcement learning · coordinate prediction · autonomous agents

The pith

On-policy self-distillation with visual context improves GUI grounding accuracy and efficiency over reinforcement learning methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GUI-SD, a framework that applies on-policy self-distillation to GUI grounding. Instead of relying on the multiple expensive rollouts of GRPO-style methods, it derives dense supervision from a single rollout: the teacher receives a privileged context built from the target bounding box and a Gaussian soft mask, and the distillation loss weights tokens by digit significance and teacher confidence (measured via entropy). The approach is shown to outperform baselines on six benchmarks while being more efficient. Readers training vision-language models for agentic tasks will find this relevant because it reduces the computational cost of learning precise coordinate prediction.
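
The single-rollout recipe can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the tensor shapes, the softmax/KL arrangement, and the function names are assumptions.

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def weighted_reverse_kl(student_logits, teacher_logits, weights):
    """Per-token reverse KL(student || teacher), weighted and averaged.

    Both logit tensors are (T, V): scores over the vocabulary for the T
    tokens of one student rollout. The teacher rescores the same tokens
    while conditioning on the privileged visual context. Uniform weights
    recover the naive OPSD objective.
    """
    p = softmax(student_logits)   # on-policy student distribution
    q = softmax(teacher_logits)   # privileged teacher distribution
    per_token = (p * (np.log(p + 1e-12) - np.log(q + 1e-12))).sum(axis=-1)
    return float((weights * per_token).sum() / weights.sum())
```

With identical student and teacher logits the loss is zero; GUI-SD's additions amount to making `weights` non-uniform (entropy-guided) and the teacher's context visually enriched.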

Core claim

GUI-SD demonstrates that on-policy self-distillation can be effectively adapted to GUI grounding by constructing a visually enriched privileged context for the teacher using a target bounding box and Gaussian soft mask, combined with entropy-guided weighting of tokens based on digit significance and teacher confidence, leading to consistent improvements in accuracy and training efficiency over GRPO-based methods and naive OPSD on six representative benchmarks.

What carries the argument

The GUI-SD framework's visually enriched privileged context (target bounding box plus Gaussian soft mask) and entropy-guided distillation that weights tokens by digit significance and teacher confidence.
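
As a concrete picture of that privileged context, here is a minimal sketch of a Gaussian soft mask peaked at the target box. Tying the spread to the box size via `sigma_scale` is a hypothetical parameterization; the text quoted here does not specify one.

```python
import numpy as np

def gaussian_soft_mask(h, w, bbox, sigma_scale=0.5):
    """Soft spatial mask peaked at the centre of the target bounding box.

    bbox = (x0, y0, x1, y1) in pixels. sigma_scale (assumed here) ties the
    Gaussian spread to the box dimensions.
    """
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    sx = max(sigma_scale * (x1 - x0), 1.0)
    sy = max(sigma_scale * (y1 - y0), 1.0)
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)

# The mask peaks at the box centre and decays smoothly, so the teacher sees
# an approximate location cue rather than the coordinate string itself.
mask = gaussian_soft_mask(100, 100, (40, 40, 60, 60))
assert np.unravel_index(mask.argmax(), mask.shape) == (50, 50)
```

Even this soft version still peaks at the ground-truth target, which is precisely the leakage question the referee report below presses on.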

If this is right

  • GUI-SD outperforms GRPO-based methods in accuracy across six benchmarks.
  • It requires fewer computational resources by avoiding multiple rollouts.
  • Entropy-guided distillation focuses learning on significant digits and confident predictions.
  • The method provides a dense supervision signal from a single rollout, even on hard samples where GRPO's reward is zero.
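
The entropy-guided weighting in the bullets above can be illustrated with a toy weight function. The exact scheme is the paper's; this sketch only combines the two ingredients its abstract names, digit place value and teacher confidence derived from entropy, under assumed formulas.

```python
import numpy as np

def token_weights(teacher_probs, tokens, base=10.0):
    """Toy entropy-guided weights: digit significance x teacher confidence.

    teacher_probs: (T, V) teacher distributions over the vocabulary.
    tokens: decoded token strings; within a run of digits the leftmost
    (most significant) digit gets the largest weight.
    """
    V = teacher_probs.shape[-1]
    entropy = -(teacher_probs * np.log(teacher_probs + 1e-12)).sum(axis=-1)
    confidence = 1.0 - entropy / np.log(V)   # 1 = fully confident teacher
    sig = np.ones(len(tokens))
    place = 0
    for i in reversed(range(len(tokens))):   # scan right to left
        if tokens[i].isdigit():
            sig[i] = base ** place           # hundreds > tens > ones
            place += 1
        else:
            place = 0
    return sig * confidence
```

Under this sketch, a confident teacher and a hundreds-place digit dominate the loss, while tokens the teacher is itself unsure about are down-weighted.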

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Extending GUI-SD to other multimodal grounding tasks could improve efficiency in agent training.
  • The privileged context design might inspire similar techniques in other self-distillation scenarios to prevent information leakage.
  • Testing GUI-SD on larger models or different architectures would reveal its scalability.

Load-bearing premise

The visually enriched privileged context supplies useful guidance to the teacher without leaking the exact target coordinates, and entropy-guided weighting reliably concentrates learning on the most impactful and reliable tokens.

What would settle it

An ablation study removing the Gaussian soft mask and showing performance dropping to the level of naive OPSD would falsify the claim that the privileged context provides non-leaking guidance.

Figures

Figures reproduced from arXiv: 2605.00642 by Can Ma, Daiqing Wu, Huawen Shen, Yan Zhang, Yu Zhou.

Figure 1. (a) GRPO requires expensive multiple rollouts and produces zero reward on hard samples. (b) Naive OPSD forwards the policy twice and distills via reverse KL between student and teacher logits with uniform per-token weight w = 1.0, yet suffers from distillation-to-SFT collapse and indiscriminate optimization. (c) Ours addresses both issues via visual privileged guidance and entropy-guided optimization. …
Figure 2. Per-token analysis of teacher and student predictions on incorrectly predicted tokens across …
Figure 3. Overview of the GUI-SD framework. (a) The teacher branch receives a privileged context …
Figure 4. Training dynamics of GUI-SD, Standard Reverse KL, and GRPO-Gaussian over optimization …
Original abstract

Graphical User Interface (GUI) grounding maps natural language instructions to the visual coordinates of target elements and serves as a core capability for autonomous GUI agents. Recent reinforcement learning methods (e.g., GRPO) have achieved strong performance, but they rely on expensive multiple rollouts and suffer from sparse signals on hard samples. These limitations make on-policy self-distillation (OPSD), which provides dense token-level supervision from a single rollout, a promising alternative. However, its applicability to GUI grounding remains unexplored. In this paper, we present GUI-SD, the first OPSD framework tailored for GUI grounding. First, it constructs a visually enriched privileged context for the teacher using a target bounding box and a Gaussian soft mask, providing informative guidance without leaking exact coordinates. Second, it employs entropy-guided distillation, which adaptively weights tokens based on digit significance and teacher confidence, concentrating optimization on the most impactful and reliable positions. Extensive experiments on six representative GUI grounding benchmarks show that GUI-SD consistently outperforms GRPO-based methods and naive OPSD in both accuracy and training efficiency. Code and training data are available at https://zhangyan-ucas.github.io/GUI-SD/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces GUI-SD as the first on-policy self-distillation (OPSD) framework for GUI grounding. It constructs a visually enriched privileged context for the teacher via the target bounding box plus a Gaussian soft mask, applies entropy-guided distillation that weights tokens by digit significance and teacher confidence, and reports consistent outperformance over GRPO-based methods and naive OPSD across six GUI grounding benchmarks in both accuracy and training efficiency.

Significance. If the empirical claims hold, the work offers a computationally lighter alternative to rollout-heavy RL methods for training GUI agents, addressing sparse reward issues on hard samples while maintaining dense token-level supervision. The public release of code and training data is a clear strength that supports reproducibility.

major comments (1)
  1. [Privileged context construction] The central methodological claim (abstract and method description) asserts that the Gaussian soft mask 'provides informative guidance without leaking exact coordinates.' Because the mask is centered on the ground-truth target, its spatial peak directly encodes approximate location information unavailable to the student or to naive OPSD. An ablation that recenters the mask at random locations (while preserving shape and variance) is required to confirm that reported gains over baselines are attributable to entropy-guided weighting and on-policy distillation rather than this privileged cue.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by including at least one or two key quantitative deltas (e.g., average accuracy lift and training-time reduction) rather than only the qualitative statement of 'consistent outperformance.'
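
The ablation requested in the major comment is cheap to implement: keep the Gaussian's shape and variance but discard the ground-truth centre. A sketch, assuming a Gaussian parameterized by box size (a hypothetical choice, since the excerpt does not give the mask's construction):

```python
import numpy as np

def recentered_mask(h, w, bbox, rng, sigma_scale=0.5):
    """Control mask for the ablation: same Gaussian shape and variance as a
    box-centred privileged mask, but placed at a random location.

    If GUI-SD's gains persist with this mask, they stem from entropy-guided
    on-policy distillation; if they collapse toward naive OPSD, the original
    mask was leaking target location. sigma_scale is an assumed knob.
    """
    x0, y0, x1, y1 = bbox
    sx = max(sigma_scale * (x1 - x0), 1.0)         # spread from box width
    sy = max(sigma_scale * (y1 - y0), 1.0)         # spread from box height
    cx, cy = rng.uniform(0, w), rng.uniform(0, h)  # random centre, not GT
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-(((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2) / 2.0)
```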

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and detailed review of our manuscript. We address the single major comment below.

Point-by-point responses
  1. Referee: [Privileged context construction] The central methodological claim (abstract and method description) asserts that the Gaussian soft mask 'provides informative guidance without leaking exact coordinates.' Because the mask is centered on the ground-truth target, its spatial peak directly encodes approximate location information unavailable to the student or to naive OPSD. An ablation that recenters the mask at random locations (while preserving shape and variance) is required to confirm that reported gains over baselines are attributable to entropy-guided weighting and on-policy distillation rather than this privileged cue.

    Authors: We appreciate the referee's observation. The Gaussian soft mask is indeed centered on the ground-truth target, which provides a soft spatial prior unavailable to the student or naive OPSD. While the mask remains probabilistic and does not encode exact pixel coordinates, it does convey approximate location information. To rigorously isolate this effect from the entropy-guided weighting and on-policy distillation, we agree that the proposed ablation (recentering the mask at random locations while preserving shape and variance) is necessary. We will conduct this experiment and include the results in the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework validated on external benchmarks

full rationale

The paper introduces GUI-SD as an on-policy self-distillation method for GUI grounding, consisting of a privileged teacher context (bounding box + Gaussian mask) and entropy-guided token weighting. These are presented as design choices, not derived predictions. Performance claims rest entirely on comparative experiments across six independent benchmarks against GRPO and naive OPSD baselines. No equations, fitted parameters renamed as predictions, or self-citations appear in the provided text. The derivation chain is absent; the work is self-contained via external empirical evaluation rather than internal reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on standard supervised-distillation assumptions plus one domain-specific assumption about non-leaking privileged context; no new physical entities are postulated and the only free parameters are typical training hyperparameters.

free parameters (1)
  • Gaussian mask spread
    The width of the soft mask around the target box is a tunable hyperparameter required to construct the privileged context.
axioms (1)
  • domain assumption Privileged visual context supplies useful guidance without coordinate leakage
    Invoked when constructing the teacher input from the target bounding box and Gaussian mask.

pith-pipeline@v0.9.0 · 5515 in / 1217 out tokens · 61802 ms · 2026-05-12T01:58:21.866201+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

52 extracted references · 52 canonical work pages · 17 internal anchors

  1. [1]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  2. [2]

    GUI-Eyes: Tool-Augmented Perception for Visual Grounding in GUI Agents

    Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, and Wu Liu. Gui-eyes: Tool-augmented perception for visual grounding in gui agents.arXiv preprint arXiv:2601.09770, 2026

  3. [3]

    UI-Ins: Enhancing GUI Grounding with Multi-Perspective Instruction-as-Reasoning

    Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Quyu Kong, Xu Zhang, Chen Liu, Yuqi Liu, Wenxuan Wang, et al. Ui-ins: Enhancing gui grounding with multi-perspective instruction-as-reasoning.arXiv preprint arXiv:2510.20286, 2025

  4. [4]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

  5. [5]

    WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

    Sicheng Fan, Qingyun Shi, Shengze Xu, Shengbo Cai, Tieyong Zeng, Li Ling, Yanyi Shang, and Dehan Kong. Webfactory: Automated compression of foundational language intelligence into grounded web agents.arXiv preprint arXiv:2603.05044, 2026

  6. [6]

    Gui-bee: Align gui action grounding to novel environments via autonomous exploration

    Yue Fan, Handong Zhao, Ruiyi Zhang, Yu Shen, Xin Eric Wang, and Gang Wu. Gui-bee: Align gui action grounding to novel environments via autonomous exploration. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 33249–33266, 2025

  7. [7]

    Navigating the Digital World as Humans Do: Universal Visual Grounding for GUI Agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for gui agents.arXiv preprint arXiv:2410.05243, 2024

  8. [8]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  9. [9]

    MolmoWeb: Open Visual Web Agent and Open Data for the Open Web

    Tanmay Gupta, Piper Wolters, Zixian Ma, Peter Sushko, Rock Yuren Pang, Diego Llanes, Yue Yang, Taira Anderson, Boyuan Zheng, Zhongzheng Ren, et al. Molmoweb: Open visual web agent and open data for the open web.arXiv preprint arXiv:2604.08516, 2026

  10. [10]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  11. [11]

    MobileIPL: Enhancing Mobile Agents' Thinking Process via Iterative Preference Learning

    Kun Huang, Weikai Xu, Yuxuan Liu, Quandong Wang, Pengzhi Gao, Wei Liu, Jian Luan, Bin Wang, and Bo An. Mobileipl: Enhancing mobile agents thinking process via iterative preference learning.arXiv preprint arXiv:2505.12299, 2025

  12. [12]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  13. [13]

    Todi: Token-Wise Distillation via Fine-Grained Divergence Control

    Seongryong Jung, Suwan Yoon, DongGeon Kim, and Hwanhee Lee. Todi: Token-wise distillation via fine-grained divergence control. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8089–8102, 2025

  14. [14]

    GUIRLVG: Incentivize GUI Visual Grounding via Empirical Exploration on Reinforcement Learning

    Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, and Yan Yan. Guirlvg: Incentivize gui visual grounding via empirical exploration on reinforcement learning.arXiv preprint arXiv:2508.04389, 2025

  15. [15]

    Trust the Uncertain Teacher: Distilling Dark Knowledge via Calibrated Uncertainty

    Jeonghyun Kim, SooKyung Kim, Richeng Xuan, and Hyunsoo Cho. Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty.arXiv preprint arXiv:2602.12687, 2026

  16. [16]

    ComputerRL: Scaling End-to-End Online Reinforcement Learning for Computer Use Agents

    Hanyu Lai, Xiao Liu, Yanxiao Zhao, Han Xu, Hanchen Zhang, Bohao Jing, Yanyu Ren, Shuntian Yao, Yuxiao Dong, and Jie Tang. Computerrl: Scaling end-to-end online reinforcement learning for computer use agents.arXiv preprint arXiv:2508.14040, 2025

  17. [17]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. InProceedings of the 33rd ACM International Conference on Multimedia, pages 8778– 8786, 2025

  18. [18]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe.arXiv preprint arXiv:2604.13016, 2026

  19. [19]

    InfiGUI-R1: Advancing Multimodal GUI Agents from Reactive Actors to Deliberative Reasoners

    Yuhang Liu, Pengxiang Li, Congkai Xie, Xavier Hu, Xiaotian Han, Shengyu Zhang, Hongxia Yang, and Fei Wu. Infigui-r1: Advancing multimodal gui agents from reactive actors to deliberative reasoners.arXiv preprint arXiv:2504.14239, 2025

  20. [20]

    ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

    Zhaoyang Liu, JingJing Xie, Zichen Ding, Zehao Li, Bowen Yang, Zhenyu Wu, Xuehui Wang, Qiushi Sun, Shi Liu, Weiyun Wang, et al. Scalecua: Scaling open-source computer use agents with cross-platform data.arXiv preprint arXiv:2509.15221, 2025

  21. [21]

    Zoom to Essence: Trainless GUI Grounding by Inferring upon Interface Elements

    Ziwei Liu, Tao Feng, Borui Kang, Yanbing Yang, and Jun Luo. Zoom to essence: Trainless gui grounding by inferring upon interface elements.arXiv preprint arXiv:2603.14448, 2026

  22. [22]

    Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Han Xiao, Shuai Ren, Pengxiang Zhao, Guangyi Liu, et al. Ui-r1: Enhancing efficient action prediction of gui agents by reinforcement learning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 17608–17616, 2026

  23. [23]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents.arXiv preprint arXiv:2504.10458, 2025

  24. [24]

    UI-Vision: A Desktop-Centric GUI Benchmark for Visual Perception and Interaction

    Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui- vision: A desktop-centric gui benchmark for visual perception and interaction.arXiv preprint arXiv:2503.15661, 2025

  25. [25]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025

  26. [26]

    POPE: Learning to Reason on Hard Problems via Privileged On-Policy Exploration

    Yuxiao Qu, Amrith Setlur, Virginia Smith, Ruslan Salakhutdinov, and Aviral Kumar. Pope: Learning to reason on hard problems via privileged on-policy exploration.arXiv preprint arXiv:2601.18779, 2026

  27. [27]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  28. [28]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  29. [29]

    A Survey of On-Policy Distillation for Large Language Models

    Mingyang Song and Mao Zheng. A survey of on-policy distillation for large language models. arXiv preprint arXiv:2604.00626, 2026

  30. [30]

    Expanding the Capabilities of Reinforcement Learning via Text Feedback

    Yuda Song, Lili Chen, Fahim Tajwar, Remi Munos, Deepak Pathak, J Andrew Bagnell, Aarti Singh, and Andrea Zanette. Expanding the capabilities of reinforcement learning via text feedback.arXiv preprint arXiv:2602.02482, 2026

  31. [31]

    Ea-kd: Entropy-based adaptive knowledge distillation

    Chi-Ping Su, Ching-Hsun Tseng, Bin Pu, Lei Zhao, Jiewen Yang, Zhuangzhuang Chen, and Shin-Jye Lee. Ea-kd: Entropy-based adaptive knowledge distillation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 731–740, 2025

  32. [32]

    Gui-g2: Gaussian reward modeling for gui grounding

    Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. Gui-g2: Gaussian reward modeling for gui grounding. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33214–33222, 2026

  33. [33]

    LPO: Towards Accurate GUI Agent Interaction via Location Preference Optimization

    Jiaqi Tang, Yu Xia, Yi-Feng Wu, Yuwei Hu, Yuhui Chen, Qing-Guo Chen, Xiaogang Xu, Xiangyu Wu, Hao Lu, Yanqing Ma, et al. Lpo: Towards accurate gui agent interaction via location preference optimization.arXiv preprint arXiv:2506.09373, 2025

  34. [34]

    UI-Venus-1.5 Technical Report

    Venus Team, Changlong Gao, Zhangxuan Gu, Yulin Liu, Xinyu Qiu, Shuheng Shen, Yue Wen, Tianyu Xia, Zhenyu Xu, Zhengwen Zeng, et al. Ui-venus-1.5 technical report.arXiv preprint arXiv:2602.09082, 2026

  35. [35]

    Learning While Staying Curious: Entropy-Preserving Supervised Fine-Tuning via Adaptive Self-Distillation for Large Reasoning Models

    Hao Wang, Hao Gu, Hongming Piao, Kaixiong Gong, Yuxiao Ye, Xiangyu Yue, Sirui Han, Yike Guo, and Dapeng Wu. Learning while staying curious: Entropy-preserving supervised fine-tuning via adaptive self-distillation for large reasoning models. arXiv preprint arXiv:2602.02244, 2026

  36. [36]

    Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding

    Wenkai Wang, Xiyun Li, Hongcan Guo, Wenhao Yu, Tianqing Fang, Haitao Mi, Dong Yu, and Shengyu Zhang. Measure twice, click once: Co-evolving proposer and visual critic via reinforcement learning for gui grounding.arXiv preprint arXiv:2604.21268, 2026

  37. [37]

    MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents

    Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents.arXiv preprint arXiv:2507.19478, 2025

  38. [38]

    Zooming without Zooming: Region-to-Image Distillation for Fine-Grained Multimodal Perception

    Lai Wei, Liangbo He, Jun Lan, Lingzhong Dong, Yutong Cai, Siyuan Li, Huijia Zhu, Weiqiang Wang, Linghe Kong, Yue Wang, et al. Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception.arXiv preprint arXiv:2602.11858, 2026

  39. [39]

    GUI-Actor: Coordinate-Free Visual Grounding for GUI Agents

    Qianhui Wu, Kanzhi Cheng, Rui Yang, Chaoyun Zhang, Jianwei Yang, Huiqiang Jiang, Jian Mu, Baolin Peng, Bo Qiao, Reuben Tan, et al. Gui-actor: Coordinate-free visual grounding for gui agents.arXiv preprint arXiv:2506.03143, 2025

  40. [40]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents, 2024.URL https://arxiv. org/abs/2410.23218

  41. [42]

    Scaling Computer-Use Grounding via User Interface Decomposition and Synthesis

    Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis.arXiv preprint arXiv:2505.13227, 2025

  42. [43]

    Mobilerl: Online agentic reinforcement learning for mobile gui agents

    Yifan Xu, Xiao Liu, Xinghan Liu, Jiaqi Fu, Hanchen Zhang, Bohao Jing, Shudan Zhang, Yuting Wang, Wenyi Zhao, and Yuxiao Dong. Mobilerl: Online agentic reinforcement learning for mobile gui agents.arXiv preprint arXiv:2509.18119, 2025

  43. [44]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr. arXiv preprint arXiv:2604.03128, 2026

  44. [45]

    GTA1: GUI Test-Time Scaling Agent

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, et al. Gta1: Gui test-time scaling agent.arXiv preprint arXiv:2507.05791, 2025

  45. [48]

    Enhancing Visual Grounding for GUI Agents via Self-Evolutionary Reinforcement Learning

    Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning.arXiv preprint arXiv:2505.12370, 2025

  46. [49]

    Fdc-ground: Improving grpo for gui grounding via exponential rewards and fact-aligned pruning

    Xiangjian Zeng, Wenjing Li, Qingqiang Wu, and Liang Zhang. Fdc-ground: Improving grpo for gui grounding via exponential rewards and fact-aligned pruning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 28122–28130, 2026

  47. [50]

    TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

    Bofei Zhang, Zirui Shang, Zhi Gao, Wang Zhang, Rui Xie, Xiaojian Ma, Tao Yuan, Xinxiao Wu, Song-Chun Zhu, and Qing Li. Tongui: Building generalized gui agents by learning from multimodal web tutorials.arXiv e-prints, pages arXiv–2504, 2025

  48. [51]

    HyperClick: Advancing Reliable GUI Grounding via Uncertainty Calibration

    Shaojie Zhang, Pei Fu, Ruoceng Zhang, Jiahui Yang, Anan Du, Xiuwen Xi, Shaokang Wang, Ying Huang, Bin Qin, Zhenbo Luo, et al. Hyperclick: Advancing reliable gui grounding via uncertainty calibration.arXiv preprint arXiv:2510.27266, 2025

  49. [52]

    Btl-ui: Blink-think-link reasoning model for gui agent

    Shaojie Zhang, Ruoceng Zhang, Pei Fu, Shaokang Wang, Jiahui Yang, Xin Du, Shiqi Cui, Bin Qin, Ying Huang, Zhenbo Luo, et al. Btl-ui: Blink-think-link reasoning model for gui agent. arXiv preprint arXiv:2509.15566, 2025

  50. [53]

    OPSDL: On-Policy Self-Distillation for Long-Context Language Models

    Xinsen Zhang, Zhenkai Ding, Tianjun Pan, Run Yang, Chun Kang, Xue Xiong, and Jingnan Gu. Opsdl: On-policy self-distillation for long-context language models.arXiv preprint arXiv:2604.17535, 2026

  51. [54]

    Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

    Yuan Zhao, Hualei Zhu, Tingyu Jiang, Shen Li, Xiaohang Xu, and Hao Henry Wang. Co-epg: A framework for co-evolution of planning and grounding in autonomous gui agents. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 36582–36590, 2026

  52. [55]

    GUI-G1: Understanding R1-Zero-Like Training for Visual Grounding in GUI Agents

    Yuqi Zhou, Sunhao Dai, Shuai Wang, Kaiwen Zhou, Qinglin Jia, and Jun Xu. Gui-g1: Understanding r1-zero-like training for visual grounding in gui agents. arXiv preprint arXiv:2505.15810, 2025