pith. machine review for the scientific record.

arxiv: 2605.12549 · v1 · submitted 2026-05-10 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI grounding · Vision-Language Models · prefill stage · attention mechanism · training-free method · UI elements · coordinate prediction · multimodal inference

The pith

GUI grounding in VLMs follows a two-stage process where the prefill stage selects candidate UI elements that the decoding stage cannot correct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language models handle GUI grounding in two distinct phases. During prefill, the model identifies which UI elements are likely targets using attention patterns driven by the instruction. During decoding, it only refines the exact coordinates, so an early mistake in element selection stays uncorrected. To fix this bottleneck, the authors introduce Re-Prefill, a training-free step that extracts the most-attended visual tokens and re-appends them together with the instruction hidden states so the model can reconsider its choice before outputting coordinates. Experiments across four models and five benchmarks report consistent gains, up to 4.3% on ScreenSpot-Pro.

Core claim

Grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Re-Prefill extracts visual tokens that consistently receive high attention from the query position across layers as a preliminary target hypothesis and appends them to the input together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation.
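
To make the selection step concrete, here is a minimal sketch of one way to pick visual tokens that are consistently attended by the final query position across layers. It is not the authors' implementation: it assumes a HuggingFace-style VLM that returns per-layer attention maps, and the function name and default top-k are illustrative.

```python
import torch

def select_key_visual_tokens(attentions, visual_start, visual_end, top_k=10):
    """Pick visual tokens that receive consistently high attention from the
    final (query) position across decoder layers.

    attentions   : tuple of per-layer tensors, each [batch, heads, seq, seq]
    visual_start : index of the first visual token in the input sequence
    visual_end   : index one past the last visual token
    """
    per_layer = []
    for layer_attn in attentions:
        # attention from the last token to the visual tokens, averaged over heads
        q_to_visual = layer_attn[0, :, -1, visual_start:visual_end].mean(dim=0)
        per_layer.append(q_to_visual)
    # averaging across layers favors tokens that stay highly attended throughout
    scores = torch.stack(per_layer, dim=0).mean(dim=0)
    k = min(top_k, scores.numel())
    top = scores.topk(k).indices + visual_start
    return top.sort().values  # sequence indices of the candidate target tokens

# Usage with a HuggingFace-style model (illustrative):
# out = model(**inputs, output_attentions=True)
# key_ids = select_key_visual_tokens(out.attentions, v_start, v_end)
```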

What carries the argument

Attention-guided second prefill that re-appends visual tokens receiving consistently high attention from the final query token across layers, together with instruction hidden states, to refine the initial candidate hypothesis.
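
A rough sketch of the re-append step follows, under the simplifying assumption that the selected contextualized visual tokens and the instruction hidden states are concatenated once onto the embedded input; the paper's actual injection is layer-wise (see Figure 2), so this flat version is only an approximation and all names are illustrative.

```python
import torch

def build_second_prefill_inputs(input_embeds, hidden_states, key_token_ids,
                                instr_start, instr_end):
    """Assemble an extended input for the second prefill.

    input_embeds  : [1, seq, d] embeddings of the original input [S; V; T]
    hidden_states : [1, seq, d] last-layer states from the first prefill
    key_token_ids : indices of the selected visual tokens (the V* hypothesis)
    instr_start, instr_end : slice covering the instruction tokens
    """
    v_star = hidden_states[:, key_token_ids, :]            # candidate regions
    t_tilde = hidden_states[:, instr_start:instr_end, :]   # instruction states
    return torch.cat([input_embeds, v_star, t_tilde], dim=1)

# second = model(inputs_embeds=build_second_prefill_inputs(...))
# coordinates are then decoded from `second` as in a normal generation pass
```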

If this is right

  • Errors made during the first prefill cannot be recovered in the decoding phase, making early candidate selection the dominant source of grounding failure.
  • Re-appending high-attention visual tokens with instruction states produces measurable accuracy gains on ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI without any training.
  • The same attention pattern works across four different VLMs, indicating the two-stage behavior is a general property rather than model-specific.
  • Decoding only adjusts coordinates once the candidate set is fixed, so further coordinate-level improvements yield diminishing returns if the wrong element was chosen early.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Early attention maps in VLMs may already encode most of the spatial decision for grounding tasks, suggesting similar re-prefill tricks could help other coordinate or region-output problems.
  • If the high-attention tokens prove stable across layers, one could extract the candidate set after the first few layers and skip later computation in latency-sensitive settings (a stability check of this kind is sketched after this list).
  • The method implies that progressive interaction among visual tokens, rather than independent forward passes, is a useful direction for training-free GUI agents.
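
One way to probe the layer-stability conjecture above is to compare the candidate set selected from only the first few layers with the set selected from all layers. A minimal sketch, reusing select_key_visual_tokens from the earlier sketch; the layer cutoff and the overlap metric are editorial choices, not from the paper.

```python
def early_vs_full_overlap(attentions, visual_start, visual_end,
                          early_layers=8, top_k=10):
    """Jaccard overlap between candidates picked from the first `early_layers`
    layers and candidates picked from all layers; values near 1.0 would
    support an early-exit variant."""
    early = set(select_key_visual_tokens(attentions[:early_layers],
                                         visual_start, visual_end, top_k).tolist())
    full = set(select_key_visual_tokens(attentions,
                                        visual_start, visual_end, top_k).tolist())
    return len(early & full) / len(early | full)
```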

Load-bearing premise

Visual tokens that receive high attention from the query position across layers reliably mark the correct target element, and re-appending them lets the model improve its choice without adding noise or bias.

What would settle it

Running Re-Prefill on the same models and benchmarks yields no accuracy gain, or the high-attention tokens selected from the first prefill show no better correlation with ground-truth elements than random visual tokens.
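
The second test could be run with a few lines of glue code: measure how often the selected tokens fall inside the ground-truth element and compare against a random baseline of the same size. The sketch below assumes a regular patch grid and a pixel-space ground-truth box; the geometry helpers are illustrative, not from the paper.

```python
import random

def token_center(token_idx, grid_w, patch_size):
    """Pixel-space center of a visual token on a grid_w-wide patch grid."""
    cx = (token_idx % grid_w + 0.5) * patch_size
    cy = (token_idx // grid_w + 0.5) * patch_size
    return cx, cy

def hit_rate(token_indices, grid_w, patch_size, gt_box):
    """Fraction of tokens whose center falls inside the ground-truth box."""
    x0, y0, x1, y1 = gt_box
    hits = 0
    for t in token_indices:
        cx, cy = token_center(t, grid_w, patch_size)
        hits += int(x0 <= cx <= x1 and y0 <= cy <= y1)
    return hits / max(len(token_indices), 1)

# selected: visual-token indices chosen by the attention criterion
#           (expressed relative to the visual grid, not the full sequence)
# baseline = random.sample(range(num_visual_tokens), k=len(selected))
# If hit_rate(selected, ...) is no better than hit_rate(baseline, ...),
# the load-bearing premise above does not hold.
```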

Figures

Figures reproduced from arXiv: 2605.12549 by Fei Shen, Fei Yu, Haizhou Li, Jiaping Lin, Junzhe Li, Ming Li, Ping Nie.

Figure 1: Prefill vs. Re-Prefill vs. Decoding. (a) Query-position attention heatmaps over visual tokens. Re-Prefill produces a sharper, more focused distribution that disambiguates the correct target from other candidates. Additional visualizations are provided in Appendix C. (b) Spatial variance of query-position attention across generation steps. The sharp drop after the first generated token shows that target sel… view at source ↗
Figure 2: Overview of Re-Prefill. (1) Prefill. The input [S; V; T] is processed through L decoder layers to obtain contextualized representations [S˜; V˜; T˜]. (2) Key visual token selection. Visual tokens that consistently receive high attention across layers are selected as V˜∗, representing candidate target regions. (3) Layer-wise second prefill. A copy of the original input is re-encoded with layer-wise pref… view at source ↗
Figure 3: Query-position attention heatmaps across stages on ScreenSpot-Pro. The first two panels illustrate the baseline transition, while the last two panels show the corresponding transition under Re-Prefill. The blue rectangle marks the ground-truth target, and the orange circle indicates the predicted coordinate. Re-Prefill focuses attention on the correct region during prefill, suppresses distractors, and lead… view at source ↗
Figure 5: Effect of Lc. The optimum at Lc=3 balances two modes. For small Lc, insufficient semantic alignment arises between uncontextualized input tokens and the first-prefill prefix (red zone). For large Lc, noise from unrelated tokens propagates into deeper layers (grey zone). … view at source ↗
Figure 6: Query-position attention heatmaps across stages on ScreenSpot-Pro. Each row shows one example. Columns 1–2 present the baseline transition from prefill to the first decoding step, while Columns 3–4 show the corresponding transition with Re-Prefill. The blue rectangle marks the ground-truth target, and the orange circle indicates the predicted coordinate. Compared to the baseline, Re-Prefill focuses attenti… view at source ↗
Figure 7: Spatial variance and prefill-stage error analysis across models and benchmarks. Rows 1–2 show results for Qwen3-VL-8B-Instruct, and Rows 3–4 for GUI-Owl-1.5-8B-Instruct. For each model, the first row shows spatial variance across generation steps, and the second row shows attention-centroid deviation for correct and incorrect predictions. Across all settings, attention is dispersed at prefill and rapidly c… view at source ↗
Original abstract

Existing training-free approaches for GUI grounding often rely on multiple inference runs, such as iterative cropping or candidate aggregation, to identify target elements. Despite this additional computation, each forward pass still independently interprets the instruction and parses the visual layout, without enabling progressive interaction among visual tokens. In this paper, we study what happens during GUI grounding in Vision-Language Models (VLMs) and identify a previously overlooked bottleneck. We show that grounding follows a two-stage paradigm: the prefill stage determines candidate UI elements, while the decoding stage subsequently refines the final coordinates. This asymmetry establishes prefill as the critical step, as errors in candidate selection cannot be effectively corrected during decoding. Based on this observation, we propose Re-Prefill, a training-free method that revisits inference by introducing an attention-guided second prefill stage to refine target selection. Specifically, visual tokens that consistently receive high attention from the query position, i.e., the final token, across layers are extracted as a preliminary target hypothesis and appended to the input, together with the instruction hidden states, enabling the model to deeply re-think its decision before coordinate generation. Experiments across four VLMs and five benchmarks, including ScreenSpot-Pro, ScreenSpot-V2, OSWorld-G, UI-Vision, and MMBench-GUI, demonstrate consistent improvements without additional training, with gains of up to 4.3% on ScreenSpot-Pro. Code will be available at https://github.com/linjiaping1/Re-Prefill.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that GUI grounding in VLMs follows a two-stage process in which the prefill stage selects candidate UI elements (via attention from the final token) while the decoding stage only refines coordinates, and that prefill errors are largely irreversible. It introduces Re-Prefill, a training-free inference modification that extracts high-attention visual tokens, re-appends them with instruction hidden states for a second prefill, and reports consistent gains (up to 4.3%) across four VLMs and five benchmarks.

Significance. If the two-stage asymmetry and irreversibility hold, the work supplies a mechanistic insight into VLM inference for grounding and a lightweight, parameter-free improvement that avoids the multiple forward passes of prior training-free methods. The empirical consistency across models and benchmarks is a strength, though the absence of direct causal interventions limits the strength of the irreversibility claim.

major comments (2)
  1. [Abstract and §3 (method)] The central irreversibility claim (prefill errors cannot be corrected in decoding) is load-bearing for the two-stage paradigm yet rests on indirect evidence: observed gains from Re-Prefill and the attention patterns. No direct intervention (attention masking of high-attention tokens, forced incorrect candidates, or ablation of the re-prefill step) is described to test whether decoding can recover from deliberately introduced prefill errors.
  2. [§4] Baseline comparisons and Re-Prefill results are reported without explicit confirmation that all methods used identical inference settings, temperature, or attention-extraction thresholds; the reported gains could be inflated by uncontrolled differences in implementation.
minor comments (2)
  1. [§3] Clarify the exact criterion and threshold used to select 'consistently high attention' visual tokens across layers; the description in the abstract is qualitative.
  2. [Abstract] The paper states 'Code will be available'; confirm that the released repository will include the precise attention-extraction and re-prefill implementation details needed for reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments point-by-point below and will revise the manuscript to strengthen the claims with additional experiments and clarifications.

Point-by-point responses
  1. Referee: [Abstract and §3 (method)] The central irreversibility claim (prefill errors cannot be corrected in decoding) is load-bearing for the two-stage paradigm yet rests on indirect evidence: observed gains from Re-Prefill and the attention patterns. No direct intervention (attention masking of high-attention tokens, forced incorrect candidates, or ablation of the re-prefill step) is described to test whether decoding can recover from deliberately introduced prefill errors.

    Authors: We agree that direct causal interventions would provide stronger evidence for the irreversibility of prefill errors. In the revised manuscript, we will add two new experiments in §3 and §4: (1) attention masking of the top-attended visual tokens during the initial prefill to measure whether decoding can still produce correct coordinates, and (2) forced injection of incorrect candidate tokens to test recovery capability in the decoding stage. These interventions will directly test the two-stage asymmetry beyond the current indirect evidence from attention patterns and Re-Prefill gains (a minimal sketch of one such masking setup follows these responses). revision: yes

  2. Referee: [§4] Baseline comparisons and Re-Prefill results are reported without explicit confirmation that all methods used identical inference settings, temperature, or attention-extraction thresholds; the reported gains could be inflated by uncontrolled differences in implementation.

    Authors: All reported results used identical inference settings across baselines and Re-Prefill: temperature=0 for deterministic outputs, the same top-k=10 attention threshold per layer (averaged over layers), and consistent model loading and prompt formatting. We will explicitly document these settings in the revised §4, add a dedicated paragraph on implementation details, and release the exact evaluation scripts to ensure full reproducibility. revision: yes
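
To make the masking intervention from response 1 concrete, a minimal sketch of one possible setup is given below. It is not the authors' code: it builds an additive attention bias that blocks attention to the top-attended visual tokens, to be applied inside a patched attention forward during the first prefill; whether decoding then still recovers the correct coordinates is the question the experiment would answer.

```python
import torch

def build_masking_bias(seq_len, masked_token_ids, device="cpu"):
    """Additive attention bias that blocks all attention to the given
    visual tokens; add it to the attention logits before the softmax."""
    bias = torch.zeros(seq_len, seq_len, device=device)
    bias[:, masked_token_ids] = float("-inf")
    return bias

# Outline of the intervention (hooking the bias into each layer is
# model-specific and not shown):
# 1. run the normal prefill and record the top-attended visual tokens
# 2. re-run the prefill with build_masking_bias applied at every layer
# 3. decode coordinates in both runs; if accuracy collapses under masking
#    and decoding never recovers, the irreversibility claim is supported
```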

Circularity Check

0 steps flagged

No circularity in empirical attention analysis

Full rationale

The paper conducts an empirical study of attention patterns from the final query token across layers in frozen VLMs, observes that high-attention visual tokens correlate with candidate UI elements, and uses this to motivate a training-free Re-Prefill procedure that re-appends those tokens plus instruction states. No mathematical derivation, parameter fitting, or first-principles claim reduces to its own inputs by construction. The two-stage paradigm is an interpretive summary of observed behavior rather than a self-defined quantity, and the method is directly tested on external benchmarks without renaming known results or relying on load-bearing self-citations for uniqueness. The central claim remains independently falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on standard transformer attention behavior without new fitted parameters or invented entities.

axioms (1)
  • domain assumption: Attention scores from the final query token to visual tokens indicate relevance for target element selection in GUI grounding.
    Used to select high-attention tokens as the preliminary hypothesis for the second prefill.

pith-pipeline@v0.9.0 · 5585 in / 1279 out tokens · 43495 ms · 2026-05-14T21:40:43.236562+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 4 internal anchors

  1. [1]

    Gui agents: A survey

    Dang Nguyen, Jian Chen, Yu Wang, Gang Wu, Namyong Park, Zhengmian Hu, Hanjia Lyu, Junda Wu, Ryan Aponte, Yu Xia, et al. Gui agents: A survey. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22522–22538, 2025

  2. [2]

    Large language model-brained gui agents: A survey

    Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Guyue Liu, Qingwei Lin, et al. Large language model-brained gui agents: A survey. arXiv preprint arXiv:2411.18279, 2024

  3. [3]

    Gui agents with foundation models: A comprehensive survey

    Shuai Wang, Weiwen Liu, Jingxuan Chen, Yuqi Zhou, Weinan Gan, Xingshan Zeng, Yuhan Che, Shuai Yu, Xinlong Hao, Kun Shao, et al. Gui agents with foundation models: A comprehensive survey. arXiv preprint arXiv:2411.04890, 2024

  4. [4]

    GTA1: GUI test-time scaling agent

    Yan Yang, Dongxu Li, Yutong Dai, Yuhao Yang, Ziyang Luo, Zirui Zhao, Zhiyuan Hu, Junzhe Huang, Amrita Saha, Zeyuan Chen, Ran Xu, Liyuan Pan, Caiming Xiong, and Junnan Li. GTA1: GUI test-time scaling agent. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=3VIPmz7iAi

  5. [5]

    Gui-g2: Gaussian reward modeling for gui grounding

    Fei Tang, Zhangxuan Gu, Zhengxi Lu, Xuyang Liu, Shuheng Shen, Changhua Meng, Wen Wang, Wenqi Zhang, Yongliang Shen, Weiming Lu, et al. Gui-g2: Gaussian reward modeling for gui grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 33214–33222, 2026

  6. [6]

    Mai-ui technical report: Real-world centric foundation gui agents

    Hanzhang Zhou, Xu Zhang, Panrong Tong, Jianan Zhang, Liangyu Chen, Quyu Kong, Chenglin Cai, Chen Liu, Yue Wang, Jingren Zhou, et al. Mai-ui technical report: Real-world centric foundation gui agents. arXiv preprint arXiv:2512.22047, 2025

  7. [7]

    Mobile-agent-v3.5: Multi-platform fundamental gui agents

    Haiyang Xu, Xi Zhang, Haowei Liu, Junyang Wang, Zhaozai Zhu, Shengjie Zhou, Xuhao Hu, Feiyu Gao, Junjie Cao, Zihua Wang, et al. Mobile-agent-v3.5: Multi-platform fundamental gui agents. arXiv preprint arXiv:2602.16855, 2026

  8. [8]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14281–14290, 2024

  9. [9]

    OS-ATLAS: Foundation action model for generalist GUI agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, and Yu Qiao. OS-ATLAS: Foundation action model for generalist GUI agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=n9PDaFNi8t

  10. [10]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

  11. [11]

    Reguide: Data efficient gui grounding via spatial reasoning and search

    Hyunseok Lee, Jeonghoon Kim, Beomjun Kim, Jihoon Tack, Chansong Jo, Jaehong Lee, Cheonbok Park, Sookyo In, Jinwoo Shin, and Kang Min Yoo. Reguide: Data efficient gui grounding via spatial reasoning and search. arXiv preprint arXiv:2505.15259, 2025

  12. [12]

    Visual test-time scaling for gui agent grounding

    Tiange Luo, Lajanugen Logeswaran, Justin Johnson, and Honglak Lee. Visual test-time scaling for gui agent grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19989–19998, 2025

  13. [13]

    Zoom in, click out: Unlocking and evaluating the potential of zooming for gui grounding

    Zhiyuan Jiang, Shenghao Xie, Wenyi Li, Wenqiang Zu, Peihang Li, Jiahao Qiu, Siqi Pei, Lei Ma, Tiejun Huang, Mengdi Wang, et al. Zoom in, click out: Unlocking and evaluating the potential of zooming for gui grounding. arXiv preprint arXiv:2512.05941, 2025

  14. [14]

    Chain-of-ground: Improving gui grounding via iterative reasoning and reference feedback

    Aiden Yiliu Li, Bizhi Yu, Daoan Lei, Tianhe Ren, and Shilong Liu. Chain-of-ground: Improving gui grounding via iterative reasoning and reference feedback. arXiv preprint arXiv:2512.01979, 2025

  15. [15]

    UI-Zoomer: Uncertainty-Driven Adaptive Zoom-In for GUI Grounding

    Fei Tang, Bofan Chen, Zhengxi Lu, Tongbo Chen, Songqin Nong, Tao Jiang, Wenhao Xu, Weiming Lu, Jun Xiao, Yueting Zhuang, et al. Ui-zoomer: Uncertainty-driven adaptive zoom-in for gui grounding. arXiv preprint arXiv:2604.14113, 2026

  16. [16]

    Mvp: Multiple view prediction improves gui grounding

    Yunzhu Zhang, Zeyu Pan, Zhengwen Zeng, Shuheng Shen, Changhua Meng, and Linchao Zhu. Mvp: Multiple view prediction improves gui grounding. arXiv preprint arXiv:2512.08529, 2025

  17. [17]

    Dimo-gui: Advancing test-time scaling in gui grounding via modality-aware visual reasoning

    Hang Wu, Hongkai Chen, Yujun Cai, Chang Liu, Qingwen Ye, Ming-Hsuan Yang, and Yiwei Wang. Dimo-gui: Advancing test-time scaling in gui grounding via modality-aware visual reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26257–26267, 2025

  18. [18]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  19. [19]

    Screenspot-pro: Gui grounding for professional high-resolution computer use

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 8778–8786, 2025

  20. [20]

    Scaling computer-use grounding via user interface decomposition and synthesis

    Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, et al. Scaling computer-use grounding via user interface decomposition and synthesis. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025

  21. [21]

    Ui-vision: A desktop-centric gui benchmark for visual perception and interaction

    Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Juan A Rodriguez, Montek Kalsi, Rabiul Awal, Nicolas Chapados, M Tamer Özsu, Aishwarya Agrawal, David Vazquez, et al. Ui-vision: A desktop-centric gui benchmark for visual perception and interaction. arXiv preprint arXiv:2503.15661, 2025

  22. [22]

    Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents

    Xuehui Wang, Zhenyu Wu, JingJing Xie, Zichen Ding, Bowen Yang, Zehao Li, Zhaoyang Liu, Qingyun Li, Xuan Dong, Zhe Chen, et al. Mmbench-gui: Hierarchical multi-platform evaluation framework for gui agents. arXiv preprint arXiv:2507.19478, 2025

  23. [23]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  24. [24]

    UI-ins: Enhancing GUI grounding with multi-perspective instruction as reasoning

    Liangyu Chen, Hanzhang Zhou, Chenglin Cai, Jianan Zhang, Panrong Tong, Xu Zhang, Quyu Kong, Chen Liu, Yuqi Liu, Wenxuan Wang, Yue Wang, Qin Jin, and Steven HOI. UI-ins: Enhancing GUI grounding with multi-perspective instruction as reasoning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=d...

  25. [25]

    GuirlVG: Incentivize GUI visual grounding via empirical exploration on reinforcement learning

    Weitai Kang, Bin Lei, Gaowen Liu, Caiwen Ding, and Yan Yan. GuirlVG: Incentivize GUI visual grounding via empirical exploration on reinforcement learning. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=zrH2A1upAo

  26. [26]

    Learning gui grounding with spatial reasoning from visual feedback

    Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, et al. Learning gui grounding with spatial reasoning from visual feedback. arXiv preprint arXiv:2509.21552, 2025

  27. [27]

    Ui-tars-1.5: A multimodal ui understanding and reasoning model, 2025

    ByteDance Seed Team. Ui-tars-1.5: A multimodal ui understanding and reasoning model, 2025. URL https://seed-tars.com/1.5

  28. [28]

    A survey on (m) llm-based gui agents

    Fei Tang, Haolei Xu, Hang Zhang, Siqi Chen, Xingyu Wu, Yongliang Shen, Wenqi Zhang, Guiyang Hou, Zeqi Tan, Yuchen Yan, et al. A survey on (m) llm-based gui agents. arXiv preprint arXiv:2504.13865, 2025

  29. [29]

    Showui: One vision-language-action model for generalist gui agent

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for generalist gui agent. In NeurIPS 2024 Workshop on Open-World Agents, 2024

  30. [30]

    Seeclick: Harnessing gui grounding for advanced visual gui agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Li YanTao, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 9313–9332, 2024

  31. [31]

    Aria-ui: Visual grounding for gui instructions

    Yuhao Yang, Yue Wang, Dongxu Li, Ziyang Luo, Bei Chen, Chao Huang, and Junnan Li. Aria-ui: Visual grounding for gui instructions. In Findings of the Association for Computational Linguistics: ACL 2025, pages 22418–22433, 2025

  32. [32]

    Evocua: Evolving computer use agents via learning from scalable synthetic experience

    Taofeng Xue, Chong Peng, Mianqiu Huang, Linsen Guo, Tiancheng Han, Haozhe Wang, Jianing Wang, Xiaocheng Zhang, Xin Yang, Dengchang Zhao, et al. Evocua: Evolving computer use agents via learning from scalable synthetic experience. arXiv preprint arXiv:2601.15876, 2026

  33. [33]

    Opencua: Open foundations for computer-use agents

    Xinyuan Wang, Bowen Wang, Dunjie Lu, Junlin Yang, Tianbao Xie, Junli Wang, Jiaqi Deng, Xiaole Guo, Yiheng Xu, Chen Henry Wu, et al. Opencua: Open foundations for computer-use agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  34. [34]

    GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

    Run Luo, Lu Wang, Wanwei He, Longze Chen, Jiaming Li, and Xiaobo Xia. Gui-r1: A generalist r1-style vision-language action model for gui agents. arXiv preprint arXiv:2504.10458, 2025

  35. [35]

    Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning

    Xinbin Yuan, Jian Zhang, Kaixin Li, Zhuoxuan Cai, Lujian Yao, Jie Chen, Enguang Wang, Qibin Hou, Jinwei Chen, Peng-Tao Jiang, et al. Enhancing visual grounding for gui agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370, 2025

  36. [36]

    Ui-venus technical report: Building high-performance ui agents with rft

    Zhangxuan Gu, Zhengwen Zeng, Zhenyu Xu, Xingran Zhou, Shuheng Shen, Yunfei Liu, Beitong Zhou, Changhua Meng, Tianyu Xia, Weizhi Chen, et al. Ui-venus technical report: Building high-performance ui agents with rft. arXiv preprint arXiv:2508.10833, 2025

  37. [37]

    Gui-eyes: Tool-augmented perception for visual grounding in gui agents

    Chen Chen, Jiawei Shao, Dakuan Lu, Haoyi Hu, Xiangcheng Liu, Hantao Yao, and Wu Liu. Gui-eyes: Tool-augmented perception for visual grounding in gui agents. arXiv preprint arXiv:2601.09770, 2026

  38. [38]

    Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning

    Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=4FWAwZtd2n

  39. [39]

    Gui-spotlight: Adaptive iterative focus refinement for enhanced gui visual grounding

    Bin Lei, Nuo Xu, Ali Payani, Mingyi Hong, Chunhua Liao, Yu Cao, and Caiwen Ding. Gui-spotlight: Adaptive iterative focus refinement for enhanced gui visual grounding. arXiv preprint arXiv:2510.04039, 2025

  40. [40]

    Improved gui grounding via iterative narrowing

    Anthony Nguyen. Improved gui grounding via iterative narrowing. arXiv preprint arXiv:2411.13591, 2024

  41. [41]

    Test-time reinforcement learning for gui grounding via region consistency

    Yong Du, Yuchen Yan, Fei Tang, Zhengxi Lu, Chang Zong, Weiming Lu, Shengpei Jiang, and Yongliang Shen. Test-time reinforcement learning for gui grounding via region consistency. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 30593–30601, 2026