Learning GUI Grounding with Spatial Reasoning from Visual Feedback

Chaoyun Zhang; Fangkai Yang; Huseyin Atahan Inan; Lukas Wutschitz; Lu Wang; Pasquale Minervini; Robert Sim; Samuel Kessler; Saravan Rajmohan; Wei-Ning Chen

arxiv: 2509.21552 · v2 · pith:QMNI4LRTnew · submitted 2025-09-25 · 💻 cs.CV · cs.CL


Yu Zhao, Wei-Ning Chen, Huseyin Atahan Inan, Samuel Kessler, Lu Wang, Lukas Wutschitz, Fangkai Yang, Chaoyun Zhang, Pasquale Minervini, Saravan Rajmohan, Robert Sim



keywords grounding · cursor · gui-cursor · model · spatial · target · actions · coordinates


Graphical User Interface (GUI) grounding is commonly framed as a coordinate prediction task -- given a natural language instruction, generate on-screen coordinates for actions such as clicks and keystrokes. However, recent Vision Language Models (VLMs) often fail to predict accurate numeric coordinates when processing GUI images with high resolutions and complex layouts. To address this issue, we reframe GUI grounding as an interactive search task, where the VLM generates actions to move a cursor in the GUI to locate UI elements. At each step, the model determines the target object, evaluates the spatial relations between the cursor and the target, and moves the cursor closer to the target conditioned on the movement history. In this interactive process, the rendered cursor provides visual feedback to help the model align its predictions with the corresponding on-screen locations. We train our GUI grounding model, GUI-Cursor, using multi-step online reinforcement learning with a dense trajectory-based reward function. Experimental results demonstrate that GUI-Cursor surpasses strong baselines in GUI grounding and agentic tasks, achieving superior performance with the same base models while requiring less training data. Further analysis shows that GUI-Cursor learns to adaptively conduct more steps on more difficult examples, and it obtains better spatial reasoning capability on out-of-distribution domains.
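The interactive search described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `cursor_search`, `make_toy_policy`, and `inside`, the `("move" | "stop", point)` action format, and the step budget are hypothetical and are not taken from the paper. In GUI-Cursor the policy would be a VLM that looks at the screenshot with the cursor rendered on it; the toy policy below simply moves halfway toward a known target box so that the loop can run on its own. The reward function used for training is not sketched, since the abstract does not specify it.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]
BBox = Tuple[int, int, int, int]  # (left, top, right, bottom), inclusive
Action = Tuple[str, Point]        # ("move", new_position) or ("stop", position)
Policy = Callable[[Point, List[Point]], Action]


def inside(p: Point, box: BBox) -> bool:
    """Return True if point p lies within the bounding box."""
    left, top, right, bottom = box
    return left <= p[0] <= right and top <= p[1] <= bottom


def clamp(p: Point, screen: Tuple[int, int]) -> Point:
    """Keep the cursor within the screen bounds (width, height)."""
    w, h = screen
    return (min(max(p[0], 0), w - 1), min(max(p[1], 0), h - 1))


def make_toy_policy(target: BBox) -> Policy:
    """Stand-in for the VLM: moves halfway to the target center each step.

    A real policy would infer the target from the instruction and the
    rendered cursor; here the target box is given directly (an assumption).
    """
    cx = (target[0] + target[2]) // 2
    cy = (target[1] + target[3]) // 2

    def policy(cursor: Point, history: List[Point]) -> Action:
        if inside(cursor, target):
            return ("stop", cursor)
        return ("move", ((cursor[0] + cx) // 2, (cursor[1] + cy) // 2))

    return policy


def cursor_search(policy: Policy, start: Point, screen: Tuple[int, int],
                  max_steps: int = 8) -> Tuple[Point, List[Point]]:
    """Run the move-observe loop until the policy stops or the budget ends."""
    cursor = clamp(start, screen)
    history = [cursor]  # movement history the policy is conditioned on
    for _ in range(max_steps):
        kind, point = policy(cursor, history)
        if kind == "stop":
            break
        cursor = clamp(point, screen)
        history.append(cursor)
    return cursor, history


if __name__ == "__main__":
    box = (600, 400, 640, 420)
    final, path = cursor_search(make_toy_policy(box), (0, 0), (1920, 1080))
    print(final, inside(final, box), len(path) - 1, "moves")
```

The point of the loop structure is that each decision depends on the current cursor position and the movement history rather than on a single one-shot coordinate prediction, which is the reframing the abstract describes.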



Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
cs.CV 2026-05 conditional novelty 7.0

GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 conditional novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by using video-based dynamic perception, action-conditioned refinement, and reflection, outperforming prior agents on the new DynamicGUIBench while...
Benchmarking and Improving GUI Agents in High-Dynamic Environments
cs.CV 2026-04 unverdicted novelty 7.0

DynamicUI improves GUI agent performance in high-dynamic environments by processing interaction videos with frame clustering, action-conditioned refinement, and reflection, outperforming prior approaches on the new Dy...
DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
cs.AI 2026-05 unverdicted novelty 5.0

DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without ...
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding
cs.LG 2026-04 unverdicted novelty 5.0

A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.