Zoom in, click out: Unlocking and evaluating the potential of zooming for GUI grounding. arXiv preprint arXiv:2512.05941
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.CV (3)
years: 2026 (3)
roles: background (1)
polarities: background (1)
citing papers explorer
- What Happens Before Decoding? Prefill Determines GUI Grounding in VLMs
  GUI grounding in VLMs is bottlenecked by prefill-stage candidate selection that decoding cannot fix, so Re-Prefill uses attention to extract and re-inject target tokens for up to 4.3% gains on ScreenSpot-Pro.
- AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
  AutoFocus converts token perplexity into an anisotropic Gaussian uncertainty field to drive region proposals and shape-aware zooming for improved GUI grounding in VLMs.
- Zoom Consistency: A Free Confidence Signal in Multi-Step Visual Grounding Pipelines
  Zoom consistency provides a geometric, cross-model confidence signal in zoom-in grounding pipelines that correlates with prediction correctness and enables modest gains in specialist-generalist routing.