pith. machine review for the scientific record.

arxiv: 2605.02630 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI grounding · active visual search · uncertainty estimation · vision-language models · coordinate prediction · anisotropic Gaussian · region proposal

The pith

AutoFocus converts token perplexity into an anisotropic Gaussian field to adaptively zoom and ground GUI coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to make vision-language models better at turning natural language instructions into exact pixel coordinates on crowded, high-resolution screens. Existing zoom-in methods rely on fixed anchors or heuristic grids that ignore what the model itself is unsure about. AutoFocus instead draws several possible coordinate guesses, reads the model's token-level perplexity along each axis as a signal of spatial uncertainty, and builds a directional probability map from those scores. This map drives targeted region proposals and a final consistency check via visual prompts. A sympathetic reader would care because the approach is training-free and works on both general-purpose and GUI-specialized models, directly addressing the resolution gap that currently limits reliable GUI agents.

Core claim

AutoFocus samples multiple coordinate hypotheses from a VLM, converts their axial perplexities into an anisotropic Gaussian spatial probability field that models directional uncertainty, generates global and local region proposals using Shape-Aware Zooming, and selects the most consistent prediction through structured visual-prompt comparison.
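
A minimal sketch of that four-stage flow, for orientation only: the function names (build_field, crop_from_field, select_consistent), the perplexity-weighted mean, and the way the per-axis spread scales with normalized axial perplexity are editorial assumptions, not the paper's stated formulas, and the VLM is stood in for by a fake sampler.

    import numpy as np

    def build_field(samples):
        """Fit an axis-aligned anisotropic Gaussian from (x, y, ppl_x, ppl_y) samples.
        Assumption: mean = perplexity-weighted centroid; per-axis spread grows with
        the normalized axial perplexity. Not the paper's stated formula."""
        xy = np.array([(x, y) for x, y, _, _ in samples], dtype=float)
        ppl = np.array([(px, py) for _, _, px, py in samples], dtype=float)
        w = 1.0 / ppl.mean(axis=1)                       # lower perplexity -> more weight
        mu = (xy * w[:, None]).sum(axis=0) / w.sum()
        p = ppl.mean(axis=0)
        norm_ppl = p / p.max()                           # simple rescaling to (0, 1]
        sigma = np.maximum(xy.std(axis=0), 8.0) * (1.0 + norm_ppl)
        return mu, sigma                                 # field ~ N(mu, diag(sigma**2))

    def crop_from_field(mu, sigma, img_wh, k=3.0, min_half=64.0):
        """Turn the field into a zoom-in region: +/- k*sigma around mu, with a floor
        so the crop keeps surrounding context (a loose stand-in for Shape-Aware Zooming)."""
        half = np.maximum(k * sigma, min_half)
        lo = np.maximum(mu - half, 0.0)
        hi = np.minimum(mu + half, np.array(img_wh, dtype=float))
        return (*lo, *hi)                                # (x0, y0, x1, y1)

    def select_consistent(preds):
        """Pick the prediction closest to the centroid of all zoomed-in predictions,
        a simple proxy for the paper's structured visual-prompt comparison."""
        pts = np.array(preds, dtype=float)
        centroid = pts.mean(axis=0)
        return tuple(pts[np.argmin(np.linalg.norm(pts - centroid, axis=1))])

    # Toy run with a fake sampler standing in for the VLM.
    rng = np.random.default_rng(0)
    samples = [(1200 + rng.normal(0, 15), 800 + rng.normal(0, 5),
                1.5 + rng.random(), 1.1 + rng.random()) for _ in range(8)]
    mu, sigma = build_field(samples)
    region = crop_from_field(mu, sigma, img_wh=(3840, 2160))
    final_xy = select_consistent([(x, y) for x, y, _, _ in samples])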

What carries the argument

The anisotropic Gaussian spatial probability field built from axial perplexities of sampled coordinates, which supplies directional uncertainty estimates to generate adaptive region proposals.

If this is right

  • Grounding accuracy rises consistently on high-resolution GUI benchmarks for both general-purpose and specialized VLMs without any retraining.
  • Region proposals adapt to the model's actual uncertainty instead of relying on fixed anchors or heuristic grids.
  • Shape-Aware Zooming preserves context while tightening localization around uncertain areas.
  • Visual-prompt aggregation selects the prediction with highest internal consistency across proposals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same perplexity-to-Gaussian conversion could be tested on other coordinate-output tasks such as robotic manipulation or document layout analysis.
  • If the uncertainty signal proves reliable, black-box APIs that hide token probabilities would need alternative proxies to use the method.
  • The framework suggests a broader pattern where model-internal uncertainty replaces external search heuristics in visual grounding pipelines.

Load-bearing premise

Token-level perplexity during coordinate generation directly reflects and can be mapped to spatial uncertainty in the predicted screen location.
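
A small concrete reading of this premise, using the textbook definition of perplexity as the exponential of the mean negative log-likelihood over the tokens that spell out one axis; the digit split and the log-probabilities below are hypothetical, not taken from the paper or any particular model interface.

    import math

    def axial_perplexity(token_logprobs):
        """Standard perplexity: exp of the mean negative log-likelihood over
        the tokens that make up one coordinate axis."""
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Hypothetical log-probs for the digit tokens of x = "1274" and y = "803".
    x_logprobs = [-0.05, -0.40, -1.90, -2.10]   # later digits uncertain
    y_logprobs = [-0.03, -0.08, -0.11]          # all digits confident

    ppl_x = axial_perplexity(x_logprobs)        # ~ 3.0
    ppl_y = axial_perplexity(y_logprobs)        # ~ 1.08

    # Under the paper's premise, ppl_x > ppl_y is read as larger horizontal
    # than vertical uncertainty for this sample, yielding an elongated Gaussian.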

What would settle it

If experiments on ScreenSpot-Pro or ScreenSpot-V2 show no accuracy improvement over single-shot prediction or fixed-grid zooming when perplexity is used to drive proposals, the central mechanism is refuted.

Figures

Figures reproduced from arXiv: 2605.02630 by Ruilin Yao, Shengwu Xiong, Shili Xiong, Tianyu Zou, Yi Rong.

Figure 1
Figure 1. Comparison with other zoom-in methods. Recent works have explored multi-stage “zoom-in” strategies to mitigate this problem. However, existing approaches often rely on fixed anchors, heuristic grids, or zooming around a single initial prediction (as shown in… view at source ↗
Figure 2
Figure 2. Perplexity (PPL) distribution of grounding results on ScreenSpot-Pro. We compare the PPL across multiple models, with green and red areas indicating correct and incorrect grounding predictions, respectively. Higher PPL values for incorrect results suggest a strong correlation between model uncertainty and grounding failure. view at source ↗
Figure 3
Figure 3. Overview of AutoFocus. The model first samples multiple coordinate hypotheses with axial perplexities from the initial prediction. Gaussian Dynamic Focusing constructs an anisotropic spatial field to generate global and local proposals. High-resolution zoom-in predictions are performed on candidate regions, and a structured aggregation step selects the final coordinate. view at source ↗
Figure 4
Figure 4. Effect of sampling budget N. view at source ↗
read the original abstract

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding with VLMs. The core idea is that token-level perplexity during coordinate generation reflects spatial uncertainty; this is used to sample hypotheses, construct an anisotropic Gaussian spatial probability field from axial perplexities, generate global/local region proposals, apply Shape-Aware Zooming, and aggregate predictions via visual-prompt comparison. Experiments on ScreenSpot-Pro and ScreenSpot-V2 report consistent gains across general-purpose and GUI-specialized VLMs.

Significance. If the perplexity-to-uncertainty mapping is reliable, the work supplies a principled, parameter-free mechanism for adaptive refinement in high-resolution GUI interfaces, avoiding the limitations of fixed anchors or heuristic grids. It leverages existing VLM outputs without retraining and could improve autonomous agents by explicitly modeling directional uncertainty.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.
  2. [Method] Method description: No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the magnitude of improvements and the specific baselines compared against.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We have addressed each of the major comments in detail below and will incorporate the suggested revisions into the next version of the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.

    Authors: We acknowledge the referee's concern that the abstract states the core assumption without sufficient supporting evidence. While the full paper elaborates on the intuition in Section 3, we agree that a more rigorous justification is warranted. In the revised manuscript, we will add a dedicated subsection in the method that derives the perplexity-to-uncertainty mapping from the principles of autoregressive token prediction, where higher perplexity indicates greater ambiguity in the coordinate token. Additionally, we will include an empirical correlation analysis between axial perplexities and ground-truth localization errors on the ScreenSpot datasets to verify the assumption. This will strengthen the claim that the approach is principled rather than purely heuristic. revision: yes

  2. Referee: [Method] Method description: No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.

    Authors: We apologize for omitting explicit implementation details from the method section; these details are critical for reproducibility. In the revised paper, we will provide the complete mapping function, including the normalization of axial perplexities (p_x, p_y) to a standard range, the construction of the covariance matrix as an anisotropic diagonal matrix scaled by the normalized perplexities, and the exact formula for the spatial probability field. We will also include pseudocode and a step-by-step example to clarify how the Gaussian parameters are derived from the VLM's token outputs. These additions will allow readers to fully assess and replicate the framework. revision: yes
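
For concreteness, one plausible instantiation of the covariance construction the rebuttal describes (an editorial sketch; the symbols \tilde p_a, \sigma_0, and \alpha, and the exact scaling, are assumptions, not the authors' formula):

    \Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2), \qquad
    \sigma_a = \sigma_0\,(1 + \alpha\,\tilde p_a), \quad a \in \{x, y\}

    f(\mathbf{u}) \propto \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{u}-\boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{u}-\boldsymbol{\mu})\Big)

Here \tilde p_a is the axial perplexity p_a rescaled to a standard range such as [0, 1], \sigma_0 is a base scale, \alpha a sensitivity constant, and \boldsymbol{\mu} the (for example, perplexity-weighted) mean of the sampled coordinates.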

Circularity Check

0 steps flagged

No circularity; derivation uses VLM perplexity outputs directly without self-referential reduction

full rationale

The paper's central step states that token-level perplexity 'naturally reflects spatial uncertainty' and is 'directly converted' into an anisotropic Gaussian field to guide proposals and Shape-Aware Zooming. This is presented as an insight applied to existing VLM token outputs rather than a quantity fitted or defined in terms of the downstream predictions. No equations reduce the uncertainty field back to the input coordinates by construction, no parameters are fitted on a subset and then 'predicted,' and no load-bearing self-citations or uniqueness theorems are invoked. The subsequent aggregation and zooming steps consume the constructed field but do not feed back into its construction. The chain is therefore self-contained and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about perplexity reflecting spatial uncertainty. No free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Token-level perplexity in coordinate generation naturally reflects spatial uncertainty
    This is explicitly stated as the key insight that enables conversion to an anisotropic Gaussian field.

pith-pipeline@v0.9.0 · 5506 in / 1222 out tokens · 82905 ms · 2026-05-08T18:34:25.678986+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
