pith. machine review for the scientific record.

arxiv: 2605.02630 · v1 · submitted 2026-05-04 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding

Authors on Pith · no claims yet

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.CV
keywords GUI grounding · active visual search · uncertainty estimation · vision-language models · coordinate prediction · anisotropic Gaussian · region proposal

The pith

AutoFocus converts token perplexity into an anisotropic Gaussian field to adaptively zoom and ground GUI coordinates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to make vision-language models better at turning natural language instructions into exact pixel coordinates on crowded, high-resolution screens. Existing zoom-in methods rely on fixed anchors or heuristic grids that ignore what the model itself is unsure about. AutoFocus instead draws several possible coordinate guesses, reads the model's token-level perplexity along each axis as a signal of spatial uncertainty, and builds a directional probability map from those scores. This map drives targeted region proposals and a final consistency check via visual prompts. A sympathetic reader would care because the approach is training-free and works on both general-purpose and GUI-specialized models, directly addressing the resolution gap that currently limits reliable GUI agents.

Core claim

AutoFocus samples multiple coordinate hypotheses from a VLM, converts their axial perplexities into an anisotropic Gaussian spatial probability field that models directional uncertainty, generates global and local region proposals using Shape-Aware Zooming, and selects the most consistent prediction through structured visual-prompt comparison.
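
A minimal sketch of that four-stage flow, for orientation only: the function names (build_field, crop_from_field, select_consistent), the perplexity-weighted mean, and the way the per-axis spread scales with normalized axial perplexity are editorial assumptions, not the paper's stated formulas, and the VLM is stood in for by a fake sampler.

    import numpy as np

    def build_field(samples):
        """Fit an axis-aligned anisotropic Gaussian from (x, y, ppl_x, ppl_y) samples.
        Assumption: mean = perplexity-weighted centroid; per-axis spread grows with
        the normalized axial perplexity. Not the paper's stated formula."""
        xy = np.array([(x, y) for x, y, _, _ in samples], dtype=float)
        ppl = np.array([(px, py) for _, _, px, py in samples], dtype=float)
        w = 1.0 / ppl.mean(axis=1)                       # lower perplexity -> more weight
        mu = (xy * w[:, None]).sum(axis=0) / w.sum()
        p = ppl.mean(axis=0)
        norm_ppl = p / p.max()                           # simple rescaling to (0, 1]
        sigma = np.maximum(xy.std(axis=0), 8.0) * (1.0 + norm_ppl)
        return mu, sigma                                 # field ~ N(mu, diag(sigma**2))

    def crop_from_field(mu, sigma, img_wh, k=3.0, min_half=64.0):
        """Turn the field into a zoom-in region: +/- k*sigma around mu, with a floor
        so the crop keeps surrounding context (a loose stand-in for Shape-Aware Zooming)."""
        half = np.maximum(k * sigma, min_half)
        lo = np.maximum(mu - half, 0.0)
        hi = np.minimum(mu + half, np.array(img_wh, dtype=float))
        return (*lo, *hi)                                # (x0, y0, x1, y1)

    def select_consistent(preds):
        """Pick the prediction closest to the centroid of all zoomed-in predictions,
        a simple proxy for the paper's structured visual-prompt comparison."""
        pts = np.array(preds, dtype=float)
        centroid = pts.mean(axis=0)
        return tuple(pts[np.argmin(np.linalg.norm(pts - centroid, axis=1))])

    # Toy run with a fake sampler standing in for the VLM.
    rng = np.random.default_rng(0)
    samples = [(1200 + rng.normal(0, 15), 800 + rng.normal(0, 5),
                1.5 + rng.random(), 1.1 + rng.random()) for _ in range(8)]
    mu, sigma = build_field(samples)
    region = crop_from_field(mu, sigma, img_wh=(3840, 2160))
    final_xy = select_consistent([(x, y) for x, y, _, _ in samples])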

What carries the argument

The anisotropic Gaussian spatial probability field built from axial perplexities of sampled coordinates, which supplies directional uncertainty estimates to generate adaptive region proposals.

If this is right

  • Grounding accuracy rises consistently on high-resolution GUI benchmarks for both general-purpose and specialized VLMs without any retraining.
  • Region proposals adapt to the model's actual uncertainty instead of relying on fixed anchors or heuristic grids.
  • Shape-Aware Zooming preserves context while tightening localization around uncertain areas.
  • Visual-prompt aggregation selects the prediction with highest internal consistency across proposals.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same perplexity-to-Gaussian conversion could be tested on other coordinate-output tasks such as robotic manipulation or document layout analysis.
  • If the uncertainty signal proves reliable, black-box APIs that hide token probabilities would need alternative proxies to use the method.
  • The framework suggests a broader pattern where model-internal uncertainty replaces external search heuristics in visual grounding pipelines.

Load-bearing premise

Token-level perplexity during coordinate generation directly reflects and can be mapped to spatial uncertainty in the predicted screen location.
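
A small concrete reading of this premise, using the textbook definition of perplexity as the exponential of the mean negative log-likelihood over the tokens that spell out one axis; the digit split and the log-probabilities below are hypothetical, not taken from the paper or any particular model interface.

    import math

    def axial_perplexity(token_logprobs):
        """Standard perplexity: exp of the mean negative log-likelihood over
        the tokens that make up one coordinate axis."""
        return math.exp(-sum(token_logprobs) / len(token_logprobs))

    # Hypothetical log-probs for the digit tokens of x = "1274" and y = "803".
    x_logprobs = [-0.05, -0.40, -1.90, -2.10]   # later digits uncertain
    y_logprobs = [-0.03, -0.08, -0.11]          # all digits confident

    ppl_x = axial_perplexity(x_logprobs)        # ~ 3.0
    ppl_y = axial_perplexity(y_logprobs)        # ~ 1.08

    # Under the paper's premise, ppl_x > ppl_y is read as larger horizontal
    # than vertical uncertainty for this sample, yielding an elongated Gaussian.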

What would settle it

If experiments on ScreenSpot-Pro or ScreenSpot-V2 show no accuracy improvement over single-shot prediction or fixed-grid zooming when perplexity is used to drive proposals, the central mechanism is refuted.

Figures

Figures reproduced from arXiv: 2605.02630 by Ruilin Yao, Shengwu Xiong, Shili Xiong, Tianyu Zou, Yi Rong.

Figure 1
Figure 1. Comparison with other zoom-in methods. Recent works have explored multi-stage “zoom-in” strategies to mitigate this problem. However, existing approaches often rely on fixed anchors, heuristic grids, or zooming around a single initial prediction (as shown in… view at source ↗
Figure 2
Figure 2. Perplexity (PPL) distribution of grounding results on ScreenSpot-Pro. We compare the PPL across multiple models, with green and red areas indicating correct and incorrect grounding predictions, respectively. Higher PPL values for incorrect results suggest a strong correlation between model uncertainty and grounding failure. view at source ↗
Figure 3
Figure 3. Overview of AutoFocus. The model first samples multiple coordinate hypotheses with axial perplexities from the initial prediction. Gaussian Dynamic Focusing constructs an anisotropic spatial field to generate global and local proposals. High-resolution zoom-in predictions are performed on candidate regions, and a structured aggregation step selects the final coordinate. view at source ↗
Figure 4
Figure 4. Effect of sampling budget N. view at source ↗
read the original abstract

Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding with VLMs. The core idea is that token-level perplexity during coordinate generation reflects spatial uncertainty; this is used to sample hypotheses, construct an anisotropic Gaussian spatial probability field from axial perplexities, generate global/local region proposals, apply Shape-Aware Zooming, and aggregate predictions via visual-prompt comparison. Experiments on ScreenSpot-Pro and ScreenSpot-V2 report consistent gains across general-purpose and GUI-specialized VLMs.

Significance. If the perplexity-to-uncertainty mapping is reliable, the work supplies a principled, parameter-free mechanism for adaptive refinement in high-resolution GUI interfaces, avoiding the limitations of fixed anchors or heuristic grids. It leverages existing VLM outputs without retraining and could improve autonomous agents by explicitly modeling directional uncertainty.

major comments (2)
  1. [Abstract] Abstract: The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.
  2. [Method] Method description: No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.
minor comments (1)
  1. [Abstract] The abstract would be strengthened by briefly stating the magnitude of improvements and the specific baselines compared against.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We have addressed each of the major comments in detail below and will incorporate the suggested revisions into the next version of the paper.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.

    Authors: We acknowledge the referee's concern that the abstract states the core assumption without sufficient supporting evidence. While the full paper elaborates on the intuition in Section 3, we agree that a more rigorous justification is warranted. In the revised manuscript, we will add a dedicated subsection in the method that derives the perplexity-to-uncertainty mapping from the principles of autoregressive token prediction, where higher perplexity indicates greater ambiguity in the coordinate token. Additionally, we will include an empirical correlation analysis between axial perplexities and ground-truth localization errors on the ScreenSpot datasets to verify the assumption. This will strengthen the claim that the approach is principled rather than purely heuristic. revision: yes

  2. Referee: [Method] Method description: No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.

    Authors: We apologize for omitting explicit implementation details from the method section; these details are critical for reproducibility. In the revised paper, we will provide the complete mapping function, including the normalization of axial perplexities (p_x, p_y) to a standard range, the construction of the covariance matrix as an anisotropic diagonal matrix scaled by the normalized perplexities, and the exact formula for the spatial probability field. We will also include pseudocode and a step-by-step example to clarify how the Gaussian parameters are derived from the VLM's token outputs. These additions will allow readers to fully assess and replicate the framework. revision: yes
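
For concreteness, one plausible instantiation of the covariance construction the rebuttal describes (an editorial sketch; the symbols \tilde p_a, \sigma_0, and \alpha, and the exact scaling, are assumptions, not the authors' formula):

    \Sigma = \mathrm{diag}(\sigma_x^2, \sigma_y^2), \qquad
    \sigma_a = \sigma_0\,(1 + \alpha\,\tilde p_a), \quad a \in \{x, y\}

    f(\mathbf{u}) \propto \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{u}-\boldsymbol{\mu})^{\top}\,\Sigma^{-1}\,(\mathbf{u}-\boldsymbol{\mu})\Big)

Here \tilde p_a is the axial perplexity p_a rescaled to a standard range such as [0, 1], \sigma_0 is a base scale, \alpha a sensitivity constant, and \boldsymbol{\mu} the (for example, perplexity-weighted) mean of the sampled coordinates.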

Circularity Check

0 steps flagged

No circularity; derivation uses VLM perplexity outputs directly without self-referential reduction

full rationale

The paper's central step states that token-level perplexity 'naturally reflects spatial uncertainty' and is 'directly converted' into an anisotropic Gaussian field to guide proposals and Shape-Aware Zooming. This is presented as an insight applied to existing VLM token outputs rather than a quantity fitted or defined in terms of the downstream predictions. No equations reduce the uncertainty field back to the input coordinates by construction, no parameters are fitted on a subset and then 'predicted,' and no load-bearing self-citations or uniqueness theorems are invoked. The subsequent aggregation and zooming steps consume the constructed field but do not feed back into its construction. The chain is therefore self-contained and does not collapse to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on one domain assumption about perplexity reflecting spatial uncertainty. No free parameters or new invented entities are introduced in the abstract description.

axioms (1)
  • domain assumption Token-level perplexity in coordinate generation naturally reflects spatial uncertainty
    This is explicitly stated as the key insight that enables conversion to an anisotropic Gaussian field.

pith-pipeline@v0.9.0 · 5506 in / 1222 out tokens · 82905 ms · 2026-05-08T18:34:25.678986+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
