AutoFocus: Uncertainty-Aware Active Visual Search for GUI Grounding
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3
The pith
AutoFocus converts token perplexity into an anisotropic Gaussian field to adaptively zoom and ground GUI coordinates.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AutoFocus samples multiple coordinate hypotheses from a VLM, converts their axial perplexities into an anisotropic Gaussian spatial probability field that models directional uncertainty, generates global and local region proposals using Shape-Aware Zooming, and selects the most consistent prediction through structured visual-prompt comparison.
What carries the argument
The anisotropic Gaussian spatial probability field built from axial perplexities of sampled coordinates, which supplies directional uncertainty estimates to generate adaptive region proposals.
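A minimal sketch of this construction, assuming the field is a sum of per-hypothesis axis-aligned Gaussians. The scaling σ_{z,i} = β · PPL^(i)_z is quoted from the paper; the grid evaluation, mixture form, and all names are our assumptions, not the released code.

```python
import numpy as np

def gaussian_field(samples, ppl, beta=50.0, width=1920, height=1080):
    """Anisotropic Gaussian spatial probability field from sampled hypotheses.

    samples: (N, 2) array of (x, y) coordinate predictions from the VLM.
    ppl:     (N, 2) array of axial perplexities (PPL_x, PPL_y) per sample.
    Returns a (height, width) map normalized to sum to 1.
    """
    sigma = beta * np.asarray(ppl, dtype=float)   # sigma_{z,i} = beta * PPL_z^(i)
    ys, xs = np.mgrid[0:height, 0:width]
    field = np.zeros((height, width))
    for (cx, cy), (sx, sy) in zip(np.asarray(samples, dtype=float), sigma):
        field += np.exp(-0.5 * (((xs - cx) / sx) ** 2 + ((ys - cy) / sy) ** 2))
    return field / field.sum()
```

The peak of this field, np.unravel_index(field.argmax(), field.shape), would then presumably seed the global and local region proposals.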
If this is right
- Grounding accuracy rises consistently on high-resolution GUI benchmarks for both general-purpose and specialized VLMs without any retraining.
- Region proposals adapt to the model's actual uncertainty instead of relying on fixed anchors or heuristic grids.
- Shape-Aware Zooming preserves context while tightening localization around uncertain areas.
- Visual-prompt aggregation selects the prediction with the highest internal consistency across proposals; both this and the Shape-Aware Zooming step are sketched in the code below.
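A hedged sketch of those two steps, not the authors' implementation: the crop box follows the anisotropic spreads (σ_x, σ_y) with a scale factor matching the paper's reported α ∈ {5, 8}, and "internal consistency" is approximated geometrically as the prediction closest on average to the others. The clamping, minimum size, and all names are our assumptions.

```python
import numpy as np

def shape_aware_crop(cx, cy, sigma_x, sigma_y, alpha=5.0,
                     width=1920, height=1080, min_side=64):
    """Zoom window whose aspect ratio tracks directional uncertainty.

    A wide sigma_x keeps horizontal context; a tight sigma_y narrows
    vertically. alpha mirrors the paper's reported scale; the clamping
    to image bounds and the minimum side length are our assumptions.
    """
    half_w = max(alpha * sigma_x, min_side / 2)
    half_h = max(alpha * sigma_y, min_side / 2)
    x0, x1 = max(0, int(cx - half_w)), min(width, int(cx + half_w))
    y0, y1 = max(0, int(cy - half_h)), min(height, int(cy + half_h))
    return x0, y0, x1, y1   # crop box to re-feed to the VLM at higher detail

def most_consistent(preds):
    """Geometric stand-in for visual-prompt aggregation: return the
    prediction with the smallest mean distance to all other hypotheses.
    The paper uses structured visual-prompt comparison instead."""
    preds = np.asarray(preds, dtype=float)                     # (N, 2) points
    d = np.linalg.norm(preds[:, None, :] - preds[None, :, :], axis=-1)
    return preds[d.mean(axis=1).argmin()]
```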
Where Pith is reading between the lines
- The same perplexity-to-Gaussian conversion could be tested on other coordinate-output tasks such as robotic manipulation or document layout analysis.
- If the uncertainty signal proves reliable, black-box APIs that hide token probabilities would need alternative proxies to use the method.
- The framework suggests a broader pattern where model-internal uncertainty replaces external search heuristics in visual grounding pipelines.
Load-bearing premise
Token-level perplexity during coordinate generation directly reflects and can be mapped to spatial uncertainty in the predicted screen location.
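For concreteness, axial perplexity here is presumably the standard exponentiated mean negative log-probability over the tokens that spell out one coordinate. A minimal sketch, assuming the serving stack exposes per-token log-probs:

```python
import math

def axial_perplexity(token_logprobs):
    """Perplexity over the tokens spelling one coordinate value,
    e.g. the digit tokens of "1024" in an "x=1024" output."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```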
What would settle it
If experiments on ScreenSpot-Pro or ScreenSpot-V2 show no accuracy improvement over single-shot prediction or fixed-grid zooming when perplexity is used to drive proposals, the central mechanism is refuted.
read the original abstract
Vision-Language Models (VLMs) have enabled autonomous GUI agents that translate natural language instructions into executable screen coordinates. However, grounding performance degrades in high-resolution interfaces, where dense layouts and small interactive elements expose a resolution gap between modern displays and model input constraints. Existing zoom-in strategies rely on fixed anchors, heuristic grids, or reinforcement learning, lacking a principled mechanism to adaptively determine where refinement is needed and how much spatial uncertainty should be explored. We propose AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding. Our key insight is that token-level perplexity in coordinate generation naturally reflects spatial uncertainty. Rather than committing to a single prediction, AutoFocus samples multiple coordinate hypotheses and converts their axial perplexities into an anisotropic gaussian spatial probability field, explicitly modeling directional uncertainty. Based on this field, we generate global and local region proposals and introduce Shape-Aware Zooming to balance tight localization with contextual preservation. A visual prompt-based aggregation step then selects the most consistent prediction via structured comparison. Extensive experiments on ScreenSpot-Pro and ScreenSpot-V2 demonstrate consistent improvements across both general-purpose and GUI-specialized VLMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces AutoFocus, a training-free, uncertainty-aware active visual search framework for GUI grounding with VLMs. The core idea is that token-level perplexity during coordinate generation reflects spatial uncertainty; this is used to sample hypotheses, construct an anisotropic Gaussian spatial probability field from axial perplexities, generate global/local region proposals, apply Shape-Aware Zooming, and aggregate predictions via visual-prompt comparison. Experiments on ScreenSpot-Pro and ScreenSpot-V2 report consistent gains across general-purpose and GUI-specialized VLMs.
Significance. If the perplexity-to-uncertainty mapping is reliable, the work supplies a principled, parameter-free mechanism for adaptive refinement in high-resolution GUI interfaces, avoiding the limitations of fixed anchors or heuristic grids. It leverages existing VLM outputs without retraining and could improve autonomous agents by explicitly modeling directional uncertainty.
major comments (2)
- [Abstract] The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.
- [Method] No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.
minor comments (1)
- [Abstract] The abstract would be strengthened by briefly stating the magnitude of improvements and the specific baselines compared against.
Simulated Author's Rebuttal
We thank the referee for their thorough review and valuable feedback on our manuscript. We have addressed each of the major comments in detail below and will incorporate the suggested revisions into the next version of the paper.
read point-by-point responses
- Referee: [Abstract] The central claim that 'token-level perplexity in coordinate generation naturally reflects spatial uncertainty' and can be 'directly converted' into an anisotropic Gaussian is asserted without derivation, correlation analysis, or independent verification against localization error. This assumption is load-bearing for the region-proposal and zooming steps; if axial perplexities are only loosely correlated with true spatial error (e.g., due to joint token generation or coarse number tokenization), the framework reduces to a heuristic zoom strategy.
Authors: We acknowledge the referee's concern that the abstract states the core assumption without sufficient supporting evidence. While the full paper elaborates on the intuition in Section 3, we agree that a more rigorous justification is warranted. In the revised manuscript, we will add a dedicated subsection in the method that derives the perplexity-to-uncertainty mapping from the principles of autoregressive token prediction, where higher perplexity indicates greater ambiguity in the coordinate token. Additionally, we will include an empirical correlation analysis between axial perplexities and ground-truth localization errors on the ScreenSpot datasets to verify the assumption. This will strengthen the claim that the approach is principled rather than purely heuristic. revision: yes
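A sketch of what that promised correlation analysis might look like, assuming per-example arrays of predictions, axial perplexities, and ground-truth points; Spearman rank correlation is our choice, not necessarily the authors':

```python
import numpy as np
from scipy.stats import spearmanr

def perplexity_error_correlation(preds, ppls, gts):
    """preds, gts: (N, 2) arrays of (x, y); ppls: (N, 2) axial perplexities.

    Returns per-axis (rho, p-value) pairs. A strong positive rho on each
    axis would support the load-bearing premise that perplexity tracks
    localization error; rho near zero would undercut it.
    """
    err = np.abs(np.asarray(preds, float) - np.asarray(gts, float))
    ppls = np.asarray(ppls, float)
    return (spearmanr(ppls[:, 0], err[:, 0]),
            spearmanr(ppls[:, 1], err[:, 1]))
```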
- Referee: [Method] No explicit mapping function, normalization, or covariance construction details are supplied for converting axial perplexities to the Gaussian parameters. Without these, reproducibility and the claim of a 'principled' rather than heuristic approach cannot be assessed.
Authors: We apologize for the omission of explicit implementation details in the method section, which is critical for reproducibility. In the revised paper, we will provide the complete mapping function, including the normalization of axial perplexities (p_x, p_y) to a standard range, the construction of the covariance matrix as an anisotropic diagonal matrix scaled by the normalized perplexities, and the exact formula for the spatial probability field. We will also include pseudocode and a step-by-step example to clarify how the Gaussian parameters are derived from the VLM's token outputs. These additions will allow readers to fully assess and replicate the framework. revision: yes
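One plausible formalization of the promised details, consistent with the scaling σ_{z,i} = β · PPL^(i)_z quoted elsewhere in the review; the diagonal covariance and normalized-mixture form are assumptions on our part, not the paper's stated equations.

```latex
% Hedged sketch: axis-aligned covariance per hypothesis i, field as a
% mixture over the N sampled coordinate hypotheses \mu_i.
\[
\sigma_{z,i} = \beta \cdot \mathrm{PPL}^{(i)}_z, \quad z \in \{x, y\},
\qquad
\Sigma_i = \operatorname{diag}\!\big(\sigma_{x,i}^2,\ \sigma_{y,i}^2\big),
\]
\[
P(\mathbf{u}) \;\propto\; \sum_{i=1}^{N}
\exp\!\Big(-\tfrac{1}{2}\,(\mathbf{u}-\boldsymbol{\mu}_i)^{\top}
\Sigma_i^{-1}(\mathbf{u}-\boldsymbol{\mu}_i)\Big).
\]
```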
Circularity Check
No circularity; derivation uses VLM perplexity outputs directly without self-referential reduction
full rationale
The paper's central step states that token-level perplexity 'naturally reflects spatial uncertainty' and is 'directly converted' into an anisotropic Gaussian field to guide proposals and Shape-Aware Zooming. This is presented as an insight applied to existing VLM token outputs rather than a quantity fitted or defined in terms of the downstream predictions. No equations reduce the uncertainty field back to the input coordinates by construction, no parameters are fitted on a subset and then 'predicted,' and no load-bearing self-citations or uniqueness theorems are invoked. The subsequent aggregation and zooming steps operate on the constructed field but do not feed back into its construction. The chain is therefore self-contained and does not collapse to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Token-level perplexity in coordinate generation naturally reflects spatial uncertainty.
Lean theorems connected to this paper
- IndisputableMonolith/Cost (J(x) = ½(x + x⁻¹) − 1) · washburn_uniqueness_aczel · tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: "we map the x- and y-axis perplexities into anisotropic variances (σ_x, σ_y), forming a continuous 2D Gaussian probability field ... σ_{z,i} = β · PPL^(i)_z"
- Foundation/AlphaCoordinateFixation (parameter-free α=1 pin) · alpha_pin_under_high_calibration · tagged unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
  Passage: "We set the sampling temperature to τ=0.75 ... β=50 ... β=80 ... α∈{5,8} ... λ=0.5 ... K_local=3, K_global=2"
What do these tags mean?
- matches: the paper's claim is directly supported by a theorem in the formal canon.
- supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: the paper appears to rely on the theorem as machinery.
- contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
- [2] Cao, R., Lei, F., Wu, H., Chen, J., Fu, Y., Gao, H., Xiong, X., Zhang, H., Hu, W., Mao, Y., et al.: Spider2-V: How far are multimodal agents from automating data science and engineering workflows? Advances in Neural Information Processing Systems 37, 107703–107744 (2024)
- [3] Chen, G., Zhou, X., Shao, R., Lyu, Y., Zhou, K., Wang, S., Li, W., Li, Y., Qi, Z., Nie, L.: Less is more: Empowering GUI agent with context-aware simplification. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 5901–5911 (2025)
- [4] Chen, W., Cui, J., Hu, J., Qin, Y., Fang, J., Zhao, Y., Wang, C., Liu, J., Chen, G., Huo, Y., et al.: GUICourse: From general vision language model to versatile GUI agent. In: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 21936–21959 (2025)
- [5] Cheng, K., Sun, Q., Chu, Y., Xu, F., YanTao, L., Zhang, J., Wu, Z.: SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 9313–9332 (2024)
- [6] Deng, X., Gu, Y., Zheng, B., Chen, S., Stevens, S., Wang, B., Sun, H., Su, Y.: Mind2Web: Towards a generalist agent for the web. Advances in Neural Information Processing Systems 36, 28091–28114 (2023)
- [7] Gou, B., Wang, R., Zheng, B., Xie, Y., Chang, C., Shu, Y., Sun, H., Su, Y.: Navigating the digital world as humans do: Universal visual grounding for GUI agents. arXiv preprint arXiv:2410.05243 (2024)
- [8] Gu, Z., Zeng, Z., Xu, Z., Zhou, X., Shen, S., Liu, Y., Zhou, B., Meng, C., Xia, T., Chen, W., et al.: UI-Venus technical report: Building high-performance UI agents with RFT. arXiv preprint arXiv:2508.10833 (2025)
- [9] Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., Wang, J., et al.: Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062 (2025)
- [10] Gupta, T., Kembhavi, A.: Visual programming: Compositional visual reasoning without training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14953–14962 (2023)
- [11] Hong, W., Wang, W., Lv, Q., Xu, J., Yu, W., Ji, J., Wang, Y., Wang, Z., Dong, Y., Ding, M., et al.: CogAgent: A visual language model for GUI agents. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 14281–14290 (2024)
- [12] Huang, Z., Cheng, Z., Pan, J., Hou, Z., Zhan, M.: SpiritSight agent: Advanced GUI agent with one look. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 29490–29500 (2025)
- [13] Hurst, A., Lerer, A., Goucher, A.P., Perelman, A., Ramesh, A., Clark, A., Ostrow, A., Welihinda, A., Hayes, A., Radford, A., et al.: GPT-4o system card. arXiv preprint arXiv:2410.21276 (2024)
- [14] Jiang, Z., Xie, S., Li, W., Zu, W., Li, P., Qiu, J., Pei, S., Ma, L., Huang, T., Wang, M., et al.: Zoom in, click out: Unlocking and evaluating the potential of zooming for GUI grounding. arXiv preprint arXiv:2512.05941 (2025)
- [15] Koh, J.Y., Lo, R., Jang, L., Duvvur, V., Lim, M., Huang, P.Y., Neubig, G., Zhou, S., Salakhutdinov, R., Fried, D.: VisualWebArena: Evaluating multimodal agents on realistic visual web tasks. In: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 881–905 (2024)
- [16] Lei, B., Xu, N., Payani, A., Hong, M., Liao, C., Cao, Y., Ding, C.: GUI-Spotlight: Adaptive iterative focus refinement for enhanced GUI visual grounding. arXiv preprint arXiv:2510.04039 (2025)
- [17] Li, G., Xu, J., Zhao, Y., Peng, Y.: DyFo: A training-free dynamic focus visual search for enhancing LMMs in fine-grained visual understanding. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 9098–9108 (2025)
- [18] Li, K., Meng, Z., Lin, H., Luo, Z., Tian, Y., Ma, J., Huang, Z., Chua, T.S.: ScreenSpot-Pro: GUI grounding for professional high-resolution computer use. In: Proceedings of the 33rd ACM International Conference on Multimedia. pp. 8778–8786 (2025)
- [19] Lin, K.Q., Li, L., Gao, D., Yang, Z., Wu, S., Bai, Z., Lei, S.W., Wang, L., Shou, M.Z.: ShowUI: One vision-language-action model for GUI visual agent. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19498–19508 (2025)
- [20] Liu, Y., Li, P., Xie, C., Hu, X., Han, X., Zhang, S., Yang, H., Wu, F.: InfiGUI-R1: Advancing multimodal GUI agents from reactive actors to deliberative reasoners. arXiv preprint arXiv:2504.14239 (2025)
- [21] Liu, Y., Liu, Z., Zhu, S., Li, P., Xie, C., Wang, J., Hu, X., Han, X., Yuan, J., Wang, X., et al.: InfiGUI-G1: Advancing GUI grounding with adaptive exploration policy optimization. arXiv preprint arXiv:2508.05731 (2025)
- [22] Luo, T., Logeswaran, L., Johnson, J., Lee, H.: Visual test-time scaling for GUI agent grounding. arXiv preprint arXiv:2505.00684 (2025)
- [23] Pahuja, V., Lu, Y., Rosset, C., Gou, B., Mitra, A., Whitehead, S., Su, Y., Hassan, A.: Explorer: Scaling exploration-driven web trajectory synthesis for multimodal web agents. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 6300–6323 (2025)
- [25] Park, J., Tang, P., Das, S., Appalaraju, S., Singh, K.Y., Manmatha, R., Ghadar, S.: R-VLM: Region-aware vision language model for precise GUI grounding. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 9669–9685 (2025)
- [26] Qin, Y., Ye, Y., Fang, J., Wang, H., Liang, S., Tian, S., Zhang, J., Li, J., Li, Y., Huang, S., et al.: UI-TARS: Pioneering automated GUI interaction with native agents. arXiv preprint arXiv:2501.12326 (2025)
- [27] Su, Z., Xia, P., Guo, H., Liu, Z., Ma, Y., Qu, X., Liu, J., Li, Y., Zeng, K., Yang, Z., et al.: Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918 (2025)
- [28] Sun, Y., Zhao, S., Yu, T., Wen, H., Va, S., Xu, M., Li, Y., Zhang, C.: GUI-Xplore: Empowering generalizable GUI agents with one exploration. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 19477–19486 (2025)
- [29] Wang, J., Xu, H., Jia, H., Zhang, X., Yan, M., Shen, W., Zhang, J., Huang, F., Sang, J.: Mobile-Agent-v2: Mobile device operation assistant with effective navigation via multi-agent collaboration. Advances in Neural Information Processing Systems 37, 2686–2710 (2024)
- [30] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
- [31] Wang, Y., Kordi, Y., Mishra, S., Liu, A., Smith, N.A., Khashabi, D., Hajishirzi, H.: Self-Instruct: Aligning language models with self-generated instructions. In: Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 13484–13508 (2023)
- [32] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems 35, 24824–24837 (2022)
- [33] Wei, L., He, L., Lan, J., Dong, L., Cai, Y., Li, S., Zhu, H., Wang, W., Kong, L., Wang, Y., et al.: Zooming without zooming: Region-to-image distillation for fine-grained multimodal perception. arXiv preprint arXiv:2602.11858 (2026)
- [34] Wu, H., Chen, H., Cai, Y., Liu, C., Ye, Q., Yang, M.H., Wang, Y.: DiMo-GUI: Advancing test-time scaling in GUI grounding via modality-aware visual reasoning. arXiv preprint arXiv:2507.00008 (2025)
- [35] Wu, Q., Cheng, K., Yang, R., Zhang, C., Yang, J., Jiang, H., Mu, J., Peng, B., Qiao, B., Tan, R., et al.: GUI-Actor: Coordinate-free visual grounding for GUI agents. arXiv preprint arXiv:2506.03143 (2025)
- [36] Wu, Z., Wu, Z., Xu, F., Wang, Y., Sun, Q., Jia, C., Cheng, K., Ding, Z., Chen, L., Liang, P.P., et al.: OS-Atlas: A foundation action model for generalist GUI agents. arXiv preprint arXiv:2410.23218 (2024)
- [37] Xie, T., Deng, J., Li, X., Yang, J., Wu, H., Chen, J., Hu, W., Wang, X., Xu, Y., Wang, Z., et al.: Scaling computer-use grounding via user interface decomposition and synthesis. arXiv preprint arXiv:2505.13227 (2025)
- [38] Yang, Y., Li, D., Dai, Y., Yang, Y., Luo, Z., Zhao, Z., Hu, Z., Huang, J., Saha, A., Chen, Z., et al.: GTA1: GUI test-time scaling agent. arXiv preprint arXiv:2507.05791 (2025)
- [39] Yang, Y., Wang, Y., Li, D., Luo, Z., Chen, B., Huang, C., Li, J.: Aria-UI: Visual grounding for GUI instructions. In: Findings of the Association for Computational Linguistics: ACL 2025. pp. 22418–22433 (2025)
- [40] Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T., Cao, Y., Narasimhan, K.: Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems 36, 11809–11822 (2023)
- [41] Ye, Q., Ahmed, M., Pryzant, R., Khani, F.: Prompt engineering a prompt engineer. In: Findings of the Association for Computational Linguistics: ACL 2024. pp. 355–385 (2024)
- [42] Ye, X., Li, Y., Dai, W., Liu, M., Chen, Z., Han, Z., Min, H., Ren, J., Zhang, X., Yang, W., et al.: GUI-ARP: Enhancing grounding with adaptive region perception for GUI agents. arXiv preprint arXiv:2509.15532 (2025)
- [43] Yu, W., Yang, Z., Wan, J., Song, S., Tang, J., Cheng, W., Liu, Y., Bai, X.: OmniParser V2: Structured-points-of-thought for unified visual text parsing and its generality to multimodal large language models. arXiv preprint arXiv:2502.16161 (2025)
- [44] Yuan, X., Zhang, J., Li, K., Cai, Z., Yao, L., Chen, J., Wang, E., Hou, Q., Chen, J., Jiang, P.T., Li, B.: SE-GUI: Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)
- [45] Yuan, X., Zhang, J., Li, K., Cai, Z., Yao, L., Chen, J., Wang, E., Hou, Q., Chen, J., Jiang, P.T., et al.: Enhancing visual grounding for GUI agents via self-evolutionary reinforcement learning. arXiv preprint arXiv:2505.12370 (2025)
- [46] Zhang, J., Khayatkhoei, M., Chhikara, P., Ilievski, F.: MLLMs know where to look: Training-free perception of small visual details with multimodal LLMs. arXiv preprint arXiv:2502.17422 (2025)
- [47] Zheng, B., Gou, B., Kil, J., Sun, H., Su, Y.: GPT-4V(ision) is a generalist web agent, if grounded. In: International Conference on Machine Learning. pp. 61349–61385. PMLR (2024)
- [48] Zheng, Z., Yang, M., Hong, J., Zhao, C., Xu, G., Yang, L., Shen, C., Yu, X.: DeepEyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362 (2025)
- [49] Zhou, Y., Dai, S., Wang, S., Zhou, K., Jia, Q., Xu, J.: GUI-G1: Understanding R1-Zero-like training for visual grounding in GUI agents. arXiv preprint arXiv:2505.15810 (2025)