
arxiv: 2505.23678 · v3 · pith:4RMMTDPLnew · submitted 2025-05-29 · 💻 cs.CV

Grounded Reinforcement Learning for Visual Reasoning

Pith reviewed 2026-05-22 01:00 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual reasoning · reinforcement learning · vision-language models · visual grounding · spatial reasoning · multi-turn RL · visual search · GUI grounding

The pith

ViGoRL trains vision-language models with RL that anchors each reasoning step to specific visual coordinates, reaching 86.4 percent on V*Bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reinforcement learning method for vision-language models that forces each step in a reasoning chain to reference exact locations in the input image. This addresses the extra difficulty visual tasks pose compared to text-only problems, where models must actively direct attention and tie abstract thoughts to spatial evidence. A multi-turn extension lets the model zoom into those locations for finer detail when needed. Across benchmarks for spatial reasoning, visual search, and web element grounding, the grounded approach beats both supervised fine-tuning and standard RL that lacks explicit location anchors. The results indicate that building visual grounding into the learning process can improve attention, subgoal setting, and self-verification in visual decision making.

Core claim

ViGoRL is a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates in the image. When detailed inspection is required, a multi-turn RL framework lets the model dynamically zoom into those coordinates as reasoning proceeds. This produces spatially grounded traces that guide attention to relevant regions and yields consistent gains over baselines without grounding on tasks including SAT-2, BLINK, V*Bench, ScreenSpot, and VisualWebArena.
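
To make the mechanism concrete, here is a minimal Python sketch of the two operations the claim describes: reading coordinates out of a reasoning step, and turning one coordinate into a zoom window. The trace format (pixel coordinates written as `(x, y)`), the function names `extract_points` and `zoom_box`, and the fixed crop size are all assumptions for illustration. The abstract does not specify how ViGoRL encodes coordinates or sizes its crops.

```python
import re
from typing import List, Tuple

# Assumed (hypothetical) trace format: coordinates appear inline as "(x, y)" in pixels.
COORD_RE = re.compile(r"\((\d+),\s*(\d+)\)")


def extract_points(trace: str) -> List[Tuple[int, int]]:
    """Return every (x, y) pair mentioned in a reasoning trace, in order."""
    return [(int(x), int(y)) for x, y in COORD_RE.findall(trace)]


def zoom_box(
    point: Tuple[int, int],
    image_size: Tuple[int, int],
    crop_size: Tuple[int, int],
) -> Tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) crop centred on `point`.

    The box is shifted, not shrunk, when the point is near an image edge,
    so the crop always keeps the requested size (capped at the image size).
    """
    x, y = point
    width, height = image_size
    crop_w = min(crop_size[0], width)
    crop_h = min(crop_size[1], height)
    left = min(max(x - crop_w // 2, 0), width - crop_w)
    top = min(max(y - crop_h // 2, 0), height - crop_h)
    return (left, top, left + crop_w, top + crop_h)


if __name__ == "__main__":
    trace = "The mug is near the sink (120, 340). The label on it (15, 20) is red."
    for point in extract_points(trace):
        print(point, "->", zoom_box(point, image_size=(640, 480), crop_size=(200, 200)))
```

In a multi-turn setup of the kind described, the returned box would be used to crop the image and feed the crop back to the model as the next turn's observation. How that feedback is actually formatted is not stated in the source.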

What carries the argument

Spatially grounded reasoning traces produced by RL, in which every step is tied to specific visual coordinates, plus multi-turn interaction that enables dynamic zooming into predicted locations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same anchoring technique could be tested on tasks like medical image interpretation where precise location references reduce errors.
  • Grounded traces may increase user trust by making each reasoning step visibly linked to image evidence.
  • Scaling the multi-turn zoom mechanism to longer sequences or video inputs might extend the benefits to dynamic visual environments.

Load-bearing premise

Forcing the model to link every reasoning step to predicted visual coordinates and allowing dynamic zooming will deliver reliable gains without creating new errors in attention or reward design.

What would settle it

An ablation on V*Bench that removes coordinate anchoring and zooming, then measures whether accuracy falls substantially below 86.4 percent, would show whether the grounding mechanism is required for the reported improvements.
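
One way such an ablation could be scored is with a paired bootstrap over per-example correctness, which gives an interval for the accuracy gap instead of a single number. The sketch below is illustrative only: the function name `paired_bootstrap_diff`, the 0/1 correctness encoding, and the 95 percent percentile interval are assumptions, not a protocol from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_diff(
    full: List[int],
    ablated: List[int],
    n_resamples: int = 2000,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Accuracy gap (full minus ablated) with a 95% percentile bootstrap interval.

    `full` and `ablated` hold 0/1 correctness for the same examples in the
    same order, so each resample keeps the pairing between the two models.
    """
    if len(full) != len(ablated) or not full:
        raise ValueError("inputs must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(full)
    observed = (sum(full) - sum(ablated)) / n
    diffs = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(full[i] - ablated[i] for i in idx) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, low, high


if __name__ == "__main__":
    # Placeholder correctness vectors, not results from the paper.
    grounded = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
    no_grounding = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
    print(paired_bootstrap_diff(grounded, no_grounding))
```

If the interval for the gap excludes zero, the ablation supports the claim that grounding matters on that benchmark; an interval straddling zero would leave the question open.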

Figures

Figures reproduced from arXiv: 2505.23678 by Aviral Kumar, Ayush Jain, Gabriel Sarch, Katerina Fragkiadaki, Michael J. Tarr, Naitik Khandelwal, Snigdha Saha.

Figure 1
Grounded visual reasoning enables interpretable and accurate answers. ViGoRL decomposes the task into a sequence of natural language thoughts anchored in image regions. In contrast, Vanilla GRPO and SFT baselines produce ungrounded and incorrect responses.
Figure 2
Without actively reinforcing visually grounded behaviors, RL collapses onto shortcuts that maximize immediate rewards at the expense of richer visual reasoning. Standard CoT and Vanilla GRPO (left and center) exhibit visually ungrounded reasoning, relying on vague references to scene elements (shown in yellow), which often results in incorrect answers (marked in red). In contrast, Visually Grounded RL (rig…
Figure 3
Overview of the ViGoRL approach. (Left) We use MCTS with a teacher model to generate reasoning chains grounded in specific image regions. (Middle) These reasoning trees are linearized and used for supervised fine-tuning (SFT) to train a base model. (Right) We apply GRPO with an outcome-based reward to further refine the grounded reasoning.
Figure 4
Human evaluation of grounded reasoning. Participants judged the grounded predictions as both accurate and helpful when correct. To assess our model’s grounded reasoning traces, we conducted a human study evaluating whether predicted coordinates (1) correctly referred to the intended image region, and (2) helped participants understand the associated reasoning step (details are shown in Appendix A7). F…
Original abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViGoRL, a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates. It proposes a novel multi-turn RL framework enabling dynamic zooming into those coordinates for fine-grained exploration. The approach is evaluated on spatial reasoning (SAT-2, BLINK), visual search (V*Bench, with 86.4% reported), and GUI/web grounding (ScreenSpot, VisualWebArena) benchmarks, where it outperforms supervised fine-tuning and conventional RL baselines lacking explicit grounding. Additional claims include amplification of behaviors such as region exploration and visual verification, supported by human evaluations of spatial accuracy and reasoning helpfulness.

Significance. If the results hold after addressing the isolation of the grounding mechanism, the work would represent a meaningful advance in visual reasoning for VLMs by demonstrating how explicit spatial grounding within RL can improve both performance and interpretability. Credit is due for the diverse benchmark coverage spanning spatial, search, and interactive tasks, as well as the human study confirming that visual references aid understanding of model steps. The multi-turn zooming framework addresses a practical limitation in handling small or detailed visual elements.

major comments (2)
  1. [Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.
  2. [Experiments] Experiments section (benchmark tables and ablation studies): The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.
minor comments (2)
  1. [Abstract] The abstract references human evaluations showing that visual references are 'spatially accurate' and 'helpful,' but provides no protocol details, participant count, or quantitative metrics for these assessments.
  2. [Method] Notation for the grounding mechanism and the multi-turn reward components could be clarified with explicit equations or pseudocode in the method section to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.

    Authors: We appreciate the referee's point regarding the need for a more controlled ablation to isolate the effect of explicit coordinate grounding. Our conventional RL baselines lack the grounding mechanism but may not perfectly match the multi-turn schedule in all cases. To rigorously address this, we will introduce a new ablation in the revised manuscript that uses the same multi-turn RL framework, reward structure, and zoom mechanics, but without requiring the model to output coordinate-anchored reasoning traces. This will help confirm that the performance gains, including the 86.4% on V*Bench, are attributable to the grounding component.

    Revision: yes

  2. Referee: [Experiments] Experiments section (benchmark tables and ablation studies): The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.

    Authors: Thank you for highlighting these omissions. We will revise the Experiments section to include error bars and standard deviations for all reported performance numbers, based on multiple training runs with different random seeds. We will also provide a detailed description of how the grounding term is integrated into the overall reward function, including its mathematical formulation and weighting. Furthermore, we will add ablation studies that vary the reward shaping to demonstrate the robustness of our results and to further isolate the contribution of grounding.

    Revision: yes
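
The error bars promised in the second response amount to a mean and sample standard deviation over independent training runs. A minimal sketch of that reporting step follows. The helper names `summarize_runs` and `format_result` are hypothetical, and the scores in the usage example are placeholders, not numbers from the paper.

```python
import statistics
from typing import List, Tuple


def summarize_runs(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed benchmark scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_result(scores: List[float], digits: int = 1) -> str:
    """Format per-seed scores as 'mean ± std (n=runs)' for a results table."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f} (n={len(scores)})"


if __name__ == "__main__":
    # Placeholder accuracies for three seeds; not taken from the paper.
    print(format_result([80.0, 82.0, 84.0]))
```

`statistics.stdev` uses the n-1 denominator, which is the usual choice when a handful of seeds stands in for the full distribution of training outcomes.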

Circularity Check

0 steps flagged

No circularity: empirical training outcomes are independent of inputs

full rationale

The paper introduces ViGoRL as an RL training procedure that adds explicit coordinate anchoring and multi-turn zoom mechanics to a vision-language model. All performance numbers (e.g., 86.4% on V*Bench) are reported as measured results after training on the described benchmarks and comparing against supervised fine-tuning and standard RL baselines. No equations, uniqueness theorems, or self-citations are used to derive the method or results; the central claims rest on external empirical comparisons that do not reduce to the training inputs by construction. The work is therefore self-contained as an experimental contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard RL setup and the new ViGoRL training loop itself.

pith-pipeline@v0.9.0 · 5820 in / 1048 out tokens · 33110 ms · 2026-05-22T01:00:10.382918+00:00 · methodology



Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

    cs.CV 2026-05 unverdicted novelty 7.0

    Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object de...

  2. PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

  3. Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 6.0

    Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.

  4. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

  5. Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs

    cs.CV 2026-04 unverdicted novelty 6.0

    Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

  6. Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs

    cs.CV 2025-05 unverdicted novelty 6.0

    Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

  7. AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents

    cs.CV 2026-05 unverdicted novelty 5.0

    AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.

  8. DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding

    cs.AI 2026-05 unverdicted novelty 5.0

    DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without ...

  9. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  10. Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

    cs.CV 2026-04 unverdicted novelty 5.0

    Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

  11. Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning

    cs.AI 2025-09 unverdicted novelty 5.0

    MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...

  12. OpenWorldLib: A Unified Codebase and Definition of Advanced World Models

    cs.CV 2026-04 unverdicted novelty 4.0

    OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 11 Pith papers · 39 internal anchors

  1. [1]

    GPT-4 Technical Report

    Openai. gpt-4 technical report.arXiv preprint arxiv:2303.08774, 2023

  2. [2]

    Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35: 23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35: 23716–23736, 2022

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  4. [4]

    Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997

    Dana H Ballard, Mary M Hayhoe, Polly K Pook, and Rajesh PN Rao. Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997

  5. [5]

    Visual serial processing deficits explain divergences in human and vlm reasoning.arXiv preprint arXiv:2509.25142, 2025

    Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D Cohen, Taylor W Webb, and Thomas L Griffiths. Visual serial processing deficits explain divergences in human and vlm reasoning.arXiv preprint arXiv:2509.25142, 2025

  6. [6]

    Under- standing the limits of vision language models through the lens of the binding problem.Advances in Neural Information Processing Systems, 37:113436–113460, 2024

    Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Under- standing the limits of vision language models through the lens of the binding problem.Advances in Neural Information Processing Systems, 37:113436–113460, 2024

  7. [7]

    PaLI: A Jointly-Scaled Multilingual Language-Image Model

    Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

  8. [8]

    SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

    Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024

  9. [9]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305. 06500

  10. [10]

    Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv preprint arXiv:2409.17146, 2024

  11. [11]

    Attention over learned object embeddings enables complex visual reasoning

    Zhengyuan Ding, Yuwei Chen, Yichong Xu, Zhe Wang, Xintao Han, Dong Yu, and Zhou Yu. Attention over learned object embeddings enables complex visual reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  12. [12]

    Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025

    Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025. URLhttps://arxiv.org/abs/2411.14432

  13. [13]

    BLINK: Multimodal Large Language Models Can See but Not Perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024

  14. [14]

    Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452,

    Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding, 2025. URLhttps://arxiv.org/abs/2501.05452. 12

  15. [15]

    Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

    Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

  16. [16]

    Grounded decoding with visual descriptions reduces hallucination in large vision-language models

    Sarthak Ghosh, Ben Lee, Jean-Baptiste Alayrac, Xuhong Zhai, Christoph Feichtenhofer, Joao Carreira, and Ishan Misra. Grounded decoding with visual descriptions reduces hallucination in large vision-language models. InInternational Conference on Learning Representations (ICLR), 2024

  17. [17]

    Navigating the digital world as humans do: Universal visual grounding for GUI agents

    Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

  18. [18]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  19. [19]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

  20. [20]

    Robust compositional visual reasoning via language-guided neural module networks

    Arjun Gupta, Xi Victoria Lin, Chunyuan Zhang, Michel Galley, Jianfeng Gao, and Car- los Guestrin Ferrer. Robust compositional visual reasoning via language-guided neural module networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

  21. [21]

    Visual programming: Compositional visual reason- ing without training

    Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training, 2022. URLhttps://arxiv.org/abs/2211.11559

  22. [22]

    The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990

    Stevan Harnad. The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990

  23. [23]

    Cogagent: A visual language model for gui agents

    Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

  24. [24]

    Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019

    David Hoppe and Constantin A Rothkopf. Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019

  25. [26]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

  26. [27]

    Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024

    Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024. URL https://arxiv.org/abs/ 2312.03052

  27. [28]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

  28. [29]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 13

  29. [30]

    Segment Anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643, 2023

  30. [31]

    VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

    Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks.arXiv preprint arXiv:2401.13649, 2024

  31. [32]

    Large language models are zero-shot reasoners, 2023

    Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205. 11916

  32. [33]

    Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025

    Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf. Preprint

  33. [34]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning, 2025. URLhttps://arxiv.org/abs/2504.06958

  34. [35]

    V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025

    Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919

  35. [36]

    Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

    Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

  36. [37]

    Improved baselines with visual instruction tuning, 2023

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

  37. [38]

    Understanding R1-Zero-Like Training: A Critical Perspective

    Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https: //arxiv.org/abs/2503.20783

  38. [39]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning, 2025. URL https://arxiv.org/ abs/2503.01785

  39. [41]

    UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025

  40. [42]

    s1: Simple test-time scaling

    Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393

  41. [43]

    Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    NVIDIA: Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin,...

  42. [44]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

  43. [45]

    Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

    Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. URL https://arxiv.org/abs/2408.07199

  44. [46]

    Cogcom: Compositional visual reasoning with chain-of-manipulations

    Jinyi Qi, Tao Zhang, Rui Chen, Xiaoxue Li, Yizhou Zhang, and Kai-Wei Chang. Cogcom: Compositional visual reasoning with chain-of-manipulations. In International Conference on Learning Representations (ICLR), 2025

  45. [47]

    UI-TARS: Pioneering Automated GUI Interaction with Native Agents

    Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

  46. [48]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

  47. [49]

    Vision language models are blind

    Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025. URL https://arxiv.org/abs/2407.06581

  48. [50]

    Sat: Spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025. URL https://arxiv.org/abs/2412.07755

  49. [51]

    Vlm agents generate their own memories: Distilling experience into embodied programs of thought

    Gabriel Herbert Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  50. [52]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

  51. [53]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/abs/2504.07615

  52. [54]

    HybridFlow: A Flexible and Efficient RLHF Framework

    Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

  53. [55]

    RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics

    Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. To appear

  54. [56]

    ViperGPT: Visual Inference via Python Execution for Reasoning

    Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023

  55. [57]

    Reason-rft: Reinforcement fine-tuning for visual reasoning

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning, 2025. URL https://arxiv.org/abs/2503.20752

  56. [58]

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530

  57. [60]

    Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

  58. [61]

    Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, et al. Kimi-VL technical report, 2025. URL https://arxiv.org/abs/2504.07491

  59. [62]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality, 2022. URL https://arxiv.org/abs/2204.03162

  60. [63]

    A feature-integration theory of attention

    Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980

  61. [64]

    Visual routines

    Shimon Ullman. Visual routines. Cognition, 18(1-3):97–159, 1984

  62. [65]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

  63. [66]

    RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

    Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL https://arxiv.org/abs/...

  64. [67]

    Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

  65. [68]

    Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

    Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671

  66. [69]

    V*: Guided visual search as a core mechanism in multimodal llms

    Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023

  67. [70]

    Thinking llms: General instruction following with thought generation

    Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Thinking llms: General instruction following with thought generation, 2024. URL https://arxiv.org/abs/2410.10630

  68. [71]

    Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models

    Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models, 2024. URL https://arxiv.org/abs/2404.03622

  70. [73]

    OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

    Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents.arXiv preprint arXiv:2410.23218, 2024

  71. [74]

    Grok-1.5 vision preview

    xAI. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2025-05-21

  72. [75]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. URL https://arxiv.org/abs/2411.10440

  73. [76]

    Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024

  74. [77]

    Active sensing in the categorization of visual patterns

    Scott Cheng-Hsin Yang, Mate Lengyel, and Daniel M Wolpert. Active sensing in the categorization of visual patterns. eLife, 5:e12215, 2016

  75. [78]

    Theoretical perspectives on active sensing

    Scott Cheng-Hsin Yang, Daniel M Wolpert, and Máté Lengyel. Theoretical perspectives on active sensing. Current Opinion in Behavioral Sciences, 11:100–108, 2016

  76. [79]

    Mm-react: Prompting chatgpt for multimodal reasoning and action

    Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. 2023

  77. [80]

    Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

    Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024. URL https://arxiv.org/abs/2412.18319

  78. [81]

    Eye Movements and Vision

    Alfred L. Yarbus. Eye Movements and Vision. Springer, 1967

  79. [82]

    Demystifying Long Chain-of-Thought Reasoning in LLMs

    Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. URL https://arxiv.org/abs/2502.03373

  80. [83]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476

Showing first 80 references.