Grounded Reinforcement Learning for Visual Reasoning
Pith reviewed 2026-05-22 01:00 UTC · model grok-4.3
The pith
ViGoRL trains vision-language models with RL that anchors each reasoning step to specific visual coordinates, reaching 86.4 percent on V*Bench.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ViGoRL is a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates in the image. When detailed inspection is required, a multi-turn RL framework lets the model dynamically zoom into those coordinates as reasoning proceeds. This produces spatially grounded traces that guide attention to relevant regions and yields consistent gains over baselines that lack grounding, on benchmarks including SAT-2, BLINK, V*Bench, ScreenSpot, and VisualWebArena.
What carries the argument
Spatially grounded reasoning traces produced by RL, in which every step is tied to specific visual coordinates, plus multi-turn interaction that enables dynamic zooming into predicted locations.
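As a rough illustration of the mechanism described above, the sketch below parses a coordinate from one reasoning step and computes a crop window for a zoom action, clamped to the image bounds. The `(x, y)` text format, the names `parse_point` and `zoom_box`, and the fixed window size are assumptions made for this example; the summary does not specify how ViGoRL encodes coordinates or sizes its zoom regions.

```python
import re
from typing import Optional, Tuple

# Assumed text format for a grounded step: "... (x, y) ..." in integer pixels.
POINT_RE = re.compile(r"\((\d+),\s*(\d+)\)")


def parse_point(step: str) -> Optional[Tuple[int, int]]:
    """Return the first (x, y) pair found in a reasoning step, or None."""
    match = POINT_RE.search(step)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def zoom_box(point: Tuple[int, int], image_size: Tuple[int, int],
             window: int = 256) -> Tuple[int, int, int, int]:
    """Return a (left, top, right, bottom) crop centred on point, kept inside the image."""
    x, y = point
    width, height = image_size
    w = min(window, width)    # never ask for a crop wider than the image
    h = min(window, height)
    left = min(max(x - w // 2, 0), width - w)
    top = min(max(y - h // 2, 0), height - h)
    return left, top, left + w, top + h


# Hypothetical reasoning step and image size, for illustration only.
step = "The red mug is near the sink at (900, 40), so I will look there."
point = parse_point(step)
box = zoom_box(point, image_size=(1024, 768))
print(point, box)
```

In a multi-turn loop, the returned box would be used to crop the image and feed the enlarged region back to the model as the next observation; that loop is omitted here because its details are not given in the text.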
Where Pith is reading between the lines
- The same anchoring technique could be tested on tasks like medical image interpretation where precise location references reduce errors.
- Grounded traces may increase user trust by making each reasoning step visibly linked to image evidence.
- Scaling the multi-turn zoom mechanism to longer sequences or video inputs might extend the benefits to dynamic visual environments.
Load-bearing premise
Forcing the model to link every reasoning step to predicted visual coordinates and allowing dynamic zooming will deliver reliable gains without creating new errors in attention or reward design.
What would settle it
An ablation on V*Bench that removes coordinate anchoring and zooming, then measures whether accuracy falls substantially below 86.4 percent, would show whether the grounding mechanism is required for the reported improvements.
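One way to make "substantially below" concrete is to compare the accuracy gap with its sampling noise. The sketch below uses a simple binomial standard error and treats the two runs as independent samples, which is a simplification because both models would be scored on the same benchmark items. The function name `accuracy_gap_z`, the ablated accuracy, and the item count are hypothetical placeholders, not values from the paper.

```python
import math


def accuracy_gap_z(acc_full: float, acc_ablated: float, n_items: int) -> float:
    """Approximate z-score for the gap between two accuracies measured on n_items each."""
    se_full = math.sqrt(acc_full * (1.0 - acc_full) / n_items)
    se_ablated = math.sqrt(acc_ablated * (1.0 - acc_ablated) / n_items)
    se_gap = math.sqrt(se_full ** 2 + se_ablated ** 2)
    return (acc_full - acc_ablated) / se_gap


# 0.864 is the reported V*Bench accuracy; 0.80 and 200 are made-up placeholders.
z = accuracy_gap_z(acc_full=0.864, acc_ablated=0.80, n_items=200)
print(f"z = {z:.2f}")
```

A paired test over per-item outcomes would be the stricter choice once item-level predictions from both models are available.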
Original abstract
While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks (SAT-2 and BLINK for spatial reasoning, V*Bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding), ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ViGoRL, a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates. It proposes a novel multi-turn RL framework enabling dynamic zooming into those coordinates for fine-grained exploration. The approach is evaluated on spatial reasoning (SAT-2, BLINK), visual search (V*Bench, with 86.4% reported), and GUI/web grounding (ScreenSpot, VisualWebArena) benchmarks, where it outperforms supervised fine-tuning and conventional RL baselines lacking explicit grounding. Additional claims include amplification of behaviors such as region exploration and visual verification, supported by human evaluations of spatial accuracy and reasoning helpfulness.
Significance. If the results hold after addressing the isolation of the grounding mechanism, the work would represent a meaningful advance in visual reasoning for VLMs by demonstrating how explicit spatial grounding within RL can improve both performance and interpretability. Credit is due for the diverse benchmark coverage spanning spatial, search, and interactive tasks, as well as the human study confirming that visual references aid understanding of model steps. The multi-turn zooming framework addresses a practical limitation in handling small or detailed visual elements.
major comments (2)
- [Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.
- [Experiments] Benchmark tables and ablation studies: The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.
minor comments (2)
- [Abstract] The abstract references human evaluations showing that visual references are 'spatially accurate' and 'helpful,' but provides no protocol details, participant count, or quantitative metrics for these assessments.
- [Method] Notation for the grounding mechanism and the multi-turn reward components could be clarified with explicit equations or pseudocode in the method section to improve reproducibility.
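To show the kind of pseudocode the last comment asks for, here is a minimal sketch of a reward that combines task correctness with a grounding-format term. Everything in it is an assumption made for illustration: the name `grounded_reward`, the `(x, y)` step format, the additive form, and the weight `w_ground` are hypothetical, and the review states that the paper does not describe how its grounding term enters the reward.

```python
import re
from typing import List

# Assumed marker of a grounded step: an "(x, y)" pair of integers in the text.
POINT_RE = re.compile(r"\(\d+,\s*\d+\)")


def grounded_reward(steps: List[str], answer: str, gold: str,
                    w_ground: float = 0.5) -> float:
    """Hypothetical reward: exact-match correctness plus a weighted grounding fraction."""
    correct = 1.0 if answer.strip().lower() == gold.strip().lower() else 0.0
    if steps:
        # Fraction of reasoning steps that cite at least one coordinate.
        grounded = sum(1 for s in steps if POINT_RE.search(s)) / len(steps)
    else:
        grounded = 0.0
    return correct + w_ground * grounded


# Illustrative call: one of two steps cites a coordinate and the answer matches.
reward = grounded_reward(["The button is at (120, 48).", "So the answer is yes."],
                         answer="Yes", gold="yes")
print(reward)
```

Writing the reward this explicitly would also make the requested reward-shaping ablations easy to specify, for example by sweeping `w_ground` down to zero.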
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.
- Referee: [Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.
  Authors: We appreciate the referee's point regarding the need for a more controlled ablation to isolate the effect of explicit coordinate grounding. Our conventional RL baselines lack the grounding mechanism but may not perfectly match the multi-turn schedule in all cases. To rigorously address this, we will introduce a new ablation in the revised manuscript that uses the same multi-turn RL framework, reward structure, and zoom mechanics, but without requiring the model to output coordinate-anchored reasoning traces. This will help confirm whether the performance gains, including the 86.4% on V*Bench, are attributable to the grounding component.
  Revision: yes
- Referee: [Experiments] Benchmark tables and ablation studies: The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.
  Authors: Thank you for highlighting these omissions. We will revise the Experiments section to include error bars and standard deviations for all reported performance numbers, based on multiple training runs with different random seeds. We will also provide a detailed description of how the grounding term is integrated into the overall reward function, including its mathematical formulation and weighting. Furthermore, we will add ablation studies that vary the reward shaping to demonstrate the robustness of our results and to further isolate the contribution of grounding.
  Revision: yes
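The error bars promised in the second response could be reported along the following lines. This is a minimal sketch that assumes one accuracy value per random seed; the helper names `summarize_runs` and `format_result` and the three accuracy values are placeholders invented for the example, not results from the paper.

```python
import statistics
from typing import List, Tuple


def summarize_runs(accuracies: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    mean = statistics.mean(accuracies)
    # Sample standard deviation needs at least two runs; report 0.0 for a single run.
    std = statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0
    return mean, std


def format_result(accuracies: List[float]) -> str:
    """Format a table cell as 'mean ± std (n=runs)'."""
    mean, std = summarize_runs(accuracies)
    return f"{mean:.1f} ± {std:.1f} (n={len(accuracies)})"


# Placeholder per-seed accuracies, purely illustrative.
print(format_result([85.0, 86.0, 87.0]))
```

Reporting the number of seeds next to each figure lets readers judge how much weight the standard deviation can bear.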
Circularity Check
No circularity: empirical training outcomes are independent of inputs
The paper introduces ViGoRL as an RL training procedure that adds explicit coordinate anchoring and multi-turn zoom mechanics to a vision-language model. All performance numbers (e.g., 86.4% on V*Bench) are reported as measured results after training on the described benchmarks and comparing against supervised fine-tuning and standard RL baselines. No equations, uniqueness theorems, or self-citations are used to derive the method or results; the central claims rest on external empirical comparisons that do not reduce to the training inputs by construction. The work is therefore self-contained as an experimental contribution.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: ViGoRL learns to produce spatially grounded reasoning traces... multi-turn RL framework enables the model to dynamically zoom into predicted coordinates
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean: reality_from_one_distinction (unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We employ MCTS to generate grounded reasoning traces... GRPO
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 12 Pith papers
- Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
  Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object de...
- PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
  PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
- Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
  Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.
- Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
  Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.
- Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
  Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.
- Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
  Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.
- AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
  AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.
- DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
  DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without ...
- Perceptual Flow Network for Visually Grounded Reasoning
  PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).
- Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
  Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.
- Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
  MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...
- OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
  OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.
Reference graph
Works this paper leans on
-
[1]
Openai. gpt-4 technical report.arXiv preprint arxiv:2303.08774, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35: 23716–23736, 2022
work page 2022
-
[3]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[4]
Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997
Dana H Ballard, Mary M Hayhoe, Polly K Pook, and Rajesh PN Rao. Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997
work page 1997
-
[5]
Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D Cohen, Taylor W Webb, and Thomas L Griffiths. Visual serial processing deficits explain divergences in human and vlm reasoning.arXiv preprint arXiv:2509.25142, 2025
-
[6]
Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Under- standing the limits of vision language models through the lens of the binding problem.Advances in Neural Information Processing Systems, 37:113436–113460, 2024
work page 2024
-
[7]
PaLI: A Jointly-Scaled Multilingual Language-Image Model
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[8]
SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305. 06500
work page 2023
-
[10]
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models
Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv preprint arXiv:2409.17146, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Attention over learned object embeddings enables complex visual reasoning
Zhengyuan Ding, Yuwei Chen, Yichong Xu, Zhe Wang, Xintao Han, Dong Yu, and Zhou Yu. Attention over learned object embeddings enables complex visual reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[12]
Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025
Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025. URLhttps://arxiv.org/abs/2411.14432
-
[13]
BLINK: Multimodal Large Language Models Can See but Not Perceive
Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[14]
Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding, 2025. URLhttps://arxiv.org/abs/2501.05452. 12
-
[15]
Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs
Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[16]
Grounded decoding with visual descriptions reduces hallucination in large vision-language models
Sarthak Ghosh, Ben Lee, Jean-Baptiste Alayrac, Xuhong Zhai, Christoph Feichtenhofer, Joao Carreira, and Ishan Misra. Grounded decoding with visual descriptions reduces hallucination in large vision-language models. InInternational Conference on Learning Representations (ICLR), 2024
work page 2024
-
[17]
Navigating the digital world as humans do: Universal visual grounding for GUI agents
Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT
work page 2025
-
[18]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[20]
Robust compositional visual reasoning via language-guided neural module networks
Arjun Gupta, Xi Victoria Lin, Chunyuan Zhang, Michel Galley, Jianfeng Gao, and Car- los Guestrin Ferrer. Robust compositional visual reasoning via language-guided neural module networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2021
work page 2021
-
[21]
Visual programming: Compositional visual reason- ing without training
Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training, 2022. URLhttps://arxiv.org/abs/2211.11559
-
[22]
The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990
Stevan Harnad. The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990
work page 1990
-
[23]
Cogagent: A visual language model for gui agents
Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024
work page 2024
-
[24]
Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019
David Hoppe and Constantin A Rothkopf. Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019
work page 2019
-
[26]
Visual sketchpad: Sketching as a visual chain of thought for multimodal language models
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403
-
[27]
Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024. URL https://arxiv.org/abs/ 2312.03052
-
[28]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
Scaling up visual and vision-language representation learning with noisy text supervision
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 13
work page 2021
-
[30]
Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[31]
VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks
Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks.arXiv preprint arXiv:2401.13649, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[32]
Large language models are zero-shot reasoners, 2023
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205. 11916
work page 2023
-
[33]
Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025
Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf. Preprint
work page 2025
-
[34]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning, 2025. URLhttps://arxiv.org/abs/2504.06958
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[35]
V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025
Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919
-
[36]
Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024
Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024
-
[37]
Improved baselines with visual instruction tuning, 2023
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023
work page 2023
-
[38]
Understanding R1-Zero-Like Training: A Critical Perspective
Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https: //arxiv.org/abs/2503.20783
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[39]
Visual-RFT: Visual Reinforcement Fine-Tuning
Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning, 2025. URL https://arxiv.org/ abs/2503.01785
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[41]
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[43]
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
NVIDIA, :, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin,...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[44]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022
work page 2022
-
[45]
Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents
Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. URLhttps://arxiv.org/abs/2408.07199
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[46]
Cogcom: Compositional visual reasoning with chain-of-manipulations
Jinyi Qi, Tao Zhang, Rui Chen, Xiaoxue Li, Yizhou Zhang, and Kai-Wei Chang. Cogcom: Compositional visual reasoning with chain-of-manipulations. InInternational Conference on Learning Representations (ICLR), 2025
work page 2025
-
[47]
UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents.arXiv preprint arXiv:2501.12326, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[48]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021
work page 2021
-
[49]
Vision language models are blind
Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025. URLhttps://arxiv.org/abs/2407.06581
-
[50]
Sat: Spa- tial aptitude training for multimodal language models
Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025. URL https://arxiv.org/abs/2412.07755
-
[51]
Vlm agents generate their own memories: Distilling experience into embodied programs of thought
Gabriel Herbert Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024
work page 2024
-
[52]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024.URL https://arxiv. org/abs/2402.03300
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[53]
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/ abs/2504.07615
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[54]
HybridFlow: A Flexible and Efficient RLHF Framework
Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework.arXiv preprint arXiv: 2409.19256, 2024
-
[55]
RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. To appear
-
[56]
ViperGPT: Visual Inference via Python Execution for Reasoning
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023
-
[57]
Reason-rft: Reinforcement fine-tuning for visual reasoning
Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning, 2025. URL https://arxiv.org/abs/2503.20752
-
[58]
Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530
-
[60]
Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...
-
[61]
Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, et al. Kimi-VL technical report, 2025. URL https://arxiv.org/abs/2504.07491
-
[62]
Winoground: Probing vision and language models for visio-linguistic compositionality
Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality, 2022. URL https://arxiv.org/abs/2204.03162
-
[63]
Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980
-
[64]
Visual routines
Shimon Ullman. Visual routines. Cognition, 18(1-3):97–159, 1984
-
[65]
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024
-
[66]
RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning
Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL https://arxiv.org/abs/...
-
[67]
Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022
-
[68]
Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671
-
[69]
Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023
-
[70]
Thinking llms: General instruction following with thought generation
Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Thinking llms: General instruction following with thought generation, 2024. URL https://arxiv.org/abs/2410.10630
-
[71]
Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models
Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models,
-
[72]
-
[73]
OS-ATLAS: A Foundation Action Model for Generalist GUI Agents
Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024
-
[74]
xAI. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2025-05-21
-
[75]
LLaVA-CoT: Let Vision Language Models Reason Step-by-Step
Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. URL https://arxiv.org/abs/2411.10440
-
[76]
Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024
-
[77]
Active sensing in the categorization of visual patterns
Scott Cheng-Hsin Yang, Mate Lengyel, and Daniel M Wolpert. Active sensing in the categorization of visual patterns. eLife, 5:e12215, 2016
-
[78]
Theoretical perspectives on active sensing
Scott Cheng-Hsin Yang, Daniel M Wolpert, and Máté Lengyel. Theoretical perspectives on active sensing. Current Opinion in Behavioral Sciences, 11:100–108, 2016
-
[79]
Mm-react: Prompting chatgpt for multimodal reasoning and action
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. 2023
-
[80]
Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024. URL https://arxiv.org/abs/2412.18319
-
[81]
Eye Movements and Vision
Alfred L. Yarbus. Eye Movements and Vision. Springer, 1967
-
[82]
Demystifying Long Chain-of-Thought Reasoning in LLMs
Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. URL https://arxiv.org/abs/2502.03373
-
[83]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476