Grounded Reinforcement Learning for Visual Reasoning

Aviral Kumar; Ayush Jain; Gabriel Sarch; Katerina Fragkiadaki; Michael J. Tarr; Naitik Khandelwal; Snigdha Saha

arxiv: 2505.23678 · v3 · pith:4RMMTDPLnew · submitted 2025-05-29 · 💻 cs.CV

Gabriel Sarch, Snigdha Saha, Naitik Khandelwal, Ayush Jain, Michael J. Tarr, Aviral Kumar, Katerina Fragkiadaki

Pith reviewed 2026-05-22 01:00 UTC · model grok-4.3

keywords: visual reasoning, reinforcement learning, vision-language models, visual grounding, spatial reasoning, multi-turn RL, visual search, GUI grounding

The pith

ViGoRL trains vision-language models with RL that anchors each reasoning step to specific visual coordinates, reaching 86.4 percent on V*Bench.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a reinforcement learning method for vision-language models that forces each step in a reasoning chain to reference exact locations in the input image. This addresses the extra difficulty visual tasks pose compared to text-only problems, where models must actively direct attention and tie abstract thoughts to spatial evidence. A multi-turn extension lets the model zoom into those locations for finer detail when needed. Across benchmarks for spatial reasoning, visual search, and web element grounding, the grounded approach beats both supervised fine-tuning and standard RL that lacks explicit location anchors. The results indicate that building visual grounding into the learning process can improve attention, subgoal setting, and self-verification in visual decision making.

Core claim

ViGoRL is a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates in the image. When detailed inspection is required, a multi-turn RL framework lets the model dynamically zoom into those coordinates as reasoning proceeds. This produces spatially grounded traces that guide attention to relevant regions and yields consistent gains over baselines without grounding on tasks including SAT-2, BLINK, V*Bench, ScreenSpot, and VisualWebArena.
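The zoom step in this claim can be pictured as cropping a window around a predicted point and feeding the crop back on the next turn. The following is a minimal sketch, assuming pixel coordinates and a fixed crop fraction. The name `zoom_box` and the `frac` parameter are hypothetical and not taken from the paper, which may size and place its crops differently.

```python
from typing import Tuple


def zoom_box(x: float, y: float, img_w: int, img_h: int,
             frac: float = 0.25) -> Tuple[int, int, int, int]:
    """Return (left, top, right, bottom) for a crop centered near (x, y).

    The crop covers `frac` of the image width and height (an assumed
    convention) and is shifted, not shrunk, so it stays inside the image.
    """
    if not 0.0 < frac <= 1.0:
        raise ValueError("frac must be in (0, 1]")
    crop_w = max(1, int(round(img_w * frac)))
    crop_h = max(1, int(round(img_h * frac)))
    # Center the window on the predicted point.
    left = int(round(x - crop_w / 2))
    top = int(round(y - crop_h / 2))
    # Shift the window back inside the image bounds if it spills over.
    left = min(max(left, 0), img_w - crop_w)
    top = min(max(top, 0), img_h - crop_h)
    return left, top, left + crop_w, top + crop_h
```

Shifting the window rather than shrinking it keeps every crop the same size, which is one simple way to give the model a consistent zoom level near image borders.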

What carries the argument

Spatially grounded reasoning traces produced by RL, in which every step is tied to specific visual coordinates, plus multi-turn interaction that enables dynamic zooming into predicted locations.
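To make "every step is tied to specific visual coordinates" concrete, here is a small sketch of a check that a reasoning trace is grounded in this sense. It assumes each step is a string that writes points as `(x, y)` in integer pixels. That textual format and the helper names are illustrative assumptions, not the paper's actual output schema.

```python
import re
from typing import List, Tuple

# Assumed convention: points appear in the text as "(x, y)" with integer pixels.
COORD = re.compile(r"\((\d+)\s*,\s*(\d+)\)")


def step_points(step: str) -> List[Tuple[int, int]]:
    """Extract every (x, y) point mentioned in one reasoning step."""
    return [(int(x), int(y)) for x, y in COORD.findall(step)]


def is_grounded(trace: List[str], img_w: int, img_h: int) -> bool:
    """True if the trace is non-empty, every step names at least one point,
    and every named point lies inside the image."""
    if not trace:
        return False
    for step in trace:
        points = step_points(step)
        if not points:
            return False  # a step with no visual anchor
        if any(not (0 <= x < img_w and 0 <= y < img_h) for x, y in points):
            return False  # an anchor outside the image
    return True
```

A check like this only verifies the form of a trace. Whether a point actually refers to the intended region is the question the paper's human evaluation addresses.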

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same anchoring technique could be tested on tasks like medical image interpretation where precise location references reduce errors.
Grounded traces may increase user trust by making each reasoning step visibly linked to image evidence.
Scaling the multi-turn zoom mechanism to longer sequences or video inputs might extend the benefits to dynamic visual environments.

Load-bearing premise

Forcing the model to link every reasoning step to predicted visual coordinates and allowing dynamic zooming will deliver reliable gains without creating new errors in attention or reward design.

What would settle it

An ablation on V*Bench that removes coordinate anchoring and zooming, then measures whether accuracy falls substantially below 86.4 percent, would show whether the grounding mechanism is required for the reported improvements.

Figures

Figures reproduced from arXiv: 2505.23678 by Aviral Kumar, Ayush Jain, Gabriel Sarch, Katerina Fragkiadaki, Michael J. Tarr, Naitik Khandelwal, Snigdha Saha.

**Figure 1.** Grounded visual reasoning enables interpretable and accurate answers. ViGoRL decomposes the task into a sequence of natural language thoughts anchored in image regions. In contrast, Vanilla GRPO and SFT baselines produce ungrounded and incorrect responses.

**Figure 2.** Without actively reinforcing visually grounded behaviors, RL collapses onto shortcuts that maximize immediate rewards at the expense of richer visual reasoning. Standard CoT and Vanilla GRPO (left and center) exhibit visually ungrounded reasoning, relying on vague references to scene elements (shown in yellow), which often results in incorrect answers (marked in red). In contrast, Visually Grounded RL (rig…

**Figure 3.** Overview of the ViGoRL approach. (Left) We use MCTS with a teacher model to generate reasoning chains grounded in specific image regions. (Middle) These reasoning trees are linearized and used for supervised fine-tuning (SFT) to train a base model. (Right) We apply GRPO with an outcome-based reward to further refine the grounded reasoning.

**Figure 4.** Human evaluation of grounded reasoning. Participants judged the grounded predictions as both accurate and helpful when correct. To assess our model’s grounded reasoning traces, we conducted a human study evaluating whether predicted coordinates (1) correctly referred to the intended image region, and (2) helped participants understand the associated reasoning step (details are shown in Appendix A7).
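The Figure 3 caption names GRPO with an outcome-based reward as the final training stage. As a rough illustration, GRPO is generally described as scoring each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization under stated assumptions: a binary correctness reward, population standard deviation, and a small `eps` for stability. None of these details are confirmed by the text above, and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize outcome rewards within one group of sampled responses.

    Each advantage is (reward - group mean) / (group std + eps). When all
    responses earn the same reward, every advantage is zero, so that group
    contributes no learning signal.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four sampled answers to one question, two correct (reward 1.0).
example = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the reward here depends only on the final answer, nothing in this step by itself rewards grounding. In the pipeline the caption describes, grounded behavior would have to come from the SFT warm start and the output format.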

Original abstract

While reinforcement learning (RL) over chains of thought has significantly advanced language models in tasks such as mathematics and coding, visual reasoning introduces added complexity by requiring models to direct visual attention, interpret perceptual inputs, and ground abstract reasoning in spatial evidence. We introduce ViGoRL (Visually Grounded Reinforcement Learning), a vision-language model trained with RL to explicitly anchor each reasoning step to specific visual coordinates. Inspired by human visual decision-making, ViGoRL learns to produce spatially grounded reasoning traces, guiding visual attention to task-relevant regions at each step. When fine-grained exploration is required, our novel multi-turn RL framework enables the model to dynamically zoom into predicted coordinates as reasoning unfolds. Across a diverse set of visual reasoning benchmarks--including SAT-2 and BLINK for spatial reasoning, V*bench for visual search, and ScreenSpot and VisualWebArena for web-based grounding--ViGoRL consistently outperforms both supervised fine-tuning and conventional RL baselines that lack explicit grounding mechanisms. Incorporating multi-turn RL with zoomed-in visual feedback significantly improves ViGoRL's performance on localizing small GUI elements and visual search, achieving 86.4% on V*Bench. Additionally, we find that grounding amplifies other visual behaviors such as region exploration, grounded subgoal setting, and visual verification. Finally, human evaluations show that the model's visual references are not only spatially accurate but also helpful for understanding model reasoning steps. Our results show that visually grounded RL is a strong paradigm for imbuing models with general-purpose visual reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Coordinate grounding in multi-turn RL for visual reasoning gives benchmark gains, but the contribution of grounding versus zooming needs clearer isolation.

ViGoRL is a way to train vision-language models with RL so that each reasoning step has to name specific visual coordinates, and the model can zoom into those areas over multiple turns when it needs more detail. The actual novelty is in combining that explicit coordinate output with a multi-turn RL setup that feeds back zoomed crops. This goes beyond language-only RL or basic fine-tuning by making the visual attention part of the learned policy.

The results look decent: it beats supervised fine-tuning and regular RL on spatial tasks, visual search, and GUI grounding benchmarks. The multi-turn zooming helps a lot on small elements, hitting 86.4% on V*Bench. They also note that this grounding leads to more region exploration and visual verification, and humans say the references are accurate and make the reasoning easier to follow.

The main soft spot is the missing isolation of the grounding effect. The stress-test point holds up based on what's in the abstract: we need to see if keeping the multi-turn zoom schedule but dropping the coordinate anchoring still gives most of the gains. Without that, or without reward ablations, it's not clear how much the explicit grounding is doing versus the iterative refinement. No error bars are mentioned either, which makes the numbers harder to interpret. These are fixable issues though, not core problems with the idea.

This paper is for people working on visual reasoning agents and multimodal RL. Anyone trying to get models to look at the right parts of an image during reasoning would get something out of it. It has enough new technique and empirical movement to deserve a serious referee, even if it needs more controls in revision. I'd recommend sending it out for review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces ViGoRL, a vision-language model trained with reinforcement learning to explicitly anchor each reasoning step to predicted visual coordinates. It proposes a novel multi-turn RL framework enabling dynamic zooming into those coordinates for fine-grained exploration. The approach is evaluated on spatial reasoning (SAT-2, BLINK), visual search (V*Bench, with 86.4% reported), and GUI/web grounding (ScreenSpot, VisualWebArena) benchmarks, where it outperforms supervised fine-tuning and conventional RL baselines lacking explicit grounding. Additional claims include amplification of behaviors such as region exploration and visual verification, supported by human evaluations of spatial accuracy and reasoning helpfulness.

Significance. If the results hold after addressing the isolation of the grounding mechanism, the work would represent a meaningful advance in visual reasoning for VLMs by demonstrating how explicit spatial grounding within RL can improve both performance and interpretability. Credit is due for the diverse benchmark coverage spanning spatial, search, and interactive tasks, as well as the human study confirming that visual references aid understanding of model steps. The multi-turn zooming framework addresses a practical limitation in handling small or detailed visual elements.

major comments (2)

[Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.
[Experiments] (Benchmark tables and ablation studies) The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.
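The error-bar request in the second comment amounts to reporting a mean and a spread over independent training runs. A minimal sketch of that reporting step follows, assuming one accuracy value per random seed. The function names are hypothetical, and any numbers passed in are the caller's own; none come from the paper.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def summarize_runs(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over per-seed accuracies."""
    if len(accuracies) < 2:
        raise ValueError("need at least two runs to estimate a spread")
    return mean(accuracies), stdev(accuracies)


def format_result(accuracies: Sequence[float], digits: int = 1) -> str:
    """Format per-seed accuracies as 'mean ± std (n=runs)'."""
    m, s = summarize_runs(accuracies)
    return f"{m:.{digits}f} ± {s:.{digits}f} (n={len(accuracies)})"
```

With only a handful of seeds the sample standard deviation is itself a noisy estimate, so reporting the number of runs alongside it helps readers judge how much weight the spread can bear.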

minor comments (2)

[Abstract] The abstract references human evaluations showing that visual references are 'spatially accurate' and 'helpful,' but provides no protocol details, participant count, or quantitative metrics for these assessments.
[Method] Notation for the grounding mechanism and the multi-turn reward components could be clarified with explicit equations or pseudocode in the method section to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive comments on our manuscript. We address each of the major comments below and outline the revisions we plan to make to strengthen the paper.

Referee: [Abstract] The central claim that explicit coordinate grounding (rather than multi-turn interaction alone) drives the gains is not supported by a controlled ablation. No experiment is described that retains the identical multi-turn schedule, reward structure, and zoom mechanics while removing the requirement to produce coordinate-anchored reasoning traces. Without this isolation, the 86.4% V*Bench result and outperformance over 'conventional RL baselines' cannot be attributed specifically to grounding.

Authors: We appreciate the referee's point regarding the need for a more controlled ablation to isolate the effect of explicit coordinate grounding. Our conventional RL baselines lack the grounding mechanism but may not perfectly match the multi-turn schedule in all cases. To rigorously address this, we will introduce a new ablation in the revised manuscript that uses the same multi-turn RL framework, reward structure, and zoom mechanics, but without requiring the model to output coordinate-anchored reasoning traces. This will help confirm that the performance gains, including the 86.4% on V*Bench, are attributable to the grounding component.

Revision: yes

Referee: [Experiments] (Benchmark tables and ablation studies) The reported performance numbers lack error bars or standard deviations, and no details are provided on how the grounding term is incorporated into the reward function or on ablations varying reward shaping. These omissions make it impossible to determine the reliability of the improvements or to rule out that other training choices, rather than grounding, are responsible for the observed differences.

Authors: Thank you for highlighting these omissions. We will revise the Experiments section to include error bars and standard deviations for all reported performance numbers, based on multiple training runs with different random seeds. We will also provide a detailed description of how the grounding term is integrated into the overall reward function, including its mathematical formulation and weighting. Furthermore, we will add ablation studies that vary the reward shaping to demonstrate the robustness of our results and to further isolate the contribution of grounding.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training outcomes are independent of inputs

The paper introduces ViGoRL as an RL training procedure that adds explicit coordinate anchoring and multi-turn zoom mechanics to a vision-language model. All performance numbers (e.g., 86.4% on V*Bench) are reported as measured results after training on the described benchmarks and comparing against supervised fine-tuning and standard RL baselines. No equations, uniqueness theorems, or self-citations are used to derive the method or results; the central claims rest on external empirical comparisons that do not reduce to the training inputs by construction. The work is therefore self-contained as an experimental contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities beyond the standard RL setup and the new ViGoRL training loop itself.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: ViGoRL learns to produce spatially grounded reasoning traces... multi-turn RL framework enables the model to dynamically zoom into predicted coordinates

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · reality_from_one_distinction · relation: unclear
Paper passage: We employ MCTS to generate grounded reasoning traces... GRPO

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 12 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs
cs.CV · 2026-05 · unverdicted · novelty 7.0
Proposes an equation-anchored tool-use method for MLLMs that writes the pinhole back-projection equation in Chain-of-Thought and substitutes retrieved camera intrinsics and depths to achieve robustness in 3D object de...

PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
cs.CL · 2026-05 · unverdicted · novelty 6.0
PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.

Uni-Synergy: Bridging Understanding and Generation for Personalized Reasoning via Co-operative Reinforcement Learning
cs.CV · 2026-05 · unverdicted · novelty 6.0
Sync-R1 applies cooperative RL with Sync-GRPO and Dynamic Group Scaling to achieve superior cross-task personalized reasoning in multimodal models on the new UnifyBench++ dataset.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 6.0
Chain-of-Glimpse is a reinforcement learning framework that builds progressive, spatially grounded reasoning traces around task-relevant objects in videos to enable more accurate and interpretable multi-step decisions.

Don't Show Pixels, Show Cues: Unlocking Visual Tool Reasoning in Language Models via Perception Programs
cs.CV · 2026-04 · unverdicted · novelty 6.0
Perception Programs rewrite dense visual tool outputs into language-native summaries, boosting MLLM accuracy by 15-45% absolute on BLINK perception tasks and setting new state-of-the-art results.

Adaptive Chain-of-Focus Reasoning via Dynamic Visual Search and Zooming for Efficient VLMs
cs.CV · 2025-05 · unverdicted · novelty 6.0
Chain-of-Focus enables VLMs to adaptively search and zoom on important image areas via a two-stage SFT and RL pipeline on a custom 3K-sample dataset, yielding 5% gains on the V* benchmark across resolutions from 224 to 4K.

AtlasVA: Self-Evolving Visual Skill Memory for Teacher-Free VLM Agents
cs.CV · 2026-05 · unverdicted · novelty 5.0
AtlasVA organizes VLM agent memory into spatial heatmaps, visual exemplars, and symbolic skills, evolving atlases from trajectories to act as potential-based shaping rewards in teacher-free reinforcement learning.

DRS-GUI: Dynamic Region Search for Training-Free GUI Grounding
cs.AI · 2026-05 · unverdicted · novelty 5.0
DRS-GUI introduces a dynamic region search method with Focus/Shift/Scatter actions and MCTS-based planning that improves GUI grounding accuracy by 14% on ScreenSpot-Pro for both general and GUI-specific MLLMs without ...

Perceptual Flow Network for Visually Grounded Reasoning
cs.CV · 2026-05 · unverdicted · novelty 5.0
PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding
cs.CV · 2026-04 · unverdicted · novelty 5.0
Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning
cs.AI · 2025-09 · unverdicted · novelty 5.0
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gain...

OpenWorldLib: A Unified Codebase and Definition of Advanced World Models
cs.CV · 2026-04 · unverdicted · novelty 4.0
OpenWorldLib offers a standardized codebase and definition for world models that combine perception, interaction, and memory to understand and predict the world.

Reference graph

Works this paper leans on

94 extracted references · 94 canonical work pages · cited by 11 Pith papers · 39 internal anchors

[1]

GPT-4 Technical Report

Openai. gpt-4 technical report.arXiv preprint arxiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[2]

Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35: 23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in Neural Information Processing Systems, 35: 23716–23736, 2022

work page 2022
[3]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[4]

Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997

Dana H Ballard, Mary M Hayhoe, Polly K Pook, and Rajesh PN Rao. Deictic codes for the embodiment of cognition.Behavioral and Brain Sciences, 20(4):723–742, 1997

work page 1997
[5]

Visual serial processing deficits explain divergences in human and vlm reasoning.arXiv preprint arXiv:2509.25142, 2025

Nicholas Budny, Kia Ghods, Declan Campbell, Raja Marjieh, Amogh Joshi, Sreejan Kumar, Jonathan D Cohen, Taylor W Webb, and Thomas L Griffiths. Visual serial processing deficits explain divergences in human and vlm reasoning.arXiv preprint arXiv:2509.25142, 2025

work page arXiv 2025
[6]

Under- standing the limits of vision language models through the lens of the binding problem.Advances in Neural Information Processing Systems, 37:113436–113460, 2024

Declan Campbell, Sunayana Rane, Tyler Giallanza, Camillo Nicolò De Sabbata, Kia Ghods, Amogh Joshi, Alexander Ku, Steven Frankland, Tom Griffiths, Jonathan D Cohen, et al. Under- standing the limits of vision language models through the lens of the binding problem.Advances in Neural Information Processing Systems, 37:113436–113460, 2024

work page 2024
[7]

PaLI: A Jointly-Scaled Multilingual Language-Image Model

Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, Alexander Kolesnikov, Joan Puigcerver, Nan Ding, Keran Rong, Hassan Akbari, Gaurav Mishra, Linting Xue, Ashish Thapliyal, James Bradbury, Weicheng Kuo, Mojtaba Seyedhosseini, Chao Jia, Burcu Karagol Ayan, Carl...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents

Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. Seeclick: Harnessing gui grounding for advanced visual gui agents.arXiv preprint arXiv:2401.10935, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. URL https://arxiv.org/abs/2305. 06500

work page 2023
[10]

Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Vision-Language Models

Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art multimodal models.arXiv preprint arXiv:2409.17146, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

Attention over learned object embeddings enables complex visual reasoning

Zhengyuan Ding, Yuwei Chen, Yichong Xu, Zhe Wang, Xintao Han, Dong Yu, and Zhou Yu. Attention over learned object embeddings enables complex visual reasoning. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[12]

Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025

Yuhao Dong, Zuyan Liu, Hai-Long Sun, Jingkang Yang, Winston Hu, Yongming Rao, and Ziwei Liu. Insight-v: Exploring long-chain visual reasoning with multimodal large language models, 2025. URLhttps://arxiv.org/abs/2411.14432

work page arXiv 2025
[13]

BLINK: Multimodal Large Language Models Can See but Not Perceive

Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Refocus: Visual editing as a chain of thought for structured image understanding.arXiv preprint arXiv:2501.05452,

Xingyu Fu, Minqian Liu, Zhengyuan Yang, John Corring, Yijuan Lu, Jianwei Yang, Dan Roth, Dinei Florencio, and Cha Zhang. Refocus: Visual editing as a chain of thought for structured image understanding, 2025. URLhttps://arxiv.org/abs/2501.05452. 12

work page arXiv 2025
[15]

Cognitive Behaviors that Enable Self-Improving Reasoners, or, Four Habits of Highly Effective STaRs

Kanishk Gandhi, Ayush Chakravarthy, Anikait Singh, Nathan Lile, and Noah D Goodman. Cognitive behaviors that enable self-improving reasoners, or, four habits of highly effective stars.arXiv preprint arXiv:2503.01307, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[16]

Grounded decoding with visual descriptions reduces hallucination in large vision-language models

Sarthak Ghosh, Ben Lee, Jean-Baptiste Alayrac, Xuhong Zhai, Christoph Feichtenhofer, Joao Carreira, and Ishan Misra. Grounded decoding with visual descriptions reduces hallucination in large vision-language models. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024
[17]

Navigating the digital world as humans do: Universal visual grounding for GUI agents

Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=kxnoqaisCT

work page 2025
[18]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Robust compositional visual reasoning via language-guided neural module networks

Arjun Gupta, Xi Victoria Lin, Chunyuan Zhang, Michel Galley, Jianfeng Gao, and Car- los Guestrin Ferrer. Robust compositional visual reasoning via language-guided neural module networks. InAdvances in Neural Information Processing Systems (NeurIPS), 2021

work page 2021
[21]

Visual programming: Compositional visual reason- ing without training

Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training, 2022. URLhttps://arxiv.org/abs/2211.11559

work page arXiv 2022
[22]

The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990

Stevan Harnad. The symbol grounding problem.Physica D: Nonlinear Phenomena, 42(1-3): 335–346, 1990

work page 1990
[23]

Cogagent: A visual language model for gui agents

Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14281–14290, 2024

work page 2024
[24]

Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019

David Hoppe and Constantin A Rothkopf. Multi-step planning of eye movements in visual search.Scientific reports, 9(1):144, 2019

work page 2019
[26]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

work page arXiv 2024
[27]

Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024. URL https://arxiv.org/abs/ 2312.03052

work page arXiv 2024
[28]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[29]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 13

work page 2021
[30]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks.arXiv preprint arXiv:2401.13649, 2024

[32]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205.11916

[33]

Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf. Preprint

[34]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2504.06958

[35]

VoCoT: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. VoCoT: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919

[36]

Showui: One vision-language-action model for gui visual agent

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent. arXiv preprint arXiv:2411.17465, 2024

[37]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

[38]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https://arxiv.org/abs/2503.20783

[39]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning, 2025. URL https://arxiv.org/abs/2503.01785

[41]

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025

[42]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URL https://arxiv.org/abs/2501.19393

[43]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, :, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin,...

[44]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022

[45]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. URL https://arxiv.org/abs/2408.07199

[46]

Cogcom: Compositional visual reasoning with chain-of-manipulations

Jinyi Qi, Tao Zhang, Rui Chen, Xiaoxue Li, Yizhou Zhang, and Kai-Wei Chang. Cogcom: Compositional visual reasoning with chain-of-manipulations. In International Conference on Learning Representations (ICLR), 2025

[47]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025

[48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021

[49]

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025. URL https://arxiv.org/abs/2407.06581

[50]

Sat: Spatial aptitude training for multimodal language models

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025. URL https://arxiv.org/abs/2412.07755

[51]

Vlm agents generate their own memories: Distilling experience into embodied programs of thought

Gabriel Herbert Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

[52]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300

[53]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/abs/2504.07615

[54]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024

[55]

RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. To appear

[56]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023

[57]

Reason-rft: Reinforcement fine-tuning for visual reasoning

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning, 2025. URL https://arxiv.org/abs/2503.20752

[58]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530

[60]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...

[61]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, et al. Kimi-VL technical report, 2025. URL https://arxiv.org/abs/2504.07491

[62]

Winoground: Probing vision and language models for visio-linguistic compositionality, 2022

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality, 2022. URL https://arxiv.org/abs/2204.03162

[63]

A feature-integration theory of attention

Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980

[64]

Visual routines

Shimon Ullman. Visual routines. Cognition, 18(1-3):97–159, 1984

[65]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[66]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL https://arxiv.org/abs/...

[67]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022

[68]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671

[69]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023

[70]

Thinking llms: General instruction following with thought generation, 2024

Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Thinking llms: General instruction following with thought generation, 2024. URL https://arxiv.org/abs/2410.10630

[71]

Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models. URL https://arxiv.org/abs/2404.03622

[73]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024

[74]

Grok-1.5 vision preview

xAI. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2025-05-21

[75]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. URL https://arxiv.org/abs/2411.10440

[76]

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024

[77]

Active sensing in the categorization of visual patterns

Scott Cheng-Hsin Yang, Mate Lengyel, and Daniel M Wolpert. Active sensing in the categorization of visual patterns. eLife, 5:e12215, 2016

[78]

Theoretical perspectives on active sensing

Scott Cheng-Hsin Yang, Daniel M Wolpert, and Máté Lengyel. Theoretical perspectives on active sensing. Current Opinion in Behavioral Sciences, 11:100–108, 2016

[79]

Mm-react: Prompting chatgpt for multimodal reasoning and action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. 2023

[80]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024. URL https://arxiv.org/abs/2412.18319

[81]

Eye Movements and Vision

Alfred L. Yarbus. Eye Movements and Vision. Springer, 1967

[82]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. URL https://arxiv.org/abs/2502.03373

[83]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476


[25] [26]

Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models, 2024. URLhttps://arxiv.org/abs/2406.09403

work page arXiv 2024

[26] [27]

Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024

Yushi Hu, Otilia Stretcu, Chun-Ta Lu, Krishnamurthy Viswanathan, Kenji Hata, Enming Luo, Ranjay Krishna, and Ariel Fuxman. Visual program distillation: Distilling tools and programmatic reasoning into vision-language models, 2024. URL https://arxiv.org/abs/ 2312.03052

work page arXiv 2024

[27] [28]

OpenAI o1 System Card

Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [29]

Scaling up visual and vision-language representation learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun- Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. InInternational Conference on Machine Learning, pages 4904–4916. PMLR, 2021. 13

work page 2021

[29] [30]

Segment Anything

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything.arXiv preprint arXiv:2304.02643, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[30] [31]

VisualWebArena: Evaluating Multimodal Agents on Realistic Visual Web Tasks

Jing Yu Koh, Robert Lo, Lawrence Jang, Vikram Duvvur, Ming Chong Lim, Po-Yu Huang, Graham Neubig, Shuyan Zhou, Ruslan Salakhutdinov, and Daniel Fried. Visualwebarena: Evaluating multimodal agents on realistic visual web tasks.arXiv preprint arXiv:2401.13649, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[31] [32]

Large language models are zero-shot reasoners, 2023

Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners, 2023. URL https://arxiv.org/abs/2205. 11916

work page 2023

[32] [33]

Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025

Kaixin Li, Ziyang Meng, Hongzhan Lin, Ziyang Luo, Yuchen Tian, Jing Ma, Zhiyong Huang, and Tat-Seng Chua. Screenspot-pro: Gui grounding for professional high-resolution computer use, 2025. URL https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf. Preprint

work page 2025

[33] [34]

VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning, 2025. URLhttps://arxiv.org/abs/2504.06958

work page internal anchor Pith review Pith/arXiv arXiv 2025

[34] [35]

V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025

Zejun Li, Ruipu Luo, Jiwen Zhang, Minghui Qiu, Xuanjing Huang, and Zhongyu Wei. V ocot: Unleashing visually grounded multi-step reasoning in large multi-modal models, 2025. URL https://arxiv.org/abs/2405.16919

work page arXiv 2025

[35] [36]

Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

Kevin Qinghong Lin, Linjie Li, Difei Gao, Zhengyuan Yang, Shiwei Wu, Zechen Bai, Weixian Lei, Lijuan Wang, and Mike Zheng Shou. Showui: One vision-language-action model for gui visual agent.arXiv preprint arXiv:2411.17465, 2024

work page arXiv 2024

[36] [37]

Improved baselines with visual instruction tuning, 2023

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning, 2023

work page 2023

[37] [38]

Understanding R1-Zero-Like Training: A Critical Perspective

Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, and Min Lin. Understanding r1-zero-like training: A critical perspective, 2025. URL https: //arxiv.org/abs/2503.20783

work page internal anchor Pith review Pith/arXiv arXiv 2025

[38] [39]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning, 2025. URL https://arxiv.org/ abs/2503.01785

work page internal anchor Pith review Pith/arXiv arXiv 2025

[39] [41]

UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

Zhengxi Lu, Yuxiang Chai, Yaxuan Guo, Xi Yin, Liang Liu, Hao Wang, Guanjing Xiong, and Hongsheng Li. Ui-r1: Enhancing action prediction of gui agents by reinforcement learning. arXiv preprint arXiv:2503.21620, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[40] [42]

s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang, Weijia Shi, Xiang Lisa Li, Li Fei-Fei, Hannaneh Hajishirzi, Luke Zettlemoyer, Percy Liang, Emmanuel Candès, and Tatsunori Hashimoto. s1: Simple test-time scaling, 2025. URLhttps://arxiv.org/abs/2501.19393

work page internal anchor Pith review Pith/arXiv arXiv 2025

[41] [43]

Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

NVIDIA, :, Alisson Azzolini, Hannah Brandon, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, Francesco Ferroni, Rama Govindaraju, Jinwei Gu, Siddharth Gururani, Imad El Hanafi, Zekun Hao, Jacob Huffman, Jingyi Jin, Brendan Johnson, Rizwan Khan, George Kurian, Elena Lantz, Nayeon Lee, Zhaoshuo Li, Xuan Li, Tsung-Yi Lin,...

work page internal anchor Pith review Pith/arXiv arXiv 2025

[42] [44]

Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in Neural Information Processing Systems, 35:27730–27744, 2022

work page 2022

[43] [45]

Agent Q: Advanced Reasoning and Learning for Autonomous AI Agents

Pranav Putta, Edmund Mills, Naman Garg, Sumeet Motwani, Chelsea Finn, Divyansh Garg, and Rafael Rafailov. Agent q: Advanced reasoning and learning for autonomous ai agents, 2024. URLhttps://arxiv.org/abs/2408.07199

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [46]

Cogcom: Compositional visual reasoning with chain-of-manipulations

Jinyi Qi, Tao Zhang, Rui Chen, Xiaoxue Li, Yizhou Zhang, and Kai-Wei Chang. Cogcom: Compositional visual reasoning with chain-of-manipulations. InInternational Conference on Learning Representations (ICLR), 2025


[45] [47]

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

Yujia Qin, Yining Ye, Junjie Fang, Haoming Wang, Shihao Liang, Shizuo Tian, Junda Zhang, Jiahao Li, Yunxin Li, Shijue Huang, et al. Ui-tars: Pioneering automated gui interaction with native agents. arXiv preprint arXiv:2501.12326, 2025


[46] [48]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021


[47] [49]

Vision language models are blind

Pooyan Rahmanzadehgervi, Logan Bolton, Mohammad Reza Taesiri, and Anh Totti Nguyen. Vision language models are blind: Failing to translate detailed visual features into words, 2025. URL https://arxiv.org/abs/2407.06581


[48] [50]

Sat: Spatial aptitude training for multimodal language models

Arijit Ray, Jiafei Duan, Ellis Brown, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A. Plummer, Ranjay Krishna, Kuo-Hao Zeng, and Kate Saenko. Sat: Dynamic spatial aptitude training for multimodal language models, 2025. URL https://arxiv.org/abs/2412.07755


[49] [51]

Vlm agents generate their own memories: Distilling experience into embodied programs of thought

Gabriel Herbert Sarch, Lawrence Jang, Michael J Tarr, William W Cohen, Kenneth Marino, and Katerina Fragkiadaki. Vlm agents generate their own memories: Distilling experience into embodied programs of thought. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024


[50] [52]

DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/2402.03300


[51] [53]

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model, 2025. URL https://arxiv.org/abs/2504.07615


[52] [54]

HybridFlow: A Flexible and Efficient RLHF Framework

Guangming Sheng, Chi Zhang, Zilingfeng Ye, Xibin Wu, Wang Zhang, Ru Zhang, Yanghua Peng, Haibin Lin, and Chuan Wu. Hybridflow: A flexible and efficient rlhf framework. arXiv preprint arXiv:2409.19256, 2024


[53] [55]

RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics

Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. RoboSpatial: Teaching spatial understanding to 2D and 3D vision-language models for robotics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025. To appear


[54] [56]

ViperGPT: Visual Inference via Python Execution for Reasoning

Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128, 2023


[55] [57]

Reason-rft: Reinforcement fine-tuning for visual reasoning

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning, 2025. URL https://arxiv.org/abs/2503.20752


[56] [58]

Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, 2024. URL https://arxiv.org/abs/2403.05530


[57] [60]

Kimi Team, Angang Du, Bofei Gao, Bowei Xing, Changjiu Jiang, Cheng Chen, Cheng Li, Chenjun Xiao, Chenzhuang Du, Chonghua Liao, Chuning Tang, Congcong Wang, Dehao Zhang, Enming Yuan, Enzhe Lu, Fengxiang Tang, Flood Sung, Guangda Wei, Guokun Lai, Haiqing Guo, Han Zhu, Hao Ding, Hao Hu, Hao Yang, Hao Zhang, Haotian Yao, Haotian Zhao, Haoyu Lu, Haoze Li, Haoz...


[58] [61]

Kimi Team, Angang Du, Bohong Yin, Bowei Xing, Bowen Qu, et al. Kimi-VL technical report, 2025. URL https://arxiv.org/abs/2504.07491


[59] [62]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality, 2022. URL https://arxiv.org/abs/2204.03162


[60] [63]

A feature-integration theory of attention

Anne M. Treisman and Garry Gelade. A feature-integration theory of attention. Cognitive Psychology, 12(1):97–136, 1980


[61] [64]

Visual routines

Shimon Ullman. Visual routines. Cognition, 18(1-3):97–159, 1984


[62] [65]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024


[63] [66]

RAGEN: Understanding Self-Evolution in LLM Agents via Multi-Turn Reinforcement Learning

Zihan Wang, Kangrui Wang, Qineng Wang, Pingyue Zhang, Linjie Li, Zhengyuan Yang, Xing Jin, Kefan Yu, Minh Nhat Nguyen, Licheng Liu, Eli Gottlieb, Yiping Lu, Kyunghyun Cho, Jiajun Wu, Li Fei-Fei, Lijuan Wang, Yejin Choi, and Manling Li. Ragen: Understanding self-evolution in llm agents via multi-turn reinforcement learning, 2025. URL https://arxiv.org/abs/...


[64] [67]

Chain-of-Thought Prompting Elicits Reasoning in Large Language Models

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed Chi, Quoc Le, and Denny Zhou. Chain of thought prompting elicits reasoning in large language models. arXiv preprint arXiv:2201.11903, 2022


[65] [68]

Visual ChatGPT: Talking, Drawing and Editing with Visual Foundation Models

Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023. URL https://arxiv.org/abs/2303.04671


[66] [69]

V*: Guided visual search as a core mechanism in multimodal llms

Penghao Wu and Saining Xie. V*: Guided visual search as a core mechanism in multimodal llms. arXiv preprint arXiv:2312.14135, 2023


[67] [70]

Thinking llms: General instruction following with thought generation

Tianhao Wu, Janice Lan, Weizhe Yuan, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Thinking llms: General instruction following with thought generation, 2024. URL https://arxiv.org/abs/2410.10630


[68] [71]

Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models

Wenshan Wu, Shaoguang Mao, Yadong Zhang, Yan Xia, Li Dong, Lei Cui, and Furu Wei. Mind’s eye of llms: Visualization-of-thought elicits spatial reasoning in large language models. URL https://arxiv.org/abs/2404.03622

[70] [73]

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

Zhiyong Wu, Zhenyu Wu, Fangzhi Xu, Yian Wang, Qiushi Sun, Chengyou Jia, Kanzhi Cheng, Zichen Ding, Liheng Chen, Paul Pu Liang, et al. Os-atlas: A foundation action model for generalist gui agents. arXiv preprint arXiv:2410.23218, 2024


[71] [74]

Grok-1.5 vision preview

xAI. Grok-1.5 vision preview. https://x.ai/blog/grok-1.5v, 2024. Accessed: 2025-05-21


[72] [75]

LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

Guowei Xu, Peng Jin, Hao Li, Yibing Song, Lichao Sun, and Li Yuan. Llava-cot: Let vision language models reason step-by-step, 2025. URL https://arxiv.org/abs/2411.10440


[73] [76]

Thinking in Space: How Multimodal Large Language Models See, Remember, and Recall Spaces

Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. arXiv preprint arXiv:2412.14171, 2024


[74] [77]

Active sensing in the categorization of visual patterns

Scott Cheng-Hsin Yang, Máté Lengyel, and Daniel M Wolpert. Active sensing in the categorization of visual patterns. eLife, 5:e12215, 2016


[75] [78]

Theoretical perspectives on active sensing

Scott Cheng-Hsin Yang, Daniel M Wolpert, and Máté Lengyel. Theoretical perspectives on active sensing. Current opinion in behavioral sciences, 11:100–108, 2016


[76] [79]

Mm-react: Prompting chatgpt for multimodal reasoning and action

Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. Mm-react: Prompting chatgpt for multimodal reasoning and action. 2023


[77] [80]

Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search

Huanjin Yao, Jiaxing Huang, Wenhao Wu, Jingyi Zhang, Yibo Wang, Shunyu Liu, Yingjie Wang, Yuxin Song, Haocheng Feng, Li Shen, and Dacheng Tao. Mulberry: Empowering mllm with o1-like reasoning and reflection via collective monte carlo tree search, 2024. URL https://arxiv.org/abs/2412.18319


[78] [81]

Eye Movements and Vision

Alfred L. Yarbus. Eye Movements and Vision. Springer, 1967


[79] [82]

Demystifying Long Chain-of-Thought Reasoning in LLMs

Edward Yeo, Yuxuan Tong, Morry Niu, Graham Neubig, and Xiang Yue. Demystifying long chain-of-thought reasoning in llms, 2025. URL https://arxiv.org/abs/2502.03373


[80] [83]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale, 2025. URL https://arxiv.org/abs/2503.14476
