R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

arXiv: 2503.05132 · v2 · pith:I5D2TXP4 · submitted 2025-03-07 · cs.AI · cs.CV · cs.LG

Hengguang Zhou, Xirui Li, Ruochen Wang, Minhao Cheng, Tianyi Zhou, Cho-Jui Hsieh

Pith reviewed 2026-05-19 07:09 UTC · model grok-4.3

classification: cs.AI · cs.CV · cs.LG

keywords: visual reasoning · reinforcement learning · emergent reasoning · multimodal models · self-reflection · non-SFT model · aha moment · CVBench

The pith

Reinforcement learning on a non-SFT 2B vision-language model produces self-reflective visual reasoning and large accuracy gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the self-reflection and length growth seen in text-only R1 models can be reproduced in visual reasoning by running reinforcement learning directly on a 2B base model instead of an instruct-tuned one. Using Qwen2-VL-2B and the SAT dataset, the resulting model reaches 59.47 percent accuracy on CVBench, roughly 30 points above the starting checkpoint and a couple points above supervised fine-tuning runs. The work also records that the same RL recipe on instruct models yields only shallow reasoning paths and that adding a simple length bonus does not help.

Core claim

Applying reinforcement learning with rule-based rewards directly to the base Qwen2-VL-2B model on the SAT dataset elicits emergent self-reflection, longer responses, and a jump to 59.47 percent accuracy on CVBench, about 30 percentage points above the base model and about 2 points above the supervised fine-tuning runs.

What carries the argument

Reinforcement learning with simple rule-based incentives applied straight to the non-SFT Qwen2-VL-2B checkpoint on the SAT dataset.
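
As a rough illustration of what "simple rule-based incentives" can look like, the sketch below scores a response with a format check and an exact-match answer check, in the style of the rewards described for DeepSeek R1. The tag names `<think>` and `<answer>`, the function names, and the equal weighting are assumptions made for illustration; this page does not state the paper's exact reward formulation.

```python
import re

# Hypothetical response template: reasoning in <think>, final answer in <answer>.
_TEMPLATE = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*", re.DOTALL
)


def format_reward(response: str) -> float:
    """1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if _TEMPLATE.fullmatch(response) else 0.0


def accuracy_reward(response: str, ground_truth: str) -> float:
    """1.0 if the extracted answer matches the label (case-insensitive), else 0.0."""
    match = _TEMPLATE.fullmatch(response)
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0


def rule_based_reward(response: str, ground_truth: str) -> float:
    """Sum of the two rule-based terms (equal weights are an assumption)."""
    return format_reward(response) + accuracy_reward(response, ground_truth)
```

Both terms are computed by fixed rules rather than by a learned reward model, which is the property the phrase "rule-based" refers to.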

If this is right

CVBench accuracy reaches 59.47 percent, roughly 30 percentage points above the base model.
Self-reflective reasoning trajectories and longer responses appear during the RL training run.
RL applied to already-instruct-tuned models produces only trivial reasoning chains.
Adding a naive length reward does not reliably elicit genuine reasoning.
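
One way to see why a naive length reward might fail, sketched under assumptions: if the bonus depends only on token count, a response can collect it by padding, with no change in correctness. The function name, the whitespace tokenization, the 200-token cap, and both example responses are hypothetical and not taken from the paper.

```python
def naive_length_reward(response: str, max_tokens: int = 200) -> float:
    """Bonus that grows linearly with whitespace-token count, capped at 1.0."""
    n_tokens = len(response.split())
    return min(n_tokens / max_tokens, 1.0)


# Two hypothetical responses with the same final answer.
concise = "The mug is left of the laptop. Answer: left"
padded = concise + " well" * 300  # filler that adds no reasoning

# The padded response earns the full bonus although it reasons no better.
print(naive_length_reward(concise), naive_length_reward(padded))
```

Under this kind of reward, the optimizer has a cheaper path to a higher score than learning to reason, which is consistent with the observation above.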

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Base models that have never seen supervised fine-tuning may keep more capacity for emergent complex behavior under RL.
The same direct-RL recipe could be tested on other small vision-language models to see whether the aha moment generalizes.
Dataset choice may interact with the absence of SFT in ways that make self-reflection easier to surface.

Load-bearing premise

The observed accuracy rise and self-reflective trajectories are produced by the reinforcement learning process rather than by the particular SAT dataset, random seeds, or other implementation choices.

What would settle it

Re-running the identical RL procedure on the same base model but with a fresh random seed or a different training set that yields neither self-reflection nor the reported accuracy increase would falsify the claim.
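
A sketch of how such a replication could be summarized, assuming each rerun reports a single CVBench accuracy. The function name and the decision rule (every seed must beat the base model by a chosen margin) are illustrative assumptions, not a procedure from the paper.

```python
from statistics import mean, stdev


def summarize_replication(seed_accuracies, base_accuracy, min_gain):
    """Summarize reruns across seeds against the base model.

    seed_accuracies: accuracies (in percent) from runs that differ only in seed.
    base_accuracy:   accuracy (in percent) of the untrained base checkpoint.
    min_gain:        smallest gain, in percentage points, counted as replication.
    """
    gains = [acc - base_accuracy for acc in seed_accuracies]
    return {
        "mean_accuracy": mean(seed_accuracies),
        # Sample standard deviation needs at least two runs.
        "std_accuracy": stdev(seed_accuracies) if len(seed_accuracies) > 1 else 0.0,
        "replicated": all(g >= min_gain for g in gains),
    }
```

Requiring every seed to clear the margin is a deliberately strict choice; a looser rule based on the mean gain would be equally easy to express.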

Original abstract

Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifest self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately ~30% and exceeding both SFT setting by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models. aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct model often results in trivial reasoning trajectories, and (2) naive length reward are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

This report gets R1-style self-reflection and accuracy gains out of a plain 2B VL model with direct RL on SAT data, but the causal link to the RL step versus the dataset is not isolated.

The main thing to know is that they ran RL directly on the SAT dataset starting from Qwen2-VL-2B and produced longer, self-reflective trajectories that were missing in prior multimodal attempts. Accuracy on CVBench reached 59.47 percent, roughly 30 points above the base model and a couple points above their SFT baseline. They also document that instruct-tuned versions collapse to short trivial outputs and that naive length rewards do not elicit the same behavior. The code release is the most immediately useful part for anyone who wants to inspect the actual trajectories or try the setup themselves.

Referee Report

2 major / 2 minor

Summary. The paper claims the first successful replication of DeepSeek R1's emergent 'aha moment' (self-reflection and increased response length) in multimodal visual reasoning. Starting from the non-SFT Qwen2-VL-2B model and applying RL with rule-based rewards directly on the SAT dataset, the resulting model reaches 59.47% accuracy on CVBench (approximately 30% above the base model and 2% above an SFT baseline). The authors also document failed attempts when applying the same RL procedure to instruct-tuned models and release the training code.

Significance. If the central empirical findings hold, the work shows that R1-style reasoning behaviors can emerge in small (2B) multimodal models without prior SFT, lowering the resource barrier for visual reasoning systems. The explicit release of code at https://github.com/turningpoint-ai/VisualThinker-R1-Zero is a clear strength that enables direct reproduction and extension.

major comments (2)

[Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.
[Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.
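
The kind of quantitative tracking requested here could be computed from logged rollouts along the following lines. This is a minimal sketch: the log format (pairs of training step and response text), the function name, and the phrase list are assumptions; "wait" and "reconsider" are the example phrases named in the rebuttal on this page.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumed markers of self-reflection; extend as needed.
REFLECTIVE = re.compile(r"\b(wait|reconsider)\b", re.IGNORECASE)


def track_aha_metrics(rollouts):
    """Aggregate per-step statistics from (step, response_text) pairs.

    Returns {step: (mean_length_in_words, fraction_with_reflective_phrase)}.
    """
    by_step = defaultdict(list)
    for step, text in rollouts:
        by_step[step].append(text)

    metrics = {}
    for step in sorted(by_step):
        texts = by_step[step]
        mean_len = mean(len(t.split()) for t in texts)
        reflective = mean(1.0 if REFLECTIVE.search(t) else 0.0 for t in texts)
        metrics[step] = (mean_len, reflective)
    return metrics
```

Plotting both values against the training step, and repeating the computation for several seeds, would give the curves and error bars the comment asks for.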

minor comments (2)

[Abstract] Abstract: The phrase 'approximately ~30%' contains a redundant symbol; 'approximately 30%' is sufficient.
[Method] The manuscript would benefit from a short appendix table listing the exact reward formulation, learning-rate schedule, and number of training steps so that the RL setup can be reproduced without inspecting the GitHub repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the attribution of emergent reasoning behaviors to the RL procedure. We address each major point below and will incorporate the suggested analyses in the revised manuscript to strengthen the central claims.

Referee: [Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.

Authors: We agree that additional controlled ablations are necessary to isolate the contribution of RL dynamics from dataset effects. In the revised manuscript we will add (i) an SFT baseline trained on the identical SAT examples used for RL and (ii) an RL run that employs a reward based solely on final-answer correctness without length or reflection incentives. These results will be presented alongside the original findings to better support the attribution of the observed 'aha moment' to the RL objective.
Revision: yes

Referee: [Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.

Authors: We acknowledge the value of quantitative evidence for reproducibility. The revised manuscript will include plots tracking average response length and the frequency of reflective phrases (such as 'wait' or 'reconsider') over training steps. Where computational resources permit, we will report results across multiple random seeds with error bars to allow assessment of the stability of the emergent behavior.
Revision: yes

Circularity Check

0 steps flagged

Empirical replication study with no definitional or self-referential derivations

The paper reports direct experimental outcomes from applying reinforcement learning to the Qwen2-VL-2B base model on the SAT dataset, including measured accuracy of 59.47% on CVBench, response length increases, and selected trajectories. No equations, parameter fits presented as predictions, ansatzes, or mathematical derivations appear in the text. Central claims rest on observable benchmark results and code release rather than self-citations or reductions to inputs by construction. The work is self-contained as an empirical replication attempt, with any attribution questions addressable via external controls rather than internal definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the RL process on SAT data is the direct cause of the emergent behaviors and that the base model plus reward design are sufficient to produce them.

axioms (1)

Domain assumption: The SAT dataset supplies training signals that elicit genuine visual reasoning rather than superficial patterns.
Used as the sole training corpus for the RL stage.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning
cs.CV 2026-04 unverdicted novelty 7.0

A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning
cs.AI 2026-01 unverdicted novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
cs.CV 2026-05 unverdicted novelty 6.0

A masking-based think-answer distillation method for VLMs that selectively hides reasoning prefixes and uses self-paced scheduling to improve visual anchoring and benchmark performance.
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
cs.CV 2026-05 unverdicted novelty 6.0

A new distillation method uses token-wise salient reasoning-prefix masking and self-paced scheduling to anchor student VLM thinking on visual inputs, outperforming prior distillation approaches on multimodal reasoning...
Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation
cs.CV 2026-05 unverdicted novelty 6.0

A reasoning-prefix masking strategy during VLM distillation encourages students to anchor their thinking on visual evidence, yielding better multimodal reasoning than prior distillation baselines.
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
cs.AI 2026-05 unverdicted novelty 6.0

CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 6.0

Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...
Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
cs.CV 2026-04 unverdicted novelty 6.0

ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG 2026-04 unverdicted novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.
The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI 2025-09 accept novelty 6.0

Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training
cs.AI 2025-06 unverdicted novelty 6.0

Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model
cs.CV 2025-04 unverdicted novelty 6.0

VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning
cs.AI 2025-03 accept novelty 6.0

UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles
cs.CV 2025-03 conditional novelty 6.0

Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning
cs.AI 2025-03 conditional novelty 6.0

Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.
Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning
cs.CV 2025-09 unverdicted novelty 5.0

Geo-R1 uses reasoning-centric reinforcement fine-tuning to improve few-shot performance and generalization in geospatial referring expression understanding over supervised baselines.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search
cs.CV 2025-09 unverdicted novelty 5.0

Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
cs.CV 2025-04 unverdicted novelty 5.0

Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
Affordance Agent Harness: Verification-Gated Skill Orchestration
cs.RO 2026-05 unverdicted novelty 4.0

Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
cs.CV 2025-03 unverdicted novelty 2.0

The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 19 Pith papers · 2 internal anchors

[1] Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02.

[2] DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

[3] EvolvingLMMs-Lab. open-r1-multimodal. https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025. Accessed: March 6, 2025.

[4] FanqingM. R1-multimodal-journey. https://github.com/FanqingM/R1-Multimodal-Journey, 2025. Accessed: March 6, 2025.

[5] Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive, 2024.

[6] Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2025.

[7] Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning, 2023.

[8] OpenAI. Learning to reason with llms, 2024.

[9] Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022.

[10] Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024.

[11] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017.

[12] Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.

[13] Haozhan Shen, Zilun Zhang, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. https://github.com/om-ai-lab/VLM-R1, 2025. Accessed: 2025-02-15.

[14] Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025.

[15] Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2025.

[16] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024.

[17] Xiaodong Wang and Peixi Peng. Open-r1-video. https://github.com/Wang-Xiaodong1899/Open-R1-Video, 2025.

[18] Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024. (Indexed by Pith under the title "LLaVA-CoT: Let Vision Language Models Reason Step-by-Step".)

[19] Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search, 2024.

[20] Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025.
