
arxiv: 2503.05132 · v2 · pith:I5D2TXP4new · submitted 2025-03-07 · 💻 cs.AI · cs.CV · cs.LG

R1-Zero's "Aha Moment" in Visual Reasoning on a 2B Non-SFT Model

Pith reviewed 2026-05-19 07:09 UTC · model grok-4.3

classification 💻 cs.AI · cs.CV · cs.LG
keywords visual reasoning, reinforcement learning, emergent reasoning, multimodal models, self-reflection, non-SFT model, aha moment, CVBench

The pith

Reinforcement learning on a non-SFT 2B vision-language model produces self-reflective visual reasoning and large accuracy gains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the self-reflection and response-length growth seen in text-only R1 models can be reproduced in visual reasoning by running reinforcement learning directly on a 2B base model instead of an instruct-tuned one. Using Qwen2-VL-2B and the SAT dataset, the resulting model reaches 59.47 percent accuracy on CVBench, roughly 30 points above the starting checkpoint and about 2 points above the supervised fine-tuning runs. The work also reports that the same RL recipe on instruct models yields only shallow reasoning paths and that adding a simple length bonus does not help.

Core claim

Applying reinforcement learning with rule-based rewards directly to the base Qwen2-VL-2B model on the SAT dataset elicits emergent self-reflection, longer responses, and a jump to 59.47 percent accuracy on CVBench, beating the base model by about 30 percent and supervised fine-tuning runs by about 2 percent.

What carries the argument

Reinforcement learning with simple rule-based incentives applied straight to the non-SFT Qwen2-VL-2B checkpoint on the SAT dataset.
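
To make "simple rule-based incentives" concrete, here is a minimal sketch of what such a reward could look like. It is an illustration only: the `<think>`/`<answer>` tag layout, the function names, and the equal weighting of the two terms are assumptions, not the paper's confirmed formulation.

```python
import re

# Assumed response layout: <think>...</think><answer>...</answer>.
# The paper's actual prompt template and tags may differ.
THINK_ANSWER = re.compile(
    r"^<think>(.+?)</think>\s*<answer>(.+?)</answer>$", re.DOTALL
)


def format_reward(response: str) -> float:
    """Return 1.0 if the response follows the assumed tag layout, else 0.0."""
    return 1.0 if THINK_ANSWER.match(response.strip()) else 0.0


def accuracy_reward(response: str, gold: str) -> float:
    """Return 1.0 if the extracted answer equals the gold label, else 0.0."""
    match = THINK_ANSWER.match(response.strip())
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == gold.strip().lower() else 0.0


def rule_based_reward(response: str, gold: str) -> float:
    """Sum of the two rule checks (hypothetical equal weighting)."""
    return format_reward(response) + accuracy_reward(response, gold)
```

Both terms are computed by string rules alone, so no learned reward model is involved; that is the property the phrase "rule-based" refers to.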

If this is right

  • CVBench accuracy reaches 59.47 percent, an improvement of roughly 30 percent over the base model.
  • Self-reflective reasoning trajectories and longer responses appear during the RL training run.
  • RL applied to already-instruct-tuned models produces only trivial reasoning chains.
  • Adding a naive length reward does not reliably elicit genuine reasoning.
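
As an illustration of the last point, a naive length reward might take the form sketched below. The function name, the whitespace tokenization, and the `scale` and `cap` values are hypothetical; the paper's exact length term is not given here. The sketch only shows that such a bonus depends on token count, not on content.

```python
def naive_length_reward(response: str, scale: float = 0.001, cap: float = 1.0) -> float:
    """Hypothetical length bonus: linear in whitespace-token count, capped at `cap`."""
    return min(cap, scale * len(response.split()))


# Padding with filler raises the bonus without adding any reasoning.
short = "The mug is closer. Answer: A"
padded = short + " hmm" * 200
print(naive_length_reward(short), naive_length_reward(padded))
```

Because the bonus is blind to what the extra tokens say, a policy can raise it with filler, which is one plausible reading of why the authors found this kind of reward ineffective.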

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Base models that have never seen supervised fine-tuning may keep more capacity for emergent complex behavior under RL.
  • The same direct-RL recipe could be tested on other small vision-language models to see whether the aha moment generalizes.
  • Dataset choice may interact with the absence of SFT in ways that make self-reflection easier to surface.

Load-bearing premise

The observed accuracy rise and self-reflective trajectories are produced by the reinforcement learning process rather than by the particular SAT dataset, random seeds, or other implementation choices.

What would settle it

Re-running the identical RL procedure on the same base model but with a fresh random seed or a different training set that yields neither self-reflection nor the reported accuracy increase would falsify the claim.

Original abstract

Recently DeepSeek R1 demonstrated how reinforcement learning with simple rule-based incentives can enable autonomous development of complex reasoning in large language models, characterized by the "aha moment", in which the model manifests self-reflection and increased response length during training. However, attempts to extend this success to multimodal reasoning often failed to reproduce these key characteristics. In this report, we present the first successful replication of these emergent characteristics for multimodal reasoning on only a non-SFT 2B model. Starting with Qwen2-VL-2B and applying reinforcement learning directly on the SAT dataset, our model achieves 59.47% accuracy on CVBench, outperforming the base model by approximately ~30% and exceeding both SFT settings by ~2%. In addition, we share our failed attempts and insights in attempting to achieve R1-like reasoning using RL with instruct models, aiming to shed light on the challenges involved. Our key observations include: (1) applying RL on instruct models often results in trivial reasoning trajectories, and (2) naive length rewards are ineffective in eliciting reasoning capabilities. The project code is available at https://github.com/turningpoint-ai/VisualThinker-R1-Zero

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims the first successful replication of DeepSeek R1's emergent 'aha moment' (self-reflection and increased response length) in multimodal visual reasoning. Starting from the non-SFT Qwen2-VL-2B model and applying RL with rule-based rewards directly on the SAT dataset, the resulting model reaches 59.47% accuracy on CVBench (approximately 30% above the base model and 2% above an SFT baseline). The authors also document failed attempts when applying the same RL procedure to instruct-tuned models and release the training code.

Significance. If the central empirical findings hold, the work shows that R1-style reasoning behaviors can emerge in small (2B) multimodal models without prior SFT, lowering the resource barrier for visual reasoning systems. The explicit release of code at https://github.com/turningpoint-ai/VisualThinker-R1-Zero is a clear strength that enables direct reproduction and extension.

major comments (2)
  1. [Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.
  2. [Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'approximately ~30%' contains a redundant symbol; 'approximately 30%' is sufficient.
  2. [Method] The manuscript would benefit from a short appendix table listing the exact reward formulation, learning-rate schedule, and number of training steps so that the RL setup can be reproduced without inspecting the GitHub repository.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the attribution of emergent reasoning behaviors to the RL procedure. We address each major point below and will incorporate the suggested analyses in the revised manuscript to strengthen the central claims.

  1. Referee: [Results] Results section: The manuscript reports final accuracy and selected self-reflective trajectories after RL on the SAT dataset, yet provides no ablation that holds the SAT data fixed while changing the training objective (e.g., SFT on the identical SAT examples or RL with a non-reasoning reward). This omission leaves open the possibility that longer responses and self-reflection are driven by dataset statistics rather than the RL dynamics, directly weakening the attribution required by the central claim.

    Authors: We agree that additional controlled ablations are necessary to isolate the contribution of RL dynamics from dataset effects. In the revised manuscript we will add (i) an SFT baseline trained on the identical SAT examples used for RL and (ii) an RL run that employs a reward based solely on final-answer correctness without length or reflection incentives. These results will be presented alongside the original findings to better support the attribution of the observed 'aha moment' to the RL objective. revision: yes

  2. Referee: [Experiments] Experiments / Training procedure: No quantitative tracking of the 'aha moment' is supplied (e.g., plots of average response length or frequency of reflective phrases across training steps, with error bars or multiple seeds). The claim of emergent behavior therefore rests on qualitative trajectory examples whose reproducibility cannot be assessed from the reported data.

    Authors: We acknowledge the value of quantitative evidence for reproducibility. The revised manuscript will include plots tracking average response length and the frequency of reflective phrases (such as 'wait' or 'reconsider') over training steps. Where computational resources permit, we will report results across multiple random seeds with error bars to allow assessment of the stability of the emergent behavior. revision: yes
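
The quantitative tracking discussed in this exchange could be computed along the following lines. This is a minimal sketch under stated assumptions: the helper names are hypothetical, length is approximated by whitespace tokens, and the reflective-phrase list contains only the two examples named above ('wait' and 'reconsider').

```python
import re
from statistics import mean, stdev

# Only the two phrases mentioned in the rebuttal; a real list would be longer.
REFLECTIVE = re.compile(r"\b(wait|reconsider)\b", re.IGNORECASE)


def step_metrics(responses: list[str]) -> dict[str, float]:
    """Mean response length and share of reflective responses at one training step."""
    lengths = [len(r.split()) for r in responses]
    reflective = [1.0 if REFLECTIVE.search(r) else 0.0 for r in responses]
    return {"mean_length": mean(lengths), "reflective_rate": mean(reflective)}


def across_seeds(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of one metric over several seeds."""
    spread = stdev(values) if len(values) > 1 else 0.0
    return mean(values), spread
```

Logging `step_metrics` at each checkpoint and passing the per-seed values to `across_seeds` would give the curves and error bars the referee asks for.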

Circularity Check

0 steps flagged

Empirical replication study with no definitional or self-referential derivations


The paper reports direct experimental outcomes from applying reinforcement learning to the Qwen2-VL-2B base model on the SAT dataset, including measured accuracy of 59.47% on CVBench, response length increases, and selected trajectories. No equations, parameter fits presented as predictions, ansatzes, or mathematical derivations appear in the text. Central claims rest on observable benchmark results and code release rather than self-citations or reductions to inputs by construction. The work is self-contained as an empirical replication attempt, with any attribution questions addressable via external controls rather than internal definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the RL process on SAT data is the direct cause of the emergent behaviors and that the base model plus reward design are sufficient to produce them.

axioms (1)
  • Domain assumption: The SAT dataset supplies training signals that elicit genuine visual reasoning rather than superficial patterns.
    Used as the sole training corpus for the RL stage.

pith-pipeline@v0.9.0 · 5768 in / 1122 out tokens · 32813 ms · 2026-05-19T07:09:55.100868+00:00 · methodology


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Enhancing MLLM Spatial Understanding via Active 3D Scene Exploration for Multi-Perspective Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    A training-free Visual Chain-of-Thought framework reconstructs high-fidelity 3D meshes from single images and iteratively synthesizes optimal novel views to enhance MLLM spatial comprehension on benchmarks like 3DSRBench.

  2. Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

    cs.AI 2026-01 unverdicted novelty 7.0

    Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.

  3. Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    A masking-based think-answer distillation method for VLMs that selectively hides reasoning prefixes and uses self-paced scheduling to improve visual anchoring and benchmark performance.

  4. Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    A new distillation method uses token-wise salient reasoning-prefix masking and self-paced scheduling to anchor student VLM thinking on visual inputs, outperforming prior distillation approaches on multimodal reasoning...

  5. Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

    cs.CV 2026-05 unverdicted novelty 6.0

    A reasoning-prefix masking strategy during VLM distillation encourages students to anchor their thinking on visual evidence, yielding better multimodal reasoning than prior distillation baselines.

  6. Confidence-Aware Alignment Makes Reasoning LLMs More Reliable

    cs.AI 2026-05 unverdicted novelty 6.0

    CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...

  7. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 6.0

    Affordance Agent Harness is a verification-gated orchestration system that unifies skills via an evidence store, episodic memory priors, an adaptive router, and a self-consistency verifier to improve accuracy-cost tra...

  8. Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification

    cs.CV 2026-04 unverdicted novelty 6.0

    ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...

  9. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  10. Reinforce to Learn, Elect to Reason: A Dual Paradigm for Video Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    RLER trains video-reasoning models with three task-driven RL rewards for evidence production and elects the best answer from a few candidates via evidence consistency scoring, yielding 6.3% average gains on eight benchmarks.

  11. The Landscape of Agentic Reinforcement Learning for LLMs: A Survey

    cs.AI 2025-09 accept novelty 6.0

    Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

  12. Mobile-R1: Towards Interactive Capability for VLM-Based Mobile Agent via Systematic Training

    cs.AI 2025-06 unverdicted novelty 6.0

    Mobile-R1 introduces a hierarchical three-stage curriculum that combines format alignment, verifiable action feedback, and multi-turn environment training to improve exploration and self-correction in VLM-based mobile...

  13. VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    cs.CV 2025-04 unverdicted novelty 6.0

    VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.

  14. UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning

    cs.AI 2025-03 accept novelty 6.0

    UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.

  15. OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles

    cs.CV 2025-03 conditional novelty 6.0

    Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.

  16. Cosmos-Reason1: From Physical Common Sense To Embodied Reasoning

    cs.AI 2025-03 conditional novelty 6.0

    Cosmos-Reason1-7B and 56B models are trained with physical common sense and embodied reasoning ontologies via supervised fine-tuning and reinforcement learning to produce next-step physical actions.

  17. Geo-R1: Improving Few-Shot Geospatial Referring Expression Understanding with Reinforcement Fine-Tuning

    cs.CV 2025-09 unverdicted novelty 5.0

    Geo-R1 uses reasoning-centric reinforcement fine-tuning to improve few-shot performance and generalization in geospatial referring expression understanding over supervised baselines.

  18. Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search

    cs.CV 2025-09 unverdicted novelty 5.0

    Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.

  19. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  20. Affordance Agent Harness: Verification-Gated Skill Orchestration

    cs.RO 2026-05 unverdicted novelty 4.0

    Affordance Agent Harness is a verification-gated orchestration framework that adaptively combines heterogeneous skills, retrieves episodic memories, and uses self-consistency checks to improve affordance grounding acc...

  21. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

  22. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey

    cs.CV 2025-03 unverdicted novelty 2.0

    The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.

Reference graph

Works this paper leans on

20 extracted references · 20 canonical work pages · cited by 19 Pith papers · 2 internal anchors

  1. [1]

    R1-v: Reinforcing super generalization ability in vision-language models with less than $3

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. Accessed: 2025-02-02

  2. [2]

    DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai D...

  3. [3]

    open-r1-multimodal

    EvolvingLMMs-Lab. open-r1-multimodal. https://github.com/EvolvingLMMs-Lab/open-r1-multimodal, 2025. Accessed: March 6, 2025

  4. [4]

    R1-multimodal-journey

    FanqingM. R1-multimodal-journey. https://github.com/FanqingM/R1-Multimodal-Journey, 2025. Accessed: March 6, 2025

  5. [5]

    Blink: Multimodal large language models can see but not perceive

    Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive, 2024

  6. [6]

    Visual sketchpad: Sketching as a visual chain of thought for multimodal language models

    Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37:139348–139379, 2025

  7. [7]

    Visual spatial reasoning, 2023

    Fangyu Liu, Guy Emerson, and Nigel Collier. Visual spatial reasoning, 2023

  8. [8]

    Learning to reason with llms, 2024

    OpenAI. Learning to reason with llms, 2024

  9. [9]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback, 2022

  10. [10]

    Sat: Spatial aptitude training for multimodal language models

    Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 2024

  11. [11]

    Proximal policy optimization algorithms, 2017

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017

  12. [12]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Y Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  13. [13]

    Vlm-r1: A stable and generalizable r1-style large vision-language model

    Haozhan Shen, Zilun Zhang, Qianqian Zhang, Ruochen Xu, and Tiancheng Zhao. Vlm-r1: A stable and generalizable r1-style large vision-language model. https://github.com/om-ai-lab/VLM-R1, 2025. Accessed: 2025-02-15

  14. [14]

    Llamav-o1: Rethinking step-by-step visual reasoning in llms

    Omkar Thawakar, Dinura Dissanayake, Ketan More, Ritesh Thawkar, Ahmed Heakl, Noor Ahsan, Yuhao Li, Mohammed Zumri, Jean Lahoud, Rao Muhammad Anwer, et al. Llamav-o1: Rethinking step-by-step visual reasoning in llms. arXiv preprint arXiv:2501.06186, 2025

  15. [15]

    Cambrian-1: A fully open, vision-centric exploration of multimodal llms

    Peter Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Adithya Jairam Vedagiri IYER, Sai Charitha Akula, Shusheng Yang, Jihan Yang, Manoj Middepogu, Ziteng Wang, et al. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. Advances in Neural Information Processing Systems, 37:87310–87356, 2025

  16. [16]

    Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, and Junyang Lin. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution, 2024

  17. [17]

    Open-r1-video

    Xiaodong Wang and Peixi Peng. Open-r1-video. https://github.com/Wang-Xiaodong1899/Open-R1-Video, 2025

  18. [18]

    LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

    Guowei Xu, Peng Jin, Li Hao, Yibing Song, Lichao Sun, and Li Yuan. Llava-o1: Let vision language models reason step-by-step. arXiv preprint arXiv:2411.10440, 2024

  19. [19]

    Rest-mcts*: Llm self-training via process reward guided tree search, 2024

    Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: Llm self-training via process reward guided tree search, 2024

  20. [20]

    Easyr1: An efficient, scalable, multi-modality rl training framework

    Yaowei Zheng, Junting Lu, Shenzhi Wang, Zhangchi Feng, Dongdong Kuang, and Yuwen Xiong. Easyr1: An efficient, scalable, multi-modality rl training framework. https://github.com/hiyouga/EasyR1, 2025