pith. machine review for the scientific record.

arxiv: 2504.07615 · v2 · submitted 2025-04-10 · 💻 cs.CV · cs.CL

Recognition: 2 Lean theorem links

VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:08 UTC · model grok-4.3

classification: 💻 cs.CV · cs.CL
keywords: vision-language models · reinforcement learning · rule-based rewards · generalization · visual reasoning · object detection

The pith

Reinforcement learning with rule-based rewards on visual ground truths produces VLMs that compete with supervised fine-tuning yet generalize more effectively.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates extending the R1 reinforcement learning approach from language models to vision-language models by relying on rule-based rewards computed from tasks that have clear, deterministic ground-truth answers. This design draws on the fact that many visual understanding problems, such as detection and classification, naturally supply exact correctness signals that support stable reward calculation without learned reward models. If the approach holds, it points to a simpler training route that can match supervised fine-tuning on standard benchmarks while producing models that adapt better to new or varied inputs. The work backs this with experiments plus ablations that track issues like reward hacking, emergent training behaviors, data quality effects, and how results scale with model size.
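To make the mechanism concrete, here is a minimal sketch of what such rule-based rewards can look like, assuming an IoU-graded signal for detection and exact string matching for classification; the function names and conventions are illustrative, not VLM-R1's actual reward code.

```python
# Illustrative rule-based rewards computed directly from ground-truth
# annotations; no learned reward model is involved.

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def detection_reward(pred_box, gt_box):
    """Graded, deterministic reward for localization tasks."""
    return iou(pred_box, gt_box)

def classification_reward(pred_label, gt_label):
    """Exact-match reward: 1 if the answer matches the annotation, else 0."""
    return 1.0 if pred_label.strip().lower() == gt_label.strip().lower() else 0.0
```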

Core claim

VLM-R1 shows that R1-style RL applied to VLMs through rule-based rewards yields competitive performance on visual understanding tasks and surpasses SFT models in generalization ability, accompanied by ablation findings on reward hacking in object detection, the emergence of an OD aha moment, training data quality, and RL scaling across model sizes.

What carries the argument

The rule-based reward mechanism that uses deterministic ground-truth annotations from visual tasks to deliver precise, stable signals for RL training of vision-language models.
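In the R1 lineage this paper follows, such a reward typically drives a GRPO-style update in which each rollout's reward is normalized against a group of responses sampled for the same input. A minimal sketch under that assumption; the group size and reward values are illustrative:

```python
import statistics

def group_relative_advantages(rewards):
    """GRPO-style advantage: A_i = (r_i - mean(r)) / std(r), computed
    within the group of rollouts for one input, with no learned critic."""
    mean_r = statistics.mean(rewards)
    std_r = statistics.pstdev(rewards) or 1.0  # guard a zero-variance group
    return [(r - mean_r) / std_r for r in rewards]

# Four rollouts for one image-query pair, each scored by a deterministic
# rule-based reward (e.g., IoU of the predicted box against ground truth).
rewards = [0.92, 0.40, 0.00, 0.75]
print(group_relative_advantages(rewards))
# Rollouts above the group mean receive positive advantage and are
# reinforced; rollouts below it are pushed down.
```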

Load-bearing premise

A wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations that make them naturally compatible with rule-based reward mechanisms.

What would settle it

Evaluating the RL-trained VLM-R1 on visual tasks without unambiguous deterministic ground truths, such as open-ended subjective image description, to check whether the reported generalization advantage over SFT disappears.
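A hedged sketch of that settling experiment, assuming a human or LLM judge stands in where no deterministic ground truth exists; the model and judge interfaces are hypothetical:

```python
from statistics import mean

def generalization_gap(describe, judge, in_domain, out_of_domain):
    """Judged quality on out-of-distribution images minus in-domain quality.

    describe: callable, image -> open-ended description (model under test)
    judge:    callable, (image, description) -> score in [0, 1]; a human
              rating or LLM judge, since no rule-based reward applies here
    """
    def score(images):
        return mean(judge(im, describe(im)) for im in images)
    return score(out_of_domain) - score(in_domain)

# Run once with the RL-trained checkpoint and once with the matched SFT
# checkpoint. If the RL model's gap no longer beats the SFT model's, the
# reported generalization advantage may be confined to tasks with
# deterministic ground truth.
```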

The original abstract

Recently DeepSeek R1 has shown that reinforcement learning (RL) can substantially improve the reasoning capabilities of Large Language Models (LLMs) through a simple yet effective design. The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations. This property makes them naturally compatible with rule-based reward mechanisms. Motivated by this observation, we investigate the extension of R1-style reinforcement learning to Vision-Language Models (VLMs), aiming to enhance their visual reasoning capabilities. To this end, we develop VLM-R1, a dedicated framework designed to harness RL for improving VLMs' performance on general vision-language tasks. Using this framework, we further explore the feasibility of applying RL to visual domain. Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability. Furthermore, we conduct comprehensive ablation studies that uncover a series of noteworthy insights, including the presence of reward hacking in object detection, the emergence of the "OD aha moment", the impact of training data quality, and the scaling behavior of RL across different model sizes. Through these analyses, we aim to deepen the understanding of how reinforcement learning enhances the capabilities of vision-language models, and we hope our findings and open-source contributions will support continued progress in the vision-language RL community. Our code and model are available at https://github.com/om-ai-lab/VLM-R1

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VLM-R1, a framework extending the R1-style reinforcement learning approach (rule-based rewards on deterministic ground-truth tasks) from LLMs to Vision-Language Models. It claims that RL-trained VLMs achieve competitive performance on visual understanding tasks while outperforming Supervised Fine-Tuning (SFT) in generalization ability. The work includes ablation studies highlighting phenomena such as reward hacking in object detection, an 'OD aha moment', effects of training data quality, and RL scaling across model sizes, with code and models released openly.

Significance. If the central experimental claims hold under fair comparisons, this would represent a meaningful extension of simple RL techniques to the vision-language domain, demonstrating that rule-based rewards can yield better generalization than SFT without needing learned reward models. The open-source release and ablation insights on reward dynamics and scaling would provide practical value to the VLM community for further exploration of RL in visual reasoning.

major comments (2)
  1. [Abstract] The headline claim that the RL-based model 'surpasses Supervised Fine-Tuning (SFT) in generalization ability' is load-bearing for the paper's contribution, yet the abstract provides no details on whether the SFT baseline used identical data volume, task mix, optimization steps, or hyperparameters as the RL run. Without this, the reported gap cannot be confidently attributed to the RL mechanism rather than differences in training regime.
  2. [Abstract] The paper acknowledges reward hacking in object detection and the emergence of an 'OD aha moment', which directly challenges the assumption that 'a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations' suitable for stable rule-based rewards. This internal observation requires quantitative analysis (e.g., failure rates or ambiguity metrics across tasks) to support the robustness of the approach.
minor comments (1)
  1. [Abstract] The abstract is dense and would benefit from explicit mention of the specific visual tasks, metrics (e.g., accuracy, mAP), and baseline models used in the main results to allow readers to assess competitiveness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our paper. We have prepared point-by-point responses to the major comments and will revise the manuscript to address the concerns raised about the abstract's claims and the need for quantitative analysis on reward stability.

Point-by-point responses
  1. Referee: [Abstract] The headline claim that the RL-based model 'surpasses Supervised Fine-Tuning (SFT) in generalization ability' is load-bearing for the paper's contribution, yet the abstract provides no details on whether the SFT baseline used identical data volume, task mix, optimization steps, or hyperparameters as the RL run. Without this, the reported gap cannot be confidently attributed to the RL mechanism rather than differences in training regime.

    Authors: We agree that the abstract lacks sufficient detail on the baseline setup. In the main text, we specify that the SFT baseline was trained on the identical dataset with a comparable number of training steps and the same base model initialization. Hyperparameters were optimized separately for each paradigm but under similar computational budgets. To resolve this, we will revise the abstract to explicitly state that comparisons are conducted under matched data and training conditions, thereby strengthening the attribution to the RL approach. revision: yes

  2. Referee: [Abstract] The paper acknowledges reward hacking in object detection and the emergence of an 'OD aha moment', which directly challenges the assumption that 'a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations' suitable for stable rule-based rewards. This internal observation requires quantitative analysis (e.g., failure rates or ambiguity metrics across tasks) to support the robustness of the approach.

    Authors: The referee raises a valid point regarding the robustness of our assumption. Our ablations do document reward hacking and the 'OD aha moment' as emergent behaviors, which indeed suggest that some visual tasks have inherent ambiguities or opportunities for exploitation. While we provide case studies and qualitative insights in the paper, we concur that quantitative measures are warranted. In the revised manuscript, we will add quantitative analysis including failure rates for reward hacking and metrics for ground-truth ambiguity across the evaluated tasks to better substantiate the stability of the rule-based reward approach. revision: yes
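A minimal sketch of one such failure-rate metric, assuming the optimized training reward can contain gameable components (format, recall) while an independent IoU audit against the ground-truth boxes flags hacked rollouts; thresholds and field names are hypothetical:

```python
def reward_hacking_rate(rollouts, reward_hi=0.8, iou_lo=0.3):
    """Fraction of rollouts earning a high training reward despite poor
    independently audited localization, a signature of reward hacking."""
    if not rollouts:
        return 0.0
    flagged = sum(
        1 for r in rollouts
        if r["reward"] >= reward_hi and r["mean_iou_vs_gt"] <= iou_lo
    )
    return flagged / len(rollouts)

rollouts = [
    {"reward": 0.95, "mean_iou_vs_gt": 0.10},  # hacked: high reward, poor boxes
    {"reward": 0.90, "mean_iou_vs_gt": 0.85},  # genuine success
    {"reward": 0.85, "mean_iou_vs_gt": 0.20},  # hacked
    {"reward": 0.30, "mean_iou_vs_gt": 0.25},  # low reward, not flagged
]
print(reward_hacking_rate(rollouts))  # 0.5
```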

Circularity Check

0 steps flagged

No circularity: purely empirical framework with independent experiments

full rationale

The manuscript describes an empirical extension of R1-style RL to VLMs via rule-based rewards on tasks possessing deterministic ground-truth annotations. It reports new training runs, performance comparisons against SFT baselines, and ablation studies (including reward hacking and 'OD aha moment' observations) without any equations, first-principles derivations, fitted-parameter predictions, or load-bearing self-citations. All central claims rest on externally verifiable experimental outcomes rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests primarily on the domain assumption that visual tasks provide deterministic ground truth suitable for rule-based rewards, with empirical results as validation; no invented entities or heavy free parameters are indicated in the abstract.

axioms (1)
  • domain assumption: A wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations compatible with rule-based reward mechanisms.
    This is the explicit motivation in the abstract for extending R1-style RL to the visual domain.

pith-pipeline@v0.9.0 · 5642 in / 1409 out tokens · 50880 ms · 2026-05-13T01:08:22.649118+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel (tagged unclear)

    Relation between the paper passage and the cited Recognition theorem:

    "The core of R1 lies in its rule-based reward formulation, which leverages tasks with deterministic ground-truth answers to enable precise and stable reward computation. In the visual domain, we similarly observe that a wide range of visual understanding tasks are inherently equipped with well-defined ground-truth annotations."

  • IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced (tagged unclear)

    Relation between the paper passage and the cited Recognition theorem:

    "Experimental results indicate that the RL-based model not only delivers competitive performance on visual understanding tasks but also surpasses Supervised Fine-Tuning (SFT) in generalization ability."

What do these tags mean?
matches: the paper's claim is directly supported by a theorem in the formal canon.
supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: the paper appears to rely on the theorem as machinery.
contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 36 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

    cs.CV 2026-05 unverdicted novelty 7.0

    CurveBench benchmark reveals that even leading VLMs like Gemini 3.1 Pro reach only 71.1% accuracy recovering containment trees on easy nested-curve images and 19.1% on hard versions, while fine-tuning lifts an open 8B...

  2. From Web to Pixels: Bringing Agentic Search into Visual Perception

    cs.CV 2026-05 unverdicted novelty 7.0

    WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

  3. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 7.0

    SciVQR is a new benchmark dataset for evaluating multimodal AI models on complex scientific reasoning tasks across six disciplines, including expert solutions for nearly half the items.

  4. AffectGPT-RL: Revealing Roles of Reinforcement Learning in Open-Vocabulary Emotion Recognition

    cs.HC 2026-05 unverdicted novelty 7.0

    AffectGPT-RL applies reinforcement learning to optimize non-differentiable emotion wheel metrics in open-vocabulary multimodal emotion recognition, yielding performance gains and state-of-the-art results on basic emot...

  5. Pest-Thinker: Learning to Think and Reason like Entomologists via Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    Pest-Thinker is a reinforcement learning framework that improves MLLMs' expert-level reasoning on pest morphology via synthesized CoT trajectories, GRPO optimization, and an LLM-judged feature reward on new benchmarks...

  6. GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

    cs.CV 2026-05 unverdicted novelty 7.0

    GRPO-TTA applies GRPO to test-time visual tuning of vision-language models via group-wise policy optimization on unlabeled class candidates, outperforming prior TTA methods especially under natural distribution shifts.

  7. Improving Vision-language Models with Perception-centric Process Reward Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.

  8. CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

    cs.CV 2026-04 unverdicted novelty 7.0

    CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...

  9. Can Multimodal Large Language Models Truly Understand Small Objects?

    cs.CV 2026-04 unverdicted novelty 7.0

    Current MLLMs show weak performance on small object understanding tasks, but fine-tuning with the new SOU-Train dataset measurably improves their capabilities.

  10. Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning

    cs.CL 2026-04 unverdicted novelty 7.0

    Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...

  11. Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning

    cs.LG 2026-04 unverdicted novelty 7.0

    This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.

  12. Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models

    cs.LG 2026-04 unverdicted novelty 7.0

    RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.

  13. Topo-R1: Detecting Topological Anomalies via Vision-Language Models

    cs.CV 2026-03 unverdicted novelty 7.0

    Topo-R1 fine-tunes a vision-language model using a topology-aware reward and GRPO to detect anomalies such as broken or spurious connections in tubular segmentation masks, outperforming standard VLMs.

  14. Dual-Pathway Circuits of Object Hallucination in Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Vision-language models contain identifiable grounding and hallucination pathways; suppressing the latter reduces object hallucinations by up to 76% while preserving accuracy.

  15. Learning to See What You Need: Gaze Attention for Multimodal Large Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    Gaze Attention groups visual embeddings into selectable regions and dynamically restricts attention to task-relevant ones, matching dense baselines with up to 90% fewer visual KV entries via added context tokens.

  16. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 unverdicted novelty 6.0

    Data curation alone raises VLM accuracy by 11+ points on average, improves reliability and OOD generalization, and achieves near-frontier results at far lower training and inference cost.

  17. 20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

    cs.LG 2026-05 conditional novelty 6.0

    Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

  18. LiteGUI: Distilling Compact GUI Agents with Reinforcement Learning

    cs.AI 2026-05 unverdicted novelty 6.0

    LiteGUI trains 2B/3B-scale GUI agents via SFT-free guided on-policy distillation and multi-solution dual-level GRPO to reach SOTA lightweight performance and compete with larger models.

  19. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 6.0

    DeScore decouples CoT reasoning from reward scoring in video reward models using a two-stage training process to improve generalization and avoid optimization bottlenecks of coupled generative RMs.

  20. RemoteZero: Geospatial Reasoning with Zero Human Annotations

    cs.CV 2026-05 unverdicted novelty 6.0

    RemoteZero replaces coordinate supervision with intrinsic semantic verification to enable box-free GRPO training and self-evolution for geospatial reasoning.

  21. MHPR: Multidimensional Human Perception and Reasoning Benchmark for Large Vision-Language Models

    cs.CV 2026-05 unverdicted novelty 6.0

    MHPR is a multidimensional benchmark for LVLM human-centric perception-reasoning with C-RD, SFT-D, RL-D, T-D data tiers and ACVG pipeline, showing training gains on Qwen2.5-VL-7B to near-parity with larger models.

  22. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A Group Relative Policy Optimization framework with concordance correlation coefficient rewards improves MLLM regression accuracy on long-tailed distributions, especially in medium- and few-shot regimes, without model...

  23. Injecting Distributional Awareness into MLLMs via Reinforcement Learning for Deep Imbalanced Regression

    cs.CL 2026-05 unverdicted novelty 6.0

    A plug-and-play RL method adds batch-level distributional supervision via CCC rewards to reduce regression-to-the-mean in MLLMs on imbalanced regression benchmarks.

  24. See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection

    cs.CV 2026-04 unverdicted novelty 6.0

    ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.

  25. Navigating the Clutter: Waypoint-Based Bi-Level Planning for Multi-Robot Systems

    cs.RO 2026-04 unverdicted novelty 6.0

    Waypoint-based bi-level planning with curriculum RLVR improves multi-robot task success rates in dense-obstacle benchmarks over motion-agnostic and VLA baselines.

  26. One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.

  27. CLEAR: Unlocking Generative Potential for Degraded Image Understanding in Unified Multimodal Models

    cs.CV 2026-04 unverdicted novelty 6.0

    CLEAR uses degradation-aware fine-tuning, a latent representation bridge, and interleaved reinforcement learning to connect generative and reasoning capabilities in multimodal models for better degraded image understanding.

  28. Discovering Failure Modes in Vision-Language Models using RL

    cs.CV 2026-04 unverdicted novelty 6.0

    An RL-based questioner agent adaptively generates queries to discover novel failure modes in VLMs without human intervention.

  29. ToolRL: Reward is All Tool Learning Needs

    cs.LG 2025-04 conditional novelty 6.0

    A principled reward design for tool selection and application in RL-trained LLMs delivers 17% gains over base models and 15% over SFT across benchmarks.

  30. SciVQR: A Multidisciplinary Multimodal Benchmark for Advanced Scientific Reasoning Evaluation

    cs.CV 2026-05 unverdicted novelty 5.0

    SciVQR is a new multimodal benchmark covering 54 scientific subfields that evaluates MLLMs on visual comprehension and multi-step reasoning, revealing significant limitations in leading models.

  31. Think, then Score: Decoupled Reasoning and Scoring for Video Reward Modeling

    cs.CV 2026-05 unverdicted novelty 5.0

    DeScore decouples explicit CoT reasoning from reward regression in video reward models via a two-stage cold-start plus dual-objective RL training pipeline.

  32. Perceptual Flow Network for Visually Grounded Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    PFlowNet decouples perception from reasoning, integrates multi-dimensional rewards with vicinal geometric shaping via variational RL, and reports new SOTA results on V* Bench (90.6%) and MME-RealWorld-lite (67.0%).

  33. Reward Hacking in the Era of Large Models: Mechanisms, Emergent Misalignment, Challenges

    cs.LG 2026-04 unverdicted novelty 5.0

    The paper introduces the Proxy Compression Hypothesis as a unifying framework explaining reward hacking in RLHF as an emergent result of compressing high-dimensional human objectives into proxy reward signals under op...

  34. Act Wisely: Cultivating Meta-Cognitive Tool Use in Agentic Multimodal Models

    cs.CV 2026-04 unverdicted novelty 5.0

    HDPO reframes tool efficiency as a conditional objective within accurate trajectories, enabling Metis to reduce tool invocations by orders of magnitude while raising reasoning accuracy.

  35. OpenVLThinkerV2: A Generalist Multimodal Reasoning Model for Multi-domain Visual Tasks

    cs.CV 2026-04 unverdicted novelty 5.0

    OpenVLThinkerV2 applies a new Gaussian GRPO training objective with response and entropy shaping to outperform prior open-source and proprietary models on 18 visual reasoning benchmarks.

  36. Learning to Focus and Precise Cropping: A Reinforcement Learning Framework with Information Gaps and Grounding Loss for MLLMs

    cs.CV 2026-03 unverdicted novelty 5.0

    A two-stage RL method with information gaps and grounding loss trains MLLMs to focus on and precisely crop relevant image regions, yielding SOTA results on high-resolution VQA benchmarks.
