Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

Bohan Jia; Fei Zhao; Shaohui Lin; Shaosheng Cao; Wenxuan Huang; Xu Tang; Yao Hu; Zhe Xu; Zheyu Ye; Zijie Zhai

arXiv: 2503.06749 · v4 · submitted 2025-03-09 · cs.CV · cs.AI · cs.CL · cs.LG

Pith reviewed 2026-05-11 08:53 UTC · model grok-4.3

classification cs.CV · cs.AI · cs.CL · cs.LG

keywords multimodal reasoning · reinforcement learning · chain of thought · vision language models · MathVista benchmark · cold start training · visual math problems

The pith

Automatically built multimodal reasoning data followed by targeted RL training activates complex visual math reasoning in MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that reinforcement learning can elicit advanced reasoning behaviors in multimodal large language models, but only after an initial cold-start phase supplies suitable examples. The authors generate 200,000 chain-of-thought traces by bridging an existing vision-language model with a text reasoning model and applying filters to remove low-quality outputs. This dataset initializes the model, after which a progressive suppression strategy combined with group-relative policy optimization refines reasoning on a smaller set of math problems. If the method works, vision-language models gain the capacity to question, reflect, and solve image-based mathematical tasks without requiring extensive human-annotated reasoning data.

Core claim

We introduce Vision-R1, a multimodal large language model trained first on a 200K automatically constructed multimodal CoT dataset called Vision-R1-cold for initialization, followed by Progressive Thinking Suppression Training using Group Relative Policy Optimization with a hard formatting reward on a 10K multimodal math dataset. This process incentivizes the emergence of complex reasoning capabilities such as questioning and reflection, leading to an average improvement of approximately 6% across multimodal math reasoning benchmarks, with the 7B version achieving 73.5% on MathVista.
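
One way to picture the hard formatting reward named above is as an all-or-nothing score: a response earns credit only when it both follows the required layout and gives the correct final answer. The sketch below is an editorial illustration under that reading, not the authors' implementation. The `<think>` and `<answer>` tags, the function name, and the exact-string answer check are all assumptions made for the example.

```python
import re

# Assumed layout: one reasoning block followed by one answer block.
# The tag names are illustrative; the page does not specify them.
_LAYOUT = re.compile(
    r"\s*<think>(.+?)</think>\s*<answer>(.+?)</answer>\s*",
    re.DOTALL,
)


def hard_format_result_reward(response: str, gold: str) -> float:
    """Return 1.0 only if the layout is valid AND the answer matches, else 0.0."""
    match = _LAYOUT.fullmatch(response)
    if match is None:
        # Any formatting violation forfeits the reward entirely.
        return 0.0
    predicted = match.group(2).strip()
    # Exact string comparison is a simplification; a real checker would
    # likely normalise numbers and mathematical expressions first.
    return 1.0 if predicted == gold.strip() else 0.0
```

Under this reading there is no partial credit for a well-formatted but wrong response, which is what distinguishes a "hard" reward from a sum of separate format and accuracy terms.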

What carries the argument

The two-stage pipeline of cold-start initialization on the automatically generated 200K Vision-R1-cold multimodal CoT dataset followed by Progressive Thinking Suppression Training with GRPO.
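
The core of GRPO is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than against a learned value function. A minimal sketch of that normalisation step follows. It covers only the advantage computation, not the clipped policy update or KL term, and the function name, the `eps` constant, and the use of the population standard deviation are choices made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise the rewards of one prompt's sampled responses.

    Each advantage is (reward - group mean) / (group std + eps), so
    responses are judged only against their own group.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; sample std is another common choice
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one problem, scored with a 0/1 reward.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group receives the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason a 0/1 reward depends on a model that already succeeds some of the time.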

If this is right

Larger-scale RL training with additional multimodal math data produces further accuracy gains, as demonstrated by the 32B and 72B variants reaching 76.4% and 78.2% on MathVista.
Direct application of RL without the preceding cold-start dataset fails to activate complex reasoning patterns in MLLMs.
The 7B model comes within 0.4 percentage points of OpenAI o1 on MathVista while using only automatically generated reasoning data.
Progressive suppression during RL mitigates overthinking and supports learning of correct reasoning paths on visual math problems.
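
The page does not spell out how the progressive suppression in the last point is scheduled. One plausible reading is a staged cap on reasoning length that starts tight and is relaxed as training proceeds. The sketch below illustrates only that idea. The stage boundaries, the token budgets, and both function names are hypothetical values invented for the example, not settings reported here.

```python
from bisect import bisect_right

# Hypothetical schedule: (first training step of the stage, max reasoning tokens).
STAGES = [(0, 4096), (100, 8192), (200, 16384)]


def max_tokens_at(step, stages=STAGES):
    """Return the reasoning-length cap in force at a given training step."""
    starts = [start for start, _ in stages]
    index = bisect_right(starts, step) - 1
    if index < 0:
        raise ValueError("step precedes the first stage")
    return stages[index][1]


def suppress(tokens, step):
    """Truncate a sampled response to the cap of the current stage."""
    return tokens[: max_tokens_at(step)]
```

Early stages would then force short reasoning chains, and later stages would let chains grow once the shorter ones are reliable.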

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same automatic dataset construction approach could be tested on non-math visual reasoning domains such as science diagrams or spatial puzzles.
Biases potentially introduced during automatic filtering might limit generalization to problem types underrepresented in the source models.
Combining the pipeline with longer context windows or additional modalities could extend the range of solvable multimodal tasks.

Load-bearing premise

The 200K multimodal CoT dataset constructed automatically via modality bridging and filtering must be of high enough quality to serve as effective cold-start data without introducing systematic errors or biases that would undermine later RL refinement.

What would settle it

Retraining the base model with RL directly on the 10K math dataset, without the 200K cold-start dataset, would settle it. If questioning and reflection behaviors still emerge and the MathVista gains persist, the initialization step is not required for the reported gains; if they do not, the cold start is doing necessary work.

Original abstract

DeepSeek-R1-Zero has successfully demonstrated the emergence of reasoning capabilities in LLMs purely through Reinforcement Learning (RL). Inspired by this breakthrough, we explore how RL can be utilized to enhance the reasoning capability of MLLMs. However, direct training with RL struggles to activate complex reasoning capabilities such as questioning and reflection in MLLMs, due to the absence of substantial high-quality multimodal reasoning data. To address this issue, we propose the reasoning MLLM, Vision-R1, to improve multimodal reasoning capability. Specifically, we first construct a high-quality multimodal CoT dataset without human annotations by leveraging an existing MLLM and DeepSeek-R1 through modality bridging and data filtering to obtain a 200K multimodal CoT dataset, Vision-R1-cold dataset. It serves as cold-start initialization data for Vision-R1. To mitigate the optimization challenges caused by overthinking after cold start, we propose Progressive Thinking Suppression Training (PTST) strategy and employ Group Relative Policy Optimization (GRPO) with the hard formatting result reward function to gradually refine the model's ability to learn correct and complex reasoning processes on a 10K multimodal math dataset. Comprehensive experiments show our model achieves an average improvement of $\sim$6% across various multimodal math reasoning benchmarks. Vision-R1-7B achieves a 73.5% accuracy on the widely used MathVista benchmark, which is only 0.4% lower than the leading reasoning model, OpenAI O1. Scaling up the amount of multimodal math data in the RL training, Vision-R1-32B and Vison-R1-72B achieves 76.4% and 78.2% MathVista benchmark scores, respectively. The datasets and code will be released in: https://github.com/Osilly/Vision-R1 .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Vision-R1 shows solid benchmark gains on multimodal math via synthetic CoT cold-start plus PTST, but the unverified quality of the 200K dataset leaves the source of the improvement unclear.

The main thing to know is that this paper gets a 7B MLLM to 73.5% on MathVista using a synthetic-data RL pipeline, within 0.4% of o1, with bigger models scaling to 78%. The average lift across benchmarks is around 6% after the full process. That is the concrete result worth noting first.

What is actually new is the modality-bridging construction of the 200K Vision-R1-cold CoT set from an existing MLLM and DeepSeek-R1, followed by the Progressive Thinking Suppression Training step to curb overthinking before GRPO on the 10K math subset. The scaling runs to 32B and 72B and the plan to release data and code are also useful. The work does a reasonable job demonstrating that the RL-for-reasoning approach can be adapted to vision-language models without hand-labeled traces, and the reported numbers are consistent across several multimodal math benchmarks.

The soft spot is the dataset. The central claim depends on the 200K synthetic traces being high-quality cold-start material, yet the paper gives no human validation, error statistics, or ablation that removes the cold-start phase. Without those checks it is difficult to tell whether the gains come from PTST and GRPO or from whatever reasoning quality was already present in the bridged data. The filtering criteria are also not quantified.

This paper is for researchers working on RL-based reasoning in MLLMs, particularly for visual math and science tasks. A reader who wants to try similar pipelines will find the steps described clearly enough to start from.

It deserves peer review because the empirical results and scaling are there and the method is reproducible in principle, even if the data-quality controls need tightening before the attribution is fully convincing. I would send it out.

Referee Report

2 major / 2 minor

Summary. The paper proposes Vision-R1, a multimodal large language model that first constructs a 200K synthetic multimodal chain-of-thought dataset (Vision-R1-cold) via modality bridging between an existing MLLM and DeepSeek-R1 followed by filtering, uses it for cold-start supervised fine-tuning, then applies Progressive Thinking Suppression Training (PTST) and Group Relative Policy Optimization (GRPO) with a hard formatting reward on a 10K multimodal math dataset. It reports an average ~6% improvement across multimodal math reasoning benchmarks, with the 7B variant reaching 73.5% on MathVista (0.4% below OpenAI o1) and larger 32B/72B variants reaching 76.4% and 78.2%.

Significance. If the synthetic cold-start data is shown to be high-quality and the gains are attributable to the RL stage rather than data artifacts, the work would demonstrate a practical, annotation-free route to eliciting complex multimodal reasoning via RL, with clear scaling behavior to larger models. This could meaningfully advance the field by reducing reliance on human-curated reasoning traces for MLLMs.

major comments (2)

[Vision-R1-cold dataset construction] Vision-R1-cold dataset construction (described in the method section following the abstract): No quantitative quality metrics, human validation error rates, or checks for factual accuracy, reasoning depth, or modality mismatches are reported for the 200K traces produced by modality bridging and filtering. This is load-bearing because the central claim attributes the ~6% benchmark gains and near-parity with o1 to the subsequent PTST+GRPO stage; without evidence that the cold-start data is free of systematic biases, downstream numbers alone cannot isolate the contribution of the proposed RL components.
[Experiments] Experiments section (ablation and training details): The manuscript contains no ablation that removes the cold-start phase on the 200K dataset or trains directly with GRPO on the 10K math set from a non-reasoning base model. Such an ablation is required to test whether the reported improvements arise from PTST/GRPO or from artifacts already present in the synthetic CoT data.

minor comments (2)

[Abstract] Abstract: 'Vison-R1-72B' is a typographical error and should read 'Vision-R1-72B'.
[Method] The filtering criteria and exact prompts used for modality bridging are described at a high level but lack sufficient detail (e.g., specific thresholds or example traces) for full reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our methodology and experimental design that require clarification and strengthening. We address each major comment below and commit to revisions that improve the manuscript's rigor without altering its core contributions.

Referee: [Vision-R1-cold dataset construction] Vision-R1-cold dataset construction (described in the method section following the abstract): No quantitative quality metrics, human validation error rates, or checks for factual accuracy, reasoning depth, or modality mismatches are reported for the 200K traces produced by modality bridging and filtering. This is load-bearing because the central claim attributes the ~6% benchmark gains and near-parity with o1 to the subsequent PTST+GRPO stage; without evidence that the cold-start data is free of systematic biases, downstream numbers alone cannot isolate the contribution of the proposed RL components.

Authors: We acknowledge that the original manuscript does not report quantitative quality metrics or human validation results for the Vision-R1-cold dataset. The construction process, detailed in the methods, uses modality bridging between an existing MLLM and DeepSeek-R1 followed by automated filtering for coherence, relevance, and format consistency. While downstream benchmark improvements and scaling behavior provide indirect support for data quality, we agree that explicit validation is necessary to isolate the RL stage's contribution. In the revised manuscript, we will add a dedicated subsection with human evaluation on a 500-sample subset, reporting error rates for factual accuracy, reasoning depth, and modality mismatches, along with inter-annotator agreement. This addition will directly address potential systematic biases. revision: yes
Referee: [Experiments] Experiments section (ablation and training details): The manuscript contains no ablation that removes the cold-start phase on the 200K dataset or trains directly with GRPO on the 10K math set from a non-reasoning base model. Such an ablation is required to test whether the reported improvements arise from PTST/GRPO or from artifacts already present in the synthetic CoT data.

Authors: We agree that an ablation isolating the cold-start phase is valuable for attributing gains specifically to PTST and GRPO. The introduction notes that direct RL on MLLMs without reasoning initialization struggles to activate complex behaviors such as reflection. All reported RL results start from the cold-start model. To address this, the revised manuscript will include a new ablation attempting GRPO directly from the base non-reasoning MLLM on the 10K dataset, with results on training stability and final benchmark performance. This will demonstrate the practical necessity of the cold-start and confirm that the observed ~6% gains stem from the proposed RL components rather than data artifacts alone. revision: yes
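
The first response promises inter-annotator agreement without naming a statistic. For two annotators labelling the same samples, Cohen's kappa is one common choice. The sketch below shows how it could be computed; the function name and the label values are invented for illustration, and nothing here reflects data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each annotator labelled independently at
    # their own marginal rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Invented labels for four traces.
kappa = cohens_kappa(["ok", "ok", "bad", "bad"], ["ok", "bad", "bad", "bad"])
```

Reporting kappa alongside the raw error rates would show whether annotators agree beyond what the base rate of acceptable traces alone would produce.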

Circularity Check

0 steps flagged

No circularity: empirical training pipeline relies on external models and benchmarks

The paper constructs its 200K Vision-R1-cold dataset by applying an existing MLLM plus DeepSeek-R1 via modality bridging and filtering, then performs cold-start followed by PTST + GRPO on a separate 10K math set, and reports accuracy numbers on standard external benchmarks such as MathVista. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the performance deltas are measured outcomes rather than tautological outputs of the input construction. This is the normal case of a self-contained empirical study.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claim depends on the unverified quality of the automatically generated 200K CoT dataset and the effectiveness of the newly proposed PTST schedule; no independent evidence for either is supplied beyond final benchmark numbers.

axioms (2)

[ad hoc to paper] Existing MLLMs and DeepSeek-R1 can generate high-quality multimodal chain-of-thought traces via modality bridging and filtering.
Invoked to justify the 200K Vision-R1-cold dataset used for cold-start initialization.
[ad hoc to paper] Progressive Thinking Suppression Training can mitigate overthinking while preserving reasoning accuracy.
Central to the RL phase on the 10K math dataset.

invented entities (2)

Vision-R1-cold dataset · no independent evidence
purpose: Cold-start initialization data for RL
Synthetically constructed 200K multimodal CoT examples
Progressive Thinking Suppression Training (PTST) · no independent evidence
purpose: Gradual refinement of reasoning length and correctness
New training schedule paired with GRPO

pith-pipeline@v0.9.0 · 5669 in / 1528 out tokens · 75953 ms · 2026-05-11T08:53:51.349360+00:00 · methodology

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Are VLMs Seeing or Just Saying? Uncovering the Illusion of Visual Re-examination
cs.CV 2026-05 unverdicted novelty 8.0

VLMs fail to detect semantically different image swaps up to 60% of the time despite self-reflective statements, with thinking models more vulnerable and attention analysis showing self-reflection does not increase vi...
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images
cs.CV 2026-04 unverdicted novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
DepthAgent: Towards Better Universal Depth Estimation via Sample-wise Expert Selection
cs.CV 2026-05 unverdicted novelty 7.0

A reinforcement-learned vision-language agent adaptively selects and fuses monocular depth experts per sample for better performance across camera geometries.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
CaMo: Camera Motion Grounded Evaluation and Training for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

Proposes Spatial Narrative Score (SNS) evaluation for VLMs' camera motion understanding and introduces CaMo model achieving consistent performance on SNS and direct QA.
From Plans to Pixels: Learning to Plan and Orchestrate for Open-Ended Image Editing
cs.CV 2026-05 unverdicted novelty 7.0

A planner-orchestrator system learns long-horizon image editing by maximizing outcome-based rewards from a vision-language judge and refining plans from successful trajectories.
GeoVista: Visually Grounded Active Perception for Ultra-High-Resolution Remote Sensing Understanding
cs.CV 2026-05 unverdicted novelty 7.0

GeoVista introduces a planning-driven active perception framework with global exploration plans, branch-wise local inspection, and explicit evidence tracking to achieve state-of-the-art results on ultra-high-resolutio...
From Web to Pixels: Bringing Agentic Search into Visual Perception
cs.CV 2026-05 unverdicted novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

A new benchmark reveals MLLMs achieve only 13% or lower accuracy on advanced perspective-conditioned spatial tasks in omnidirectional images, with RL reward shaping raising a 7B model from 31% to 60% in controlled settings.
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 unverdicted novelty 7.0

MLLMs display a large perception-reasoning gap on perspective-conditioned spatial reasoning tasks from omnidirectional images, with sharp accuracy drops on advanced tasks like egocentric rotation, though partial gains...
Beyond Localization: A Comprehensive Diagnosis of Perspective-Conditioned Spatial Reasoning in MLLMs from Omnidirectional Images
cs.CV 2026-05 conditional novelty 7.0

MLLMs exhibit a large perception-reasoning gap on perspective-conditioned spatial reasoning in omnidirectional images, with accuracy falling from 57% on basic direction tasks to under 1% on compositional reasoning, th...
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0

Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
GazeVLM: Active Vision via Internal Attention Control for Multimodal Reasoning
cs.CV 2026-05 unverdicted novelty 7.0

GazeVLM introduces internal gaze tokens that allow VLMs to dynamically suppress irrelevant visual features and simulate foveal attention for improved high-resolution multimodal reasoning.
MIRL: Mutual Information-Guided Reinforcement Learning for Vision-Language Models
cs.CV 2026-05 unverdicted novelty 7.0

MIRL uses mutual information to guide trajectory selection and provide separate rewards for visual perception in RLVR for VLMs, achieving 70.22% average accuracy with 25% fewer full trajectories.
Improving Vision-language Models with Perception-centric Process Reward Models
cs.CV 2026-04 unverdicted novelty 7.0

Perceval is a perception-centric PRM that detects token-level perceptual errors in VLMs, supporting token-advantage RL training and iterative test-time scaling for improved reasoning.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding
cs.CV 2026-04 unverdicted novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench an...
Hybrid Latent Reasoning with Decoupled Policy Optimization
cs.CV 2026-04 unverdicted novelty 7.0

HyLaR with DePO enables effective RL in hybrid discrete-continuous spaces for multimodal models, outperforming prior MLLMs on perception and understanding benchmarks.
Freshness-Aware Prioritized Experience Replay for LLM/VLM Reinforcement Learning
cs.CL 2026-04 unverdicted novelty 7.0

Freshness-Aware PER augments prioritized experience replay with exponential age decay based on effective sample size to enable successful reuse of trajectories in LLM and VLM reinforcement learning, outperforming on-p...
UIPress: Bringing Optical Token Compression to UI-to-Code Generation
cs.CL 2026-04 unverdicted novelty 7.0

UIPress is the first encoder-side learned optical compression method for UI-to-Code that compresses visual tokens to 256, outperforming the uncompressed baseline by 7.5% CLIP score and the best inference-time baseline...
Mosaic: Multimodal Jailbreak against Closed-Source VLMs via Multi-View Ensemble Optimization
cs.CV 2026-04 unverdicted novelty 7.0

Mosaic combines text perturbation, multi-view image optimization, and surrogate model ensembles to reduce reliance on any single open-source model and achieve higher attack success rates on commercial closed-source VLMs.
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
cs.LG 2026-04 unverdicted novelty 7.0

This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
Understanding the Role of Hallucination in Reinforcement Post-Training of Multimodal Reasoning Models
cs.LG 2026-04 unverdicted novelty 7.0

RL post-training on hallucination-forced multimodal data improves reasoning performance and can outperform standard training.
V-Reflection: Transforming MLLMs from Passive Observers to Active Interrogators
cs.CV 2026-03 unverdicted novelty 7.0

V-Reflection introduces a think-then-look mechanism where MLLM latent states actively interrogate visual features via two-stage distillation from a box-guided teacher to a dynamic autoregressive student, narrowing the...
ChartNet: A Million-Scale, High-Quality Multimodal Dataset for Robust Chart Understanding
cs.CV 2026-03 unverdicted novelty 7.0

ChartNet is a million-scale multimodal dataset for chart understanding created via code-guided synthesis spanning 24 chart types with five aligned modalities per sample.
SCP: Spatial Causal Prediction in Video
cs.CV 2026-03 unverdicted novelty 7.0

SCP defines a new benchmark task for predicting spatial causal outcomes beyond direct observation and shows that 23 leading models lag far behind humans on it.
LongVideo-R1: Smart Navigation for Low-cost Long Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

LongVideo-R1 trains a reasoning agent on 33K trajectories to intelligently select informative video clips via iterative refinement and RL, achieving better accuracy-efficiency tradeoffs on long video QA benchmarks.
Visual Para-Thinker: Divide-and-Conquer Reasoning for Visual Comprehension
cs.CV 2026-02 unverdicted novelty 7.0

Visual Para-Thinker is the first parallel reasoning framework for MLLMs that uses visual partitioning strategies, Pa-Attention, and LPRoPE to extend test-time scaling benefits to visual comprehension tasks.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
Forest Before Trees: Latent Superposition for Efficient Visual Reasoning
cs.CL 2026-01 unverdicted novelty 7.0

Laser reformulates visual reasoning via Dynamic Windowed Alignment Learning to maintain latent superposition of global features, delivering 5.03% average gains over Monet and over 97% fewer inference tokens on six benchmarks.
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization
cs.CV 2026-01 unverdicted novelty 7.0

GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
Reasoning Within the Mind: Dynamic Multimodal Interleaving in Latent Space
cs.CV 2025-12 unverdicted novelty 7.0

DMLR performs dynamic visual-textual interleaving in latent space using confidence-guided latent policy gradient optimization and a dynamic visual injection strategy, yielding improved multimodal reasoning on benchmarks.
Asking like Socrates: Socrates helps VLMs understand remote sensing images
cs.CV 2025-11 unverdicted novelty 7.0

RS-EoT uses a SocraticAgent self-play system and two-stage RL to train VLMs for genuine iterative reasoning and visual inspection on remote sensing VQA and grounding tasks, achieving SOTA results.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning
cs.CV 2025-07 conditional novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
VGR: Visual Grounded Reasoning
cs.CV 2025-06 unverdicted novelty 7.0

VGR introduces a visual-grounded reasoning MLLM that detects and replays image regions during inference, achieving gains on visual benchmarks with 30% fewer image tokens than the LLaVA-NeXT-7B baseline.
GRIT: Teaching MLLMs to Think with Images
cs.CV 2025-05 unverdicted novelty 7.0

GRIT introduces a grounded reasoning paradigm for MLLMs where reasoning chains interleave text and bounding boxes, trained via GRPO-GR reinforcement learning on as few as 20 examples without annotations.
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents
cs.CV 2025-04 unverdicted novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using ...
Video-R1: Reinforcing Video Reasoning in MLLMs
cs.CV 2025-03 conditional novelty 7.0

Video-R1 uses temporal-aware RL and mixed datasets to boost video reasoning in MLLMs, with a 7B model reaching 37.1% on VSI-Bench and surpassing GPT-4o.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization
cs.AI 2025-03 conditional novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
RISE: Reliable Improvement in Self-Evolving Vision-Language Models
cs.CV 2026-05 unverdicted novelty 6.0

RISE is a self-evolving framework for VLMs that adds fine-grained alternation, quality supervision, and dynamic balancing to produce reliable gains on seven benchmarks from unlabeled data.
ProCrit: Self-Elicited Multi-Perspective Reasoning with Critic-Guided Revision for Multimodal Sarcasm Detection
cs.MA 2026-05 unverdicted novelty 6.0

ProCrit proposes a Proposal-Critic framework that synthesizes process-level annotations via agentic rollout and uses draft-critique-revise with mutual-refinement RL to improve multimodal sarcasm detection.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...
Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR
cs.AI 2026-05 unverdicted novelty 6.0

POW3R adapts rubric criterion weights via rollout contrast in RLVR to improve mean reward, strict completion rates, and training speed over static rubric aggregation on multimodal and text tasks.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
cs.CV 2026-05 unverdicted novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
Bad Seeing or Bad Thinking? Rewarding Perception for Vision-Language Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

A new RL method called MoCA with Perception Verification rewards perceptual fidelity independently to improve both seeing and thinking in VLMs.
PDCR: Perception-Decomposed Confidence Reward for Vision-Language Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

PDCR improves vision-language reasoning by computing separate normalized confidence advantages for perception steps and reasoning steps after unsupervised decomposition.
Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0

Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
cs.CV 2026-05 unverdicted novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards f...
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
Reinforcing Multimodal Reasoning Against Visual Degradation
cs.CV 2026-05 unverdicted novelty 6.0

ROMA improves MLLM robustness to seen and unseen visual corruptions by +2.3-2.4% over GRPO on seven reasoning benchmarks while matching clean accuracy.
Beyond Thinking: Imagining in 360$^\circ$ for Humanoid Visual Search
cs.CV 2026-05 unverdicted novelty 6.0

Imagining in 360° decouples visual search into a single-step probabilistic semantic layout predictor and an actor, removing the need for multi-turn CoT reasoning and trajectory annotations while improving efficiency i...
Chart-FR1: Visual Focus-Driven Fine-Grained Reasoning on Dense Charts
cs.CV 2026-05 unverdicted novelty 6.0

Chart-FR1 uses Focus-CoT for linking reasoning to visual cues and Focus-GRPO reinforcement learning with efficiency rewards to outperform prior MLLMs on dense chart reasoning tasks.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
See Further, Think Deeper: Advancing VLM's Reasoning Ability with Low-level Visual Cues and Reflection
cs.CV 2026-04 unverdicted novelty 6.0

ForeSight lets VLMs use low-level visual cues and mask-based visual feedback within an RL loop to reason more accurately, with the 7B model beating same-scale peers and some closed-source SOTA on a new benchmark.
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution
cs.CV 2026-04 unverdicted novelty 6.0

CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 6.0

SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
Thinking Before Matching: A Reinforcement Reasoning Paradigm Towards General Person Re-Identification
cs.CV 2026-04 unverdicted novelty 6.0

ReID-R achieves competitive person re-identification performance using chain-of-thought reasoning and reinforcement learning with only 14.3K non-trivial samples, about 20.9% of typical data scales, while providing int...
One Step Forward and K Steps Back: Better Reasoning with Denoising Recursion Models
cs.LG 2026-04 unverdicted novelty 6.0

Denoising Recursion Models train multi-step noise reversal in looped transformers and outperform the prior Tiny Recursion Model on ARC-AGI.
AVRT: Audio-Visual Reasoning Transfer through Single-Modality Teachers
cs.CV 2026-04 unverdicted novelty 6.0

AVRT transfers reasoning to audio-visual models by distilling traces from single-modality teachers via LLM merger followed by SFT cold-start and RL, achieving SOTA on OmniBench, DailyOmni, and MMAR with 3B/7B models.
Generalization in LLM Problem Solving: The Case of the Shortest Path
cs.AI 2026-04 unverdicted novelty 6.0

LLMs show strong spatial generalization to unseen maps in shortest-path tasks but fail length scaling due to recursive instability, with data coverage setting hard limits.