MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu · 2025 · cs.CV · arXiv 2503.07365

Canonical reference. 83% of citing Pith papers cite this work as background.

50 Pith papers citing it

abstract

DeepSeek R1 and o1 have demonstrated powerful reasoning capabilities in the text domain through stable large-scale reinforcement learning. To enable broader applications, some works have attempted to transfer these capabilities to multimodal reasoning. However, these efforts have been constrained by the limited difficulty of the selected tasks and relatively small training scales, making it challenging to demonstrate strong multimodal reasoning abilities. To address this gap, we introduce the MMK12 dataset and MM-EUREKA with 7B and 32B parameters. The former is a high-quality multimodal mathematics reasoning dataset featuring diverse knowledge domains with human-verified answers and solution processes. The latter is a multimodal model employing rule-based reinforcement learning on MMK12, utilizing online filtering and a two-stage training strategy to enhance training stability. MM-EUREKA demonstrates remarkable performance gains in multimodal mathematical reasoning, outperforming previous powerful models like InternVL2.5-78B or InternVL2.5-38B-MPO. In particular, MM-EUREKA achieves competitive or superior performance compared to both open-source and closed-source models, and trails slightly behind o1 in multidisciplinary reasoning tasks. We open-source our complete pipeline to foster further research in this area. All our code, models, data, etc. are released at https://github.com/ModalMinds/MM-EUREKA

citation-role summary

background: 9 · baseline: 1 · dataset: 1 · method: 1

citation-polarity summary

background: 10 · baseline: 1 · use dataset: 1

representative citing papers

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling

cs.AI · 2026-05-12 · unverdicted · novelty 8.0

MM-OptBench is a solver-grounded benchmark showing current multimodal LLMs reach at most 52% pass@1 on generating correct optimization models from text-plus-visual problem specifications.

S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images

cs.CV · 2026-04-23 · unverdicted · novelty 8.0

S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.

Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era

cs.LG · 2026-05-17 · unverdicted · novelty 7.0

Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.

ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both

cs.CV · 2026-05-14 · unverdicted · novelty 7.0

ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.

CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves

cs.CV · 2026-05-13 · unverdicted · novelty 7.0 · 2 refs

CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33.3% on easy cases.

Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning

cs.CL · 2026-05-13 · unverdicted · novelty 7.0

Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.

Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning

cs.AI · 2026-01-14 · unverdicted · novelty 7.0

Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.

Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization

cs.CV · 2026-01-07 · unverdicted · novelty 7.0

GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.

High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning

cs.CV · 2025-07-08 · conditional · novelty 7.0

MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.

GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents

cs.CV · 2025-04-14 · unverdicted · novelty 7.0

GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.

R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization

cs.AI · 2025-03-17 · conditional · novelty 7.0

R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.

Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning

cs.CV · 2026-05-10 · unverdicted · novelty 7.0

RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.

SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning

cs.AI · 2026-05-10 · unverdicted · novelty 7.0 · 2 refs

SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.

CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding

cs.CV · 2026-04-24 · unverdicted · novelty 7.0

CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.

DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

cs.CV · 2025-05-20 · unverdicted · novelty 7.0

DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.

ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

cs.CV · 2026-05-19 · unverdicted · novelty 6.0 · 2 refs

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.

VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation

cs.CV · 2026-05-15 · unverdicted · novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.

20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone

cs.LG · 2026-05-12 · conditional · novelty 6.0 · 2 refs

Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

cs.CV · 2026-03-15 · unverdicted · novelty 6.0

Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.

LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling

cs.CV · 2025-11-25 · unverdicted · novelty 6.0

LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding

cs.CV · 2025-11-17 · unverdicted · novelty 6.0

REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.

Perception-Aware Policy Optimization for Multimodal Reasoning

cs.CL · 2025-07-08 · unverdicted · novelty 6.0

PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.

v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning

cs.CL · 2025-05-24 · unverdicted · novelty 6.0

v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.

citing papers explorer

Showing 50 of 50 citing papers.

MM-OptBench: A Solver-Grounded Benchmark for Multimodal Optimization Modeling cs.AI · 2026-05-12 · unverdicted · none · ref 2
MM-OptBench is a solver-grounded benchmark showing current multimodal LLMs reach at most 52% pass@1 on generating correct optimization models from text-plus-visual problem specifications.
S1-VL: Scientific Multimodal Reasoning Model with Thinking-with-Images cs.CV · 2026-04-23 · unverdicted · none · ref 12
S1-VL combines structured scientific reasoning with iterative image manipulation via code execution to reach state-of-the-art results on visual and scientific reasoning benchmarks.
Reasoning Portability: Guiding Continual Learning for MLLMs in the RLVR Era cs.LG · 2026-05-17 · unverdicted · none · ref 33 · internal anchor
Formalizes Reasoning Portability (RP) and proposes RDB-CL to modulate per-sample KL regularization in RLVR for MLLM continual learning, achieving +12.0% Last accuracy over vanilla RLVR baseline by preserving reusable reasoning on high-RP samples.
ATLAS: Agentic or Latent Visual Reasoning? One Word is Enough for Both cs.CV · 2026-05-14 · unverdicted · none · ref 12 · internal anchor
ATLAS uses a single functional token to unify agentic and latent visual reasoning without image generation or external execution.
CurveBench: A Benchmark for Exact Topological Reasoning over Nested Jordan Curves cs.CV · 2026-05-13 · unverdicted · none · ref 43 · 2 links · internal anchor
CurveBench is a new benchmark for recovering rooted containment trees from images of nested Jordan curves, where the strongest model reaches only 19.1% accuracy on hard cases and fine-tuning lifts an open model to 33.3% on easy cases.
Physics-R1: An Audited Olympiad Corpus and Recipe for Visual Physics Reasoning cs.CL · 2026-05-13 · unverdicted · none · ref 32 · internal anchor
Audited olympiad corpus and Physics-R1 recipe improve 8B VLM by up to 18 points on held-out physics problems while exposing contamination in prior evals.
Omni-R1: Towards the Unified Generative Paradigm for Multimodal Reasoning cs.AI · 2026-01-14 · unverdicted · none · ref 1 · internal anchor
Omni-R1 unifies multimodal reasoning by generating intermediate images during the process in a SFT-plus-RL framework, with an Omni-R1-Zero variant that matches or exceeds it using only text data.
Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization cs.CV · 2026-01-07 · unverdicted · none · ref 8 · internal anchor
GPRO trains a meta-controller on 790k failure-labeled samples to dynamically select fast, perception, or reasoning paths in LVLMs, yielding higher accuracy and shorter responses than prior slow-thinking methods.
High-Resolution Visual Reasoning via Multi-Turn Grounding-Based Reinforcement Learning cs.CV · 2025-07-08 · conditional · none · ref 27 · internal anchor
MGPO elicits grounding in LMMs via multi-turn RL with binary rewards, yielding 5.4% and 5.2% gains on MME-Realworld and V* Bench and surpassing GPT-4o on the latter after training on 21K samples.
GUI-R1 : A Generalist R1-Style Vision-Language Action Model For GUI Agents cs.CV · 2025-04-14 · unverdicted · none · ref 20 · internal anchor
GUI-R1 uses reinforcement fine-tuning with GRPO on a small curated dataset to create a generalist vision-language action model that outperforms prior GUI agent methods across mobile, desktop, and web benchmarks using only 0.02% of the data.
R1-VL: Learning to Reason with Multimodal Large Language Models via Step-wise Group Relative Policy Optimization cs.AI · 2025-03-17 · conditional · none · ref 27 · internal anchor
R1-VL uses StepGRPO with rule-based StepRAR and StepRVR rewards to let MLLMs learn step-by-step reasoning beyond imitation of positive paths.
Reflection Anchors for Propagation-Aware Visual Retention in Long-Chain Multimodal Reasoning cs.CV · 2026-05-10 · unverdicted · none · ref 9
RAPO uses an information-theoretic lower bound on visual gain to select high-entropy reflection anchors and optimizes a chain-masked KL surrogate, delivering gains over baselines on reasoning benchmarks across LVLM backbones.
SeePhys Pro: Diagnosing Modality Transfer and Blind-Training Effects in Multimodal RLVR for Physics Reasoning cs.AI · 2026-05-10 · unverdicted · none · ref 18 · 2 links
SeePhys Pro benchmark reveals multimodal models degrade on physics reasoning as information transfers from text to images, with blind training improvements often stemming from textual cues rather than visual evidence.
CGC: Compositional Grounded Contrast for Fine-Grained Multi-Image Understanding cs.CV · 2026-04-24 · unverdicted · none · ref 32
CGC improves fine-grained multi-image understanding in MLLMs by constructing contrastive training instances from existing single-image annotations and adding a rule-based spatial reward, achieving SOTA on MIG-Bench and VLM2-Bench with transfer gains to other multimodal tasks.
DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning cs.CV · 2025-05-20 · unverdicted · none · ref 14
DeepEyes uses reinforcement learning to teach vision-language models active perception and image-based thinking, yielding gains on perception, reasoning, grounding, and hallucination benchmarks.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning cs.CV · 2026-05-19 · unverdicted · none · ref 19 · 2 links · internal anchor
ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six benchmarks.
Beyond Mode Collapse: Distribution Matching for Diverse Reasoning cs.AI · 2026-05-19 · unverdicted · none · ref 24 · internal anchor
DMPO approximates forward KL minimization in on-policy RL by aligning the policy to a group-level reward-proportional target distribution, yielding 9-12% relative gains over GRPO on NP-Bench and smaller gains on math reasoning.
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation cs.CV · 2026-05-15 · unverdicted · none · ref 22 · internal anchor
VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines including GPT-4o.
20/20 Vision Language Models: A Prescription for Better VLMs through Data Curation Alone cs.LG · 2026-05-12 · conditional · none · ref 26 · 2 links · internal anchor
Data curation alone raises VLM accuracy by more than 11 points on average across many benchmarks while cutting required training compute by up to 87 times.
Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models cs.CV · 2026-03-15 · unverdicted · none · ref 18 · internal anchor
Attention dispersion during extended reasoning impairs MLLM perception on images, and a training-free VRGA framework mitigates it by selecting and reweighting visual attention heads using an entropy-focus criterion.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling cs.CV · 2025-11-25 · unverdicted · none · ref 31 · internal anchor
LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding cs.CV · 2025-11-17 · unverdicted · none · ref 30 · internal anchor
REVISOR adds multimodal visual-text reflection and a Dual Attribution Decoupled Reward to improve long-form video reasoning in MLLMs without extra supervised fine-tuning.
Perception-Aware Policy Optimization for Multimodal Reasoning cs.CL · 2025-07-08 · unverdicted · none · ref 18 · internal anchor
PAPO integrates perception-aware supervision via a KL-based loss into RLVR methods like GRPO, yielding 4.4-17.5% gains on multimodal benchmarks and 30.5% fewer perception errors, with larger gains on vision-heavy tasks.
v1: Learning to Point Visual Tokens for Multimodal Grounded Reasoning cs.CL · 2025-05-24 · unverdicted · none · ref 4 · internal anchor
v1 adds a point-and-copy mechanism for dynamic visual token referencing in multimodal reasoning, trained on a new 300K dataset with grounding annotations, and outperforms baselines on multimodal math tasks.
UI-R1: Enhancing Efficient Action Prediction of GUI Agents by Reinforcement Learning cs.AI · 2025-03-27 · accept · none · ref 8 · internal anchor
UI-R1 shows rule-based RL with GRPO on 136 GUI tasks improves a 3B MLLM's action prediction accuracy by 6-22% over its base model and matches larger SFT-trained models.
OpenVLThinker: Complex Vision-Language Reasoning via Iterative SFT-RL Cycles cs.CV · 2025-03-21 · conditional · none · ref 48 · internal anchor
Iterative SFT-RL cycles enable a 7B LVLM to develop sophisticated visual chain-of-thought reasoning and improve performance on math and general reasoning benchmarks.
MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning cs.CV · 2025-03-10 · conditional · none · ref 27 · internal anchor
MM-Eureka models trained via rule-based RL on the new MMK12 dataset achieve competitive or superior multimodal mathematical reasoning performance compared to models like InternVL2.5-78B while trailing slightly behind o1.
RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology cs.CV · 2026-05-11 · unverdicted · none · ref 82
RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 56 · 2 links
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
CharTide: Data-Centric Chart-to-Code Generation via Tri-Perspective Tuning and Inquiry-Driven Evolution cs.CV · 2026-04-24 · unverdicted · none · ref 22
CharTide decouples chart-to-code data into three perspectives and uses inquiry-driven RL with atomic QA verification to let smaller VLMs surpass GPT-4o on chart-to-code tasks.
SSL-R1: Self-Supervised Visual Reinforcement Post-Training for Multimodal Large Language Models cs.CV · 2026-04-22 · unverdicted · none · ref 44
SSL-R1 reformulates visual SSL tasks into verifiable puzzles to supply rewards for RL post-training of MLLMs, yielding gains on multimodal benchmarks without external supervision.
VL-Calibration: Decoupled Confidence Calibration for Large Vision-Language Models Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 2
VL-Calibration is a reinforcement learning method that separates visual and reasoning confidence in LVLMs via intrinsic visual certainty estimation to improve calibration and accuracy.
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward cs.CV · 2026-04-06 · unverdicted · none · ref 49
Saliency-R1 uses a novel saliency map technique and GRPO with human bounding-box overlap as reward to improve VLM reasoning faithfulness and interpretability.
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency cs.CV · 2025-08-25 · unverdicted · none · ref 90
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and agentic tasks.
VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model cs.CV · 2025-04-10 · unverdicted · none · ref 38
VLM-R1 applies R1-style RL using rule-based rewards on visual tasks with clear ground truth to achieve competitive performance and superior generalization over SFT in vision-language models.
Visually-Guided Policy Optimization for Multimodal Reasoning cs.CV · 2026-04-10 · unverdicted · none · ref 4 · 2 links · internal anchor
VGPO applies visual attention compensation via similarity and dual-grained advantage re-weighting to improve visual activation and performance in multimodal reasoning.
OneThinker: All-in-one Reasoning Model for Image and Video cs.CV · 2025-12-02 · unverdicted · none · ref 54 · internal anchor
OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis cs.AI · 2025-11-13 · unverdicted · none · ref 11 · internal anchor
A reasoning-driven problem generator plans synthesis directions with CoT and uses solver performance feedback to adapt difficulty, producing complementary problems that yield a 3.4% average improvement across 10 reasoning benchmarks.
Mixture-of-Visual-Thoughts: Exploring Context-Adaptive Reasoning Mode Selection for General Visual Reasoning cs.AI · 2025-09-26 · unverdicted · none · ref 28 · internal anchor
MoVT unifies different visual reasoning modes in a single model and uses the AdaVaR two-stage framework with supervised cold-start and RL via AdaGRPO to enable context-adaptive mode selection, yielding consistent gains on visual reasoning tasks.
Mini-o3: Scaling Up Reasoning Patterns and Interaction Turns for Visual Search cs.CV · 2025-09-09 · unverdicted · none · ref 26 · internal anchor
Mini-o3 scales visual search reasoning to tens of interaction turns via a new probe dataset, iterative trajectory collection, and over-turn masking in RL, claiming SOTA performance while training only up to six turns.
RealSR-R1: Reinforcement Learning for Real-World Image Super-Resolution with Vision-Language Chain-of-Thought cs.CV · 2025-06-20 · unverdicted · none · ref 37 · internal anchor
RealSR-R1 introduces VLCoT-GRPO with four rewards to add understanding and reasoning to real-world image super-resolution models.
Lingshu: A Generalist Foundation Model for Unified Multimodal Medical Understanding and Reasoning cs.CL · 2025-06-08 · unverdicted · none · ref 6 · 2 links · internal anchor
Lingshu is a medical-specialized multimodal LLM that outperforms prior open-source models on multimodal QA, text QA, and report generation after training on a large curated dataset of medical knowledge.
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding cs.CV · 2025-03-17 · unverdicted · none · ref 35 · internal anchor
Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.
Measure Twice, Click Once: Co-evolving Proposer and Visual Critic via Reinforcement Learning for GUI Grounding cs.LG · 2026-04-23 · unverdicted · none · ref 57
A co-evolving proposer-critic RL framework improves GUI grounding accuracy by letting the model critique its own proposals rendered on screenshots.
DT2IT-MRM: Debiased Preference Construction and Iterative Training for Multimodal Reward Modeling cs.AI · 2026-04-21 · unverdicted · none · ref 33
DT2IT-MRM proposes a debiased preference construction pipeline, T2I data reformulation, and iterative training to curate multimodal preference data, achieving SOTA on VL-RewardBench, Multimodal RewardBench, and MM-RLHF-RewardBench.
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding cs.AI · 2026-04-14 · unverdicted · none · ref 42 · 2 links
DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-efficient resolution allocation.
SignReasoner: Compositional Reasoning for Complex Traffic Sign Understanding via Functional Structure Units cs.CV · 2026-04-12 · unverdicted · none · ref 19
SignReasoner decomposes traffic signs into functional structure units and uses a two-stage VLM post-training pipeline to achieve state-of-the-art compositional reasoning on a new benchmark.
From System 1 to System 2: A Survey of Reasoning Large Language Models cs.AI · 2025-02-24 · accept · none · ref 257
The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.
RISE: Reliable Improvement in Self-Evolving Vision-Language Models cs.CV · 2026-05-20 · unreviewed · ref 29 · internal anchor
Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models cs.AI · 2026-04-11 · unreviewed · ref 101
