hub Baseline reference

OneThinker: All-in-one Reasoning Model for Image and Video

· 2025 · cs.CV · arXiv 2512.03043

Baseline reference. 71% of citing Pith papers use this work as a benchmark or comparison.

20 Pith papers citing it

Baseline 71% of classified citations

abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.
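The abstract describes EMA-GRPO only at a high level: it tracks task-wise moving averages of reward standard deviations so that tasks with differently scaled rewards are optimized in a balanced way. The sketch below is one possible reading of that sentence, not the authors' implementation. The class name `EmaStdTracker`, the decay value, the epsilon, the initialization from the first batch, and the choice to divide group-centered rewards by the smoothed per-task standard deviation are all assumptions made for illustration.

```python
from statistics import fmean, pstdev


class EmaStdTracker:
    """Hypothetical helper: per-task exponential moving average of reward std."""

    def __init__(self, decay: float = 0.9, eps: float = 1e-6):
        self.decay = decay   # assumed smoothing factor, not taken from the paper
        self.eps = eps       # assumed guard against division by zero
        self.ema_std = {}    # task name -> smoothed reward standard deviation

    def update(self, task: str, rewards: list[float]) -> float:
        """Fold the std of one rollout group into the task's running average."""
        batch_std = pstdev(rewards)
        if task not in self.ema_std:
            # Assumption: initialize from the first observed group.
            self.ema_std[task] = batch_std
        else:
            self.ema_std[task] = (
                self.decay * self.ema_std[task] + (1.0 - self.decay) * batch_std
            )
        return self.ema_std[task]

    def advantages(self, task: str, rewards: list[float]) -> list[float]:
        """Center rewards on the group mean, then scale by the task-level EMA std."""
        std = self.update(task, rewards)
        mean = fmean(rewards)
        return [(r - mean) / (std + self.eps) for r in rewards]


tracker = EmaStdTracker(decay=0.5)
qa_adv = tracker.advantages("qa", [0.0, 1.0])          # binary-style reward
box_adv = tracker.advantages("grounding", [0.2, 0.4])  # IoU-style reward
```

Under these assumptions, each task keeps its own scale estimate, so a task whose rewards are tightly clustered is not normalized with the spread of a task whose rewards swing between 0 and 1. The estimate also changes smoothly across steps rather than being recomputed from a single rollout group.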

citation-role summary

baseline 3 · background 2 · dataset 2

citation-polarity summary

baseline 3 · background 2 · use dataset 2

representative citing papers

From Web to Pixels: Bringing Agentic Search into Visual Perception

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 7.0 · 2 refs

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

cs.CV · 2026-01-30 · unverdicted · novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts

cs.CV · 2026-06-30 · unverdicted · novelty 6.0

CoLT replaces text-based chain-of-thought in MLLMs with 3-step latent thought chains supervised by a removable external decoder in forward and backward modes, yielding 10.1x faster inference on eight benchmarks.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise scoring methods.

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

cs.CV · 2026-05-17 · unverdicted · novelty 6.0 · 2 refs

IC-Seg is a multi-turn clarification framework with hierarchical GRPO optimization that resolves ambiguous queries in referring video object segmentation and introduces the Ambi-RVOS benchmark.

Video-Zero: Self-Evolution Video Understanding

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

Video-Zero is an annotation-free Questioner-Solver co-evolution framework that centers self-evolution on temporally localized evidence to improve video VLMs.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

cs.CV · 2026-02-04 · unverdicted · novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

cs.CL · 2026-02-04 · unverdicted · novelty 6.0

Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.

AdaTooler-V: Adaptive Tool-Use for Images and Videos

cs.CV · 2025-12-18 · conditional · novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

cs.RO · 2026-06-09 · unverdicted · novelty 5.0

Embodied-R1.5 is an 8B EFM achieving SOTA on 16 of 24 embodied VLM benchmarks, fine-tunable to outperform leading VLAs, with claimed zero-shot real-robot generalization.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

cs.CV · 2026-04-16 · unverdicted · novelty 5.0 · 2 refs

Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search

cs.CV · 2026-06-30 · unverdicted · novelty 4.0

SimpleSearch-VL improves Qwen3-VL multimodal agent baselines by 15.8-16 points on average using 7K total training examples and reaches parity with Gemini-3-Pro on the 30B variant.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

cs.CV · 2026-04-27 · unverdicted · novelty 4.0 · 2 refs

Position paper identifies structural challenges in applying generic agentic AI to Earth Observation and outlines design principles for EO-native agents focused on geospatial state and validity.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

citing papers explorer

Showing 19 of 19 citing papers after filters.

From Web to Pixels: Bringing Agentic Search into Visual Perception · cs.CV · 2026-05-12 · unverdicted · none · ref 39 · internal anchor
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs · cs.CL · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
Gen-Searcher: Reinforcing Agentic Search for Image Generation · cs.CV · 2026-03-30 · unverdicted · none · ref 14 · 2 links · internal anchor
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning · cs.CV · 2026-01-30 · unverdicted · none · ref 17 · internal anchor
CoLT: Teaching Multi-Modal Models to Think with Chain of Latent Thoughts · cs.CV · 2026-06-30 · unverdicted · none · ref 10 · internal anchor
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning · cs.CV · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment · cs.CV · 2026-05-20 · unverdicted · none · ref 36 · internal anchor
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification · cs.CV · 2026-05-17 · unverdicted · none · ref 12 · 2 links · internal anchor
Video-Zero: Self-Evolution Video Understanding · cs.CV · 2026-05-14 · unverdicted · none · ref 11 · internal anchor
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs · cs.CV · 2026-05-01 · unverdicted · none · ref 19 · 2 links · internal anchor
Co-Evolving Policy Distillation · cs.LG · 2026-04-29 · unverdicted · none · ref 23 · internal anchor
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning · cs.CV · 2026-04-12 · unverdicted · none · ref 11 · 3 links · internal anchor
Vision-aligned Latent Reasoning for Multi-modal Large Language Model · cs.CV · 2026-02-04 · unverdicted · none · ref 10 · internal anchor
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training · cs.CL · 2026-02-04 · unverdicted · none · ref 29 · internal anchor
Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models · cs.RO · 2026-06-09 · unverdicted · none · ref 24 · internal anchor
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding · cs.CV · 2026-04-16 · unverdicted · none · ref 14 · 2 links · internal anchor
SimpleSearch-VL: A Simple Recipe for Multimodal Agentic Deep Search · cs.CV · 2026-06-30 · unverdicted · none · ref 74 · internal anchor
Agentic AI for Remote Sensing: Technical Challenges and Research Directions · cs.CV · 2026-04-27 · unverdicted · none · ref 32 · 2 links · internal anchor
EasyVideoR1: Easier RL for Video Understanding · cs.CV · 2026-04-18 · unverdicted · none · ref 10 · internal anchor
