hub Baseline reference

OneThinker: All-in-one Reasoning Model for Image and Video

· 2025 · cs.CV · arXiv 2512.03043

Baseline reference. 71% of citing Pith papers use this work as a benchmark or comparison.

16 Pith papers citing it

Baseline 71% of classified citations

open full Pith review browse 16 citing papers arXiv PDF

abstract

Reinforcement learning (RL) has recently achieved remarkable success in eliciting visual reasoning within Multimodal Large Language Models (MLLMs). However, existing approaches typically train separate models for different tasks and treat image and video reasoning as disjoint domains. This results in limited scalability toward a multimodal reasoning generalist, which restricts practical versatility and hinders potential knowledge sharing across tasks and modalities. To this end, we propose OneThinker, an all-in-one reasoning model that unifies image and video understanding across diverse fundamental visual tasks, including question answering, captioning, spatial and temporal grounding, tracking, and segmentation. To achieve this, we construct the OneThinker-600k training corpus covering all these tasks and employ commercial models for CoT annotation, resulting in OneThinker-SFT-340k for SFT cold start. Furthermore, we propose EMA-GRPO to handle reward heterogeneity in multi-task RL by tracking task-wise moving averages of reward standard deviations for balanced optimization. Extensive experiments on diverse visual benchmarks show that OneThinker delivers strong performance on 31 benchmarks, across 10 fundamental visual understanding tasks. Moreover, it exhibits effective knowledge transfer between certain tasks and preliminary zero-shot generalization ability, marking a step toward a unified multimodal reasoning generalist. All code, model, and data are released.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

baseline 3 background 2 dataset 2

citation-polarity summary

baseline 3 background 2 use dataset 2

representative citing papers

From Web to Pixels: Bringing Agentic Search into Visual Perception

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.

Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs

cs.CL · 2026-05-10 · unverdicted · novelty 7.0

LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.

Gen-Searcher: Reinforcing Agentic Search for Image Generation

cs.CV · 2026-03-30 · unverdicted · novelty 7.0 · 2 refs

Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.

CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

cs.CV · 2026-01-30 · unverdicted · novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

cs.CV · 2026-05-21 · unverdicted · novelty 6.0

CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.

VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment

cs.CV · 2026-05-20 · unverdicted · novelty 6.0

VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise scoring methods.

Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs

cs.CV · 2026-05-01 · unverdicted · novelty 6.0 · 2 refs

PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.

Co-Evolving Policy Distillation

cs.LG · 2026-04-29 · unverdicted · novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.

Agentic AI for Remote Sensing: Technical Challenges and Research Directions

cs.CV · 2026-04-27 · unverdicted · novelty 6.0 · 2 refs

Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware evaluation.

Visual Enhanced Depth Scaling for Multimodal Latent Reasoning

cs.CV · 2026-04-12 · unverdicted · novelty 6.0 · 3 refs

Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.

Vision-aligned Latent Reasoning for Multi-modal Large Language Model

cs.CV · 2026-02-04 · unverdicted · novelty 6.0

VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.

Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training

cs.CL · 2026-02-04 · unverdicted · novelty 6.0

Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.

AdaTooler-V: Adaptive Tool-Use for Images and Videos

cs.CV · 2025-12-18 · conditional · novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding

cs.CV · 2026-04-16 · unverdicted · novelty 5.0 · 2 refs

Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.

EasyVideoR1: Easier RL for Video Understanding

cs.CV · 2026-04-18 · unverdicted · novelty 4.0

EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.

Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification

cs.CV · 2026-05-17

citing papers explorer

Showing 16 of 16 citing papers.

From Web to Pixels: Bringing Agentic Search into Visual Perception cs.CV · 2026-05-12 · unverdicted · none · ref 39 · internal anchor
WebEye benchmark and Pixel-Searcher agent enable visual perception tasks by using web search to resolve object identities before precise localization or answering.
Perception Without Engagement: Dissecting the Causal Discovery Deficit in LMMs cs.CL · 2026-05-10 · unverdicted · none · ref 24 · internal anchor
LMMs perceive videos but underexploit visual content for causal reasoning due to textual shortcuts; ProCauEval diagnoses this and ADPO training reduces reliance on priors.
Gen-Searcher: Reinforcing Agentic Search for Image Generation cs.CV · 2026-03-30 · unverdicted · none · ref 14 · 2 links · internal anchor
Gen-Searcher is the first trained search-augmented image generation agent using SFT followed by GRPO reinforcement learning with dual text-image rewards, delivering 15-16 point gains on knowledge-intensive benchmarks.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning cs.CV · 2026-01-30 · unverdicted · none · ref 17 · internal anchor
CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning cs.CV · 2026-05-21 · unverdicted · none · ref 10 · internal anchor
CRPO applies counterfactual videos and a cross-branch relation reward in RL post-training to reduce shortcut reliance in Video LLMs, with gains shown on the new DyBench paired benchmark.
VersusQ: Pairwise Margin Reasoning for Generalizable Video Quality Assessment cs.CV · 2026-05-20 · unverdicted · none · ref 36 · internal anchor
VersusQ introduces a pairwise margin reasoning framework using large multimodal models to predict signed continuous quality margins between video pairs, claiming improved cross-domain generalization over pointwise scoring methods.
Persistent Visual Memory: Sustaining Perception for Deep Generation in LVLMs cs.CV · 2026-05-01 · unverdicted · none · ref 19 · 2 links · internal anchor
PVM adds a parallel branch to LVLMs that directly supplies visual embeddings to prevent attention decay over long generated sequences, yielding accuracy gains on reasoning tasks with minimal overhead.
Co-Evolving Policy Distillation cs.LG · 2026-04-29 · unverdicted · none · ref 23 · internal anchor
CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific experts on text-image-video reasoning.
Agentic AI for Remote Sensing: Technical Challenges and Research Directions cs.CV · 2026-04-27 · unverdicted · none · ref 32 · 2 links · internal anchor
Agentic AI faces structural challenges in remote sensing due to geospatial data properties and workflow constraints, requiring EO-native agents built around structured state, tool-aware reasoning, and validity-aware evaluation.
Visual Enhanced Depth Scaling for Multimodal Latent Reasoning cs.CV · 2026-04-12 · unverdicted · none · ref 11 · 3 links · internal anchor
Visual replay module and adaptive depth scaling improve multimodal latent reasoning, reaching SOTA benchmarks with faster inference than explicit chain-of-thought methods.
Vision-aligned Latent Reasoning for Multi-modal Large Language Model cs.CV · 2026-02-04 · unverdicted · none · ref 10 · internal anchor
VaLR generates vision-aligned latent tokens before each reasoning step to preserve perceptual cues, improving VSI-Bench accuracy from 33.0% to 52.9%.
Dual Tuning for Reasoning Efficacy-Driven Data Curation in Multimodal LLM Training cs.CL · 2026-02-04 · unverdicted · none · ref 29 · internal anchor
Dual Tuning is a data curation method that jointly scores training examples for benefit and for reasoning-gain to choose between reasoning and direct-answer post-training modes for multimodal LLMs.
AdaTooler-V: Adaptive Tool-Use for Images and Videos cs.CV · 2025-12-18 · conditional · none · ref 12 · internal anchor
AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
Chain-of-Glimpse: Search-Guided Progressive Object-Grounded Reasoning for Video Understanding cs.CV · 2026-04-16 · unverdicted · none · ref 14 · 2 links · internal anchor
Chain-of-Glimpse is a reinforcement-learning-based framework that iteratively grounds visual evidence regions to enable multi-step object-aware reasoning in videos.
EasyVideoR1: Easier RL for Video Understanding cs.CV · 2026-04-18 · unverdicted · none · ref 10 · internal anchor
EasyVideoR1 delivers an optimized RL pipeline for video understanding in large vision-language models, achieving 1.47x throughput gains and aligned results on 22 benchmarks.
Don't Guess, Just Ask: Resolving Ambiguity in Referring Segmentation via Multi-turn Clarification cs.CV · 2026-05-17 · unreviewed · ref 12 · internal anchor

OneThinker: All-in-one Reasoning Model for Image and Video

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer