Thinking With Videos: Multimodal Tool-Augmented Reinforcement Learning for Long Video Reasoning

Bowen Zhang; Chixiang Ma; Chubin Zhang; Dongliang He; Haoji Zhang; Jiawen Li; Sule Bai; Xin Gu; Yansong Tang; Zhichao Zhou

arxiv: 2508.04416 · v2 · pith:AZBQ5LRZnew · submitted 2025-08-06 · 💻 cs.CV

classification 💻 cs.CV

keywords video · reasoning · learning · answering · grounding · long · multimodal · question

The video reasoning ability of multimodal large language models (MLLMs) is crucial for downstream tasks like video question answering and temporal grounding. While recent approaches have explored text-based chain-of-thought (CoT) reasoning for MLLMs, these methods often suffer from limited cross-modal interaction and increased hallucination, especially with longer videos or reasoning chains. To address these challenges, we propose Video Intelligence via Tool-Augmented Learning (VITAL), a novel end-to-end agentic video reasoning framework. With a visual toolbox, the model can densely sample new video frames on demand and generate multimodal CoT for precise long video reasoning. We observe that temporal grounding and question answering are mutually beneficial for video understanding tasks. Therefore, we construct two high-quality multi-task video reasoning datasets: MTVR-CoT-72k for supervised fine-tuning and MTVR-RL-110k for reinforcement learning. Moreover, we propose a Difficulty-aware Group Relative Policy Optimization algorithm (DGRPO) to mitigate difficulty imbalance in multi-task reinforcement learning. Extensive experiments on 11 challenging video understanding benchmarks demonstrate the advanced reasoning ability of VITAL, which outperforms existing methods in video question answering and temporal grounding tasks, especially in long video scenarios. Code is available at https://zhang9302002.github.io/thinkingwithvideos-page/.

Forward citations

Cited by 26 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
cs.CV 2026-05 unverdicted novelty 7.0

SVI-Bench provides 35K hours of sports video with 9 tasks across four cognitive levels, revealing models drop from ~74% on action QA to 5% on agentic evidence integration.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
cs.CV 2026-05 unverdicted novelty 7.0

Introduces CaST-Bench, a dataset of 2,066 causal questions on 1,015 videos with annotated causal chains and metrics to evaluate VLMs on spatio-temporal causal reasoning.
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
cs.CV 2026-05 unverdicted novelty 7.0

CaST-Bench creates a benchmark with causal-chain annotations and novel metrics showing that current VLMs struggle to construct precise grounded causal chains in video QA.
LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning
cs.CL 2026-05 unverdicted novelty 7.0

LatentOmni proposes a latent-space cross-modal reasoning framework that uses feature-level supervision and Omni-Sync Position Embedding to align and synchronize audio-visual latents, supported by a new 35K interleaved...
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 7.0

ParaVT is a parallel video tool-calling RL framework that resolves the Tool Prior Paradox via PARA-GRPO, delivering +7.9% average gains on six long-video benchmarks and raising format compliance from 0.13 to 0.64.
ReTool-Video: Recursive Tool-Using Video Agents with Meta-Augmented Tool Grounding
cs.CV 2026-05 unverdicted novelty 7.0

ReTool-Video uses a 134-tool meta-augmented library and recursive grounding to translate abstract video intents into fine-grained multimodal operations, outperforming baselines on MVBench, MLVU, and Video-MME.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
Listening with Time: Precise Temporal Awareness for Long-Form Audio Understanding
eess.AS 2026-04 unverdicted novelty 7.0

LAT-Audio introduces a global-to-local reasoning approach with TWA-CoT that outperforms prior models on temporal tasks for audio up to 30 minutes.
VideoThinker: Building Agentic VideoLLMs with LLM-Guided Tool Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

VideoThinker uses LLM-generated synthetic tool trajectories in caption space grounded to video frames to train agentic VideoLLMs that outperform baselines on long-video benchmarks.
DyCo-RL: Dynamic Cross-Modal Coordination for Visual Reasoning
cs.CV 2026-06 unverdicted novelty 6.0

DyCo-RL improves four RLVR algorithms on seven visual and math reasoning benchmarks by assigning tokens visual or text roles via Fisher-Rao geodesic distance on attention and reweighting advantages by role-alignment score.
SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence
cs.CV 2026-05 unverdicted novelty 6.0

SVI-Bench is a 35K-hour sports video benchmark with 9 tasks across four cognitive pillars that reveals multimodal models drop from ~73% on action QA to 5% on agentic evidence-gathering tasks.
DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding
cs.CV 2026-05 unverdicted novelty 6.0

DynFrame introduces tokenized learnable span-density retrieval and Segment-Decoupled GRPO in video MLLMs, achieving competitive or SOTA results on six benchmarks with 4B and 8B models.
ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
cs.CV 2026-05 unverdicted novelty 6.0

ParaVT introduces the first multi-agent RL framework for parallel video tool calling in LMMs, using PARA-GRPO to resolve the Tool Prior Paradox and achieve +7.9% average improvement over Qwen3-VL baseline across six b...
VideoSeeker: Incentivizing Instance-level Video Understanding via Native Agentic Tool Invocation
cs.CV 2026-05 unverdicted novelty 6.0

VideoSeeker integrates agentic reasoning and visual prompts into LVLMs via automated data synthesis, cold-start supervision, and RL training, yielding +13.7% gains on instance-level video tasks over baselines includin...
POINTS-Long: Adaptive Dual-Mode Visual Reasoning in MLLMs
cs.CV 2026-04 unverdicted novelty 6.0

POINTS-Long is a dual-mode multimodal large language model that uses dynamic visual token scaling to retain 97.7-99.7% accuracy on long-form tasks with 1/40 to 1/10th the tokens and supports streaming via detachable KV-cache.
The Past Is Not Past: Memory-Enhanced Dynamic Reward Shaping
cs.LG 2026-04 unverdicted novelty 6.0

MEDS improves LLM RL performance by up to 4.13 pass@1 and 4.37 pass@128 points by dynamically penalizing rollouts matching prevalent historical error clusters identified via memory-stored representations and density c...
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
cs.CV 2026-02 unverdicted novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
LongVT: Incentivizing "Thinking with Long Videos" via Native Tool Calling
cs.CV 2025-11 unverdicted novelty 6.0

LongVT adds native video-cropping tool calling to LMMs for interleaved multimodal chain-of-tool-thought reasoning on long videos and releases VideoSIAH data for training and evaluation.
MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding
cs.CV 2025-05 unverdicted novelty 6.0

MUSEG applies timestamp-aware multi-segment grounding with a phased-reward RL recipe to boost temporal grounding and time-sensitive video QA performance in MLLMs.
See More, Think Deeper: Query-Expanded Visual Evidence and Answer-Clue Guided Reflection for Long Video Understanding
cs.CV 2026-06 unverdicted novelty 5.0

CoVER framework lets Video-LLMs gather query-expanded visual evidence and verify answers with answer-clue visual feedback to improve long-video understanding.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD proposes structured self-distillation with a multi-dimensional judge model and direction-magnitude decoupling to improve token-level credit assignment and convergence speed in VideoLLM reasoning training.
MAG-3D: Multi-Agent Grounded Reasoning for 3D Understanding
cs.CV 2026-04 unverdicted novelty 5.0

MAG-3D is a training-free multi-agent framework that coordinates planning, grounding, and coding agents with off-the-shelf VLMs to achieve grounded 3D reasoning and state-of-the-art benchmark results.
OneThinker: All-in-one Reasoning Model for Image and Video
cs.CV 2025-12 unverdicted novelty 5.0

OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
Watch, Remember, Reason: Human-View Video Understanding with MLLMs
cs.CV 2026-06 unverdicted novelty 4.0

This is a survey that frames video MLLM research via a human-view formulation of perceptual representations, memory states, reasoning traces, and predictions, then reviews methods, datasets, benchmarks, and open problems.