Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm

2025 · cs.CV · arXiv 2511.04570

3 Pith papers cite this work. Polarity classification is still indexing.


abstract

The "Thinking with Text" and "Thinking with Images" paradigms significantly improve the reasoning abilities of large language models (LLMs) and Vision-Language Models (VLMs). However, these paradigms have inherent limitations: (1) images capture only single moments and fail to represent dynamic processes or continuous changes, and (2) text and vision are treated as separate modalities, which hinders unified multimodal understanding and generation. Therefore, we propose "Thinking with Video", a new paradigm that leverages video generation models such as Sora-2 to use video frames as a unified medium for multimodal reasoning. To support this exploration, we developed the Video Thinking Benchmark (VideoThinkBench), which covers both vision-centric tasks (e.g., Eyeballing Puzzles) and text-centric tasks (e.g., GSM8K and MMMU). Our evaluation on VideoThinkBench establishes Sora-2 as a capable reasoner. On vision-centric tasks, Sora-2 is comparable to state-of-the-art (SOTA) VLMs, and even surpasses GPT-5 by 10% on Eyeballing Puzzles. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 69.2% accuracy on MMMU. Furthermore, we systematically analyze the source of these abilities. We also find that self-consistency and in-context learning can improve Sora-2's performance. In summary, our findings show that video generation models are potential unified multimodal understanding and generation models, positioning "Thinking with Video" as a potential unified multimodal reasoning paradigm.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Video Models Can Reason with Verifiable Rewards

cs.CV · 2026-05-14 · unverdicted · novelty 6.0

VideoRLVR uses SDE-GRPO optimization, dense decomposed rewards, and Early-Step Focus to train video diffusion models on verifiable reasoning tasks, outperforming supervised fine-tuning and other video generators on Maze, FlowFree, and Sokoban.

OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model

cs.CV · 2026-04-22 · unverdicted · novelty 6.0

OMIBench reveals that current LVLMs achieve at most 50% on Olympiad problems requiring reasoning across multiple images.

Kling-Omni Technical Report

cs.CV · 2025-12-18 · unverdicted · novelty 6.0

Kling-Omni is a unified multimodal generative system that produces cinematic videos from diverse inputs by integrating generation, editing, and intelligent reasoning in a single end-to-end model.

citing papers explorer

Showing 3 of 3 citing papers.

Video Models Can Reason with Verifiable Rewards · cs.CV · 2026-05-14 · unverdicted · none · ref 35 · internal anchor
OMIBench: Benchmarking Olympiad-Level Multi-Image Reasoning in Large Vision-Language Model · cs.CV · 2026-04-22 · unverdicted · none · ref 54 · internal anchor
Kling-Omni Technical Report · cs.CV · 2025-12-18 · unverdicted · none · ref 29 · internal anchor
