pith. sign in

arxiv: 2511.11113 · v2 · submitted 2025-11-14 · 💻 cs.CV · cs.AI· cs.LG

VIDEOP2R: Video Understanding from Perception to Reasoning

Pith reviewed 2026-05-17 22:21 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG
keywords video reasoningreinforcement fine-tuningchain of thoughtlarge video language modelsperceptionreasoningpolicy optimizationvideo benchmarks
0
0 comments X

The pith

VideoP2R shows that separating perception and reasoning processes in training large video models leads to superior performance on reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoP2R, a framework that applies reinforcement fine-tuning to large video language models by explicitly modeling perception and reasoning as separate steps. It generates a special chain-of-thought dataset that captures both processes and uses a modified optimization method that rewards each process independently. If this approach works as claimed, video models could better handle complex tasks like understanding events in videos and drawing conclusions from them. This matters because current video AI often mixes up seeing details with logical thinking, leading to errors in real-world applications such as surveillance or content analysis. The results indicate that perception outputs alone can support accurate reasoning without additional steps.

Core claim

The central claim is that by treating perception and reasoning as distinct processes in a two-stage reinforcement fine-tuning setup, video language models can achieve better understanding and reasoning. Specifically, a three-step pipeline creates a 162K dataset of process-aware chain-of-thought examples, and a process-aware group relative policy optimization algorithm assigns separate rewards to perception and reasoning phases. This yields state-of-the-art results on six of seven video benchmarks and confirms that perception outputs contain enough information for downstream reasoning.

What carries the argument

The key machinery is the process-aware group relative policy optimization (PA-GRPO) that provides separate rewards for perception and reasoning, supported by a generated process-aware chain-of-thought dataset from a three-step pipeline.

Load-bearing premise

That supplying separate rewards for perception and reasoning in PA-GRPO, combined with the generated process-aware CoT data, produces genuine improvements rather than artifacts of reward design or data generation choices.

What would settle it

A direct comparison where the same model is trained with standard group relative policy optimization without separate perception and reasoning rewards, and it matches or exceeds the reported performance, would challenge the necessity of the process-aware approach.

Figures

Figures reproduced from arXiv: 2511.11113 by Jayakrishnan Unnikrishnan, Rui Zhao, Toufiq Parag, Yifan Jiang, Yueying Wang, Zhenyu Liao, Zhimin Chen.

Figure 1
Figure 1. Figure 1: Comparison between GRPO-based video RFT framework (process-agnostic) and V [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Illustration of overall VIDEOP2R RFT framework (left) and the three-step CoT generation pipeline (right). 56, 71]. Time-R1 uses timestamp-aware and template rewards [56]; Video-R1 and STAR-R1 reward sensitiv￾ity to correct temporal order [14, 28]; Videochat-R1 and VersaVid-R1 adopt task-specific rewards [7, 26]; Vide￾oRFT adds a stage-aware semantic reward [52]. How￾ever, most prior efforts model video rea… view at source ↗
Figure 3
Figure 3. Figure 3: The illustration of the PA-GRPO algorithm. [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Effect of perception on downstream reasoning [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Success (Left) and Failure (Right) case of V [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Training Dynamics and Think-Answer Mismatch [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Prompt Template for Observation Sufficiency Veri [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
Figure 7
Figure 7. Figure 7: Prompt Template for Process-aware CoT Generation. [PITH_FULL_IMAGE:figures/full_fig_p013_7.png] view at source ↗
Figure 10
Figure 10. Figure 10: Word length (Left) and Word cloud (Right) Visual [PITH_FULL_IMAGE:figures/full_fig_p014_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: An Annotation Example of the Video QA Sample [PITH_FULL_IMAGE:figures/full_fig_p014_11.png] view at source ↗
Figure 9
Figure 9. Figure 9: Embeddings visualization of VIDEOP2R-CoT-162K 7.5. Annotation Examples We provide annotation examples in Figs. 11 and 12 to il￾lustrate how our annotations explicitly separate percep￾tion from reasoning [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 14
Figure 14. Figure 14: Prompt Template for Answer Extraction. 10.3. Examples of Qwen Inference Output We present examples of Qwen’s outputs under differ￾ent configurations in our perception examination experi￾ment in [PITH_FULL_IMAGE:figures/full_fig_p016_14.png] view at source ↗
Figure 13
Figure 13. Figure 13: Prompt Template for Perception Examination Ex [PITH_FULL_IMAGE:figures/full_fig_p016_13.png] view at source ↗
Figure 15
Figure 15. Figure 15: RL training Dynamics of VIDEOP2R model consistently adheres to the process-aware infer￾ence template and maintains stable format compliance. Since the length reward is conditioned on both accuracy and format rewards, we instead visualize the lengths of the perception and reasoning segments during RL. We observe an initial increase followed by a decrease in both segments, indicating that the model adaptive… view at source ↗
Figure 16
Figure 16. Figure 16: Example of Think-Answer Mismatch. consistent with its original design, while VersaVid-R1 has too few available traces on VSI-Bench for meaning￾ful statistics. All results are computed on the multiple￾choice subsets of each benchmark. 13. More Qualitative Results of VIDEOP2R 13.1. Success Case We provide two additional success cases of VIDEOP2R in [PITH_FULL_IMAGE:figures/full_fig_p018_16.png] view at source ↗
Figure 18
Figure 18. Figure 18: Failure Cases of Overly detailed visual configura [PITH_FULL_IMAGE:figures/full_fig_p019_18.png] view at source ↗
Figure 19
Figure 19. Figure 19: Success Cases of VIDEOP2R <observation>Based on the question, I need to notice the location where the cat stays the longest in the video. Observing the video, the cat is seen initially on a stool, then it jumps down and walks around the room. It briefly interacts with the robot and the orange robot, but most of its time is spent on the stool. The cat does not stay on the carpet, in the cat's nest, or on t… view at source ↗
Figure 20
Figure 20. Figure 20: From base Qwen2.5-VL-7B to VIDEOP2R-SFT and VIDEOP2R: a representative example illustrating the stepwise improvement in model’s perception and reasoning. 8 [PITH_FULL_IMAGE:figures/full_fig_p020_20.png] view at source ↗
Figure 21
Figure 21. Figure 21: Examples of Perception Examination: Top Left: Qwen with the text question only; Top Right: Qwen with the text [PITH_FULL_IMAGE:figures/full_fig_p021_21.png] view at source ↗
read the original abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning. Our project page is available at https://videop2r.github.io/videop2r/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VideoP2R, a process-aware reinforcement fine-tuning (RFT) framework for large video language models (LVLMs). It consists of an SFT stage that generates the VideoP2R-CoT-162K process-aware chain-of-thought dataset via a three-step pipeline, and an RL stage that applies a novel process-aware group relative policy optimization (PA-GRPO) algorithm supplying separate rewards for perception and reasoning. The central empirical claim is that this design achieves state-of-the-art performance on six out of seven video reasoning and understanding benchmarks, with ablations confirming the effectiveness of process-aware modeling and that perception outputs are information-sufficient for downstream reasoning.

Significance. If the reported gains hold under rigorous verification, the work would be a meaningful contribution to multimodal video reasoning by explicitly separating perception and reasoning processes in RFT, rather than treating them as a single signal. The release of a 162K-scale process-aware CoT dataset and the PA-GRPO variant could serve as reusable resources for the community. The ablation results on information sufficiency add a useful diagnostic dimension, though the overall significance remains tied to the reproducibility and orthogonality of the claimed improvements.

major comments (2)
  1. [RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.
  2. [Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.
minor comments (1)
  1. [Abstract / Method overview] The abstract and method sections use 'process-aware' repeatedly without a concise definition or diagram early in the paper; a single schematic showing the perception/reasoning split would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments help clarify the presentation of the process-aware contributions and strengthen the empirical claims. We address each major comment below and have revised the manuscript accordingly to improve clarity, add missing details, and enhance experimental rigor.

read point-by-point responses
  1. Referee: [RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.

    Authors: We agree that the explicit formulation and computation details are essential to substantiate the orthogonality of the process-aware rewards. In the revised manuscript, Section 3.2 now includes the full mathematical definition of PA-GRPO: the joint objective is optimized with separate scalar rewards R_perception (computed from independent frame-level ground-truth annotations generated during the three-step VideoP2R-CoT-162K pipeline) and R_reasoning (derived from final-answer correctness plus step-wise CoT consistency checks that operate on the reasoning trace rather than raw embeddings). We have added Algorithm 1 (pseudocode) in the appendix that shows the group-relative normalization is performed independently per reward type before the combined advantage is used for the policy update. Ablation Table 4 already demonstrates that ablating either reward produces non-overlapping performance drops, supporting that the signals are not redundant. These additions directly address the concern that gains could be explained by dataset quality alone. revision: yes

  2. Referee: [Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.

    Authors: We acknowledge that the original submission lacked sufficient statistical detail. The revised Experimental Results section (Section 4) now reports mean and standard deviation over three independent random seeds for all main-table entries. We have added paired t-tests (p < 0.05) confirming that the reported gains over the strongest baselines are statistically significant on the six benchmarks where VideoP2R is SOTA. Benchmark splits and evaluation protocols are now explicitly stated to match the official test sets and metrics released by each benchmark (e.g., Video-MME, MVBench, etc.). A new paragraph in the appendix discusses hyperparameter sensitivity and confirms that the process-aware gains remain stable across reasonable ranges of the GRPO hyperparameters. These changes allow readers to assess whether the improvements are attributable to the proposed design rather than tuning artifacts. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework

full rationale

The paper describes an empirical two-stage RFT pipeline: a three-step data generation process yielding VideoP2R-CoT-162K followed by PA-GRPO that supplies separate perception and reasoning rewards. Central claims rest on benchmark accuracy gains and ablation studies rather than any closed mathematical derivation. No equations are presented that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation or prior ansatz by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions from LLM reinforcement fine-tuning literature plus the new empirical claim that separate perception/reasoning rewards improve outcomes; no explicit free parameters, axioms, or invented entities are introduced beyond the named algorithm and dataset.

pith-pipeline@v0.9.0 · 5540 in / 1093 out tokens · 24320 ms · 2026-05-17T22:21:35.173507+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering

    cs.CV 2026-04 unverdicted novelty 5.0

    OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 1 Pith paper · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1

  2. [2]

    The curious case of nonverbal abstract reasoning with multi-modal large language models.arXiv preprint arXiv:2401.12117, 2024

    Kian Ahrabian, Zhivar Sourati, Kexuan Sun, Jiarui Zhang, Yifan Jiang, Fred Morstatter, and Jay Pujara. The curious case of nonverbal abstract reasoning with multi-modal large language models.arXiv preprint arXiv:2401.12117, 2024. 1, 2

  3. [3]

    Claude 3, 2024

    Anthropic. Claude 3, 2024. 4

  4. [4]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 3, 5

  5. [5]

    Reasoning language models: A blueprint

    Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Rea- soning language models: A blueprint.arXiv preprint arXiv:2501.11223, 2025. 2

  6. [6]

    Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025

    Mustafa Chasmai, Gauri Jagatap, Gouthaman KV , Grant Van Horn, Subhransu Maji, and Andrea Fanelli. Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025. 6

  7. [7]

    VersaVid-R1: A Versatile Video Understanding and Reasoning Model from Question Answering to Captioning Tasks,

    Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A ver- satile video understanding and reasoning model from question answering to captioning tasks.arXiv preprint arXiv:2506.09079, 2025. 1, 2, 3, 5

  8. [8]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advanc- ing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 5

  9. [9]

    Chatbot arena: An open platform for evaluating llms by human preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anas- tasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first In- ternational Conference on Machine Learning, 2024. 4

  10. [10]

    Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017

    Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017. 8

  11. [11]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024. 3

  12. [12]

    Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators

    Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tat- sunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024. 4

  13. [13]

    Video- 9 of-thought: step-by-step video reasoning from percep- tion to cognition

    Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong Li Lee, and Wynne Hsu. Video- 9 of-thought: step-by-step video reasoning from percep- tion to cognition. InProceedings of the 41st Interna- tional Conference on Machine Learning, pages 13109– 13125, 2024. 2

  14. [14]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 1, 2, 3, 4, 5

  15. [15]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi- modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yun- hang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi- modal llms in video analysis. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1, 2, 5, 3

  16. [16]

    A Survey on LLM-as-a-Judge

    Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xue- hao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024. 2, 4

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 4, 7, 8

  18. [18]

    Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection

    Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26181–26191,

  19. [19]

    Self-adaptive sampling for accurate video question answering on image text models

    Wei Han, Hui Chen, Min-Yen Kan, and Soujanya Po- ria. Self-adaptive sampling for accurate video question answering on image text models. InFindings of the As- sociation for Computational Linguistics: NAACL 2024, pages 2522–2534, 2024. 6

  20. [20]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826,

  21. [21]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 1

  22. [22]

    Marvel: Mul- tidimensional abstraction and reasoning through visual evaluation and learning.Advances in Neural Information Processing Systems, 37:46567–46592, 2024

    Yifan Jiang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara, et al. Marvel: Mul- tidimensional abstraction and reasoning through visual evaluation and learning.Advances in Neural Information Processing Systems, 37:46567–46592, 2024. 1, 2

  23. [23]

    Gonza- lez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonza- lez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with pagedat- tention. InProceedings of the ACM SIGOPS 29th Sym- posium on Operating Systems Principles, 2023. 3

  24. [24]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 5

  25. [25]

    Mvbench: A comprehensive multi-modal video under- standing benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 22195–22206, 2024. 2, 5, 3

  26. [26]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 1, 2, 3, 4, 5

  27. [27]

    Graph prompts: Adapting video graph for video question answering

    Yiming Li, Xiaoshan Yang, Bing-Kun Bao, and Chang- sheng Xu. Graph prompts: Adapting video graph for video question answering. 2025. 2

  28. [28]

    Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

    Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation rea- soning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025. 3

  29. [29]

    Self-Rewarding Vision-Language Model via Reasoning Decomposition

    Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision- language model via reasoning decomposition.arXiv preprint arXiv:2508.19652, 2025. 3

  30. [30]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri- son Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023. 4

  31. [31]

    Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023. 6

  32. [32]

    Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Lin- guistics ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Lin- guistics ACL 2024, pages 8731–8772, 2024. 1, 2, 5, 3

  33. [33]

    Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,

    Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,

  34. [34]

    Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H

    Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967,

  35. [35]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

    Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018. 2

  36. [36]

    Videocap-r1: Enhancing mllms for video captioning via structured thinking.arXiv preprint arXiv:2506.01725, 2025

    Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, et al. Videocap-r1: Enhancing mllms for video captioning via structured thinking.arXiv preprint arXiv:2506.01725, 2025. 2 10

  37. [37]

    Morevqa: Exploring modular rea- soning models for video question answering

    Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular rea- soning models for video question answering. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245, 2024. 2

  38. [38]

    Training language models to follow instructions with hu- man feedback.Advances in neural information process- ing systems, 35:27730–27744, 2022

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with hu- man feedback.Advances in neural information process- ing systems, 35:27730–27744, 2022. 3

  39. [39]

    Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025

    Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2

  40. [40]

    Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.arXiv preprint arXiv:2504.07956, 2025

    Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.arXiv preprint arXiv:2504.07956, 2025. 2, 5, 3

  41. [41]

    Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training

    Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3284– 3294, 2025. 2

  42. [42]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms.arXiv preprint arXiv:1707.06347, 2017. 4, 7

  43. [43]

    Squinting at vqa models: Introspecting vqa models with sub-questions

    Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020. 1, 2, 4

  44. [44]

    Mitigating think-answer mismatch in llm reason- ing through noise-aware advantage reweighting.arXiv preprint arXiv:2508.05928, 2025

    Si Shen, Peijun Shen, Wenhua Zhao, and Danhao Zhu. Mitigating think-answer mismatch in llm reason- ing through noise-aware advantage reweighting.arXiv preprint arXiv:2508.05928, 2025. 8

  45. [45]

    En- hancing video-llm reasoning via agent-of-thoughts distil- lation

    Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. En- hancing video-llm reasoning via agent-of-thoughts distil- lation. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 8523–8533, 2025. 2

  46. [46]

    Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460–9471, 2022

    Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460–9471, 2022. 8

  47. [47]

    Modularized self- reflected video reasoner for multimodal llm with appli- cation to video question answering

    Zihan Song, Xin Wang, Zi Qian, Hong Chen, Longtao Huang, Hui Xue, and Wenwu Zhu. Modularized self- reflected video reasoner for multimodal llm with appli- cation to video question answering. InForty-second In- ternational Conference on Machine Learning. 2

  48. [48]

    Core knowledge.Developmental science, 10(1):89–96, 2007

    Elizabeth S Spelke and Katherine D Kinzler. Core knowledge.Developmental science, 10(1):89–96, 2007. 1, 2

  49. [49]

    Video understanding with large lan- guage models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025

    Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large lan- guage models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025. 1

  50. [50]

    Eyes wide shut? ex- ploring the visual shortcomings of multimodal llms

    Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? ex- ploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 1, 2

  51. [51]

    Srpo: Enhancing multimodal llm reasoning via reflection-aware rein- forcement learning.arXiv preprint arXiv:2506.01713,

    Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learn- ing.arXiv preprint arXiv:2506.01713, 2025. 7

  52. [52]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capabil- ity in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025. 1, 2, 3, 4, 5

  53. [53]

    Videoagent: Long-form video understand- ing with large language model as agent

    Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understand- ing with large language model as agent. InEuropean Conference on Computer Vision, pages 58–76. Springer,

  54. [54]

    Truly proximal policy optimization

    Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. InUncertainty in artificial intelli- gence, pages 113–122. PMLR, 2020. 7

  55. [55]

    Stair: spatial-temporal reasoning with auditable intermediate results for video question answering

    Yueqian Wang, Yuxuan Wang, Kai Chen, and Dongyan Zhao. Stair: spatial-temporal reasoning with auditable intermediate results for video question answering. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 19215–19223, 2024. 2

  56. [56]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 1, 3, 5

  57. [57]

    Language mod- els with image descriptors are strong few-shot video- language learners.Advances in Neural Information Pro- cessing Systems, 35:8483–8497, 2022

    Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language mod- els with image descriptors are strong few-shot video- language learners.Advances in Neural Information Pro- cessing Systems, 35:8483–8497, 2022. 2

  58. [58]

    Videochat-a1: Thinking with long videos by chain-of-shot reasoning.arXiv preprint arXiv:2506.06097, 2025

    Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025. 2

  59. [59]

    Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 1

  60. [60]

    Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

    Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025. 3

  61. [61]

    Video graph transformer for video question answer- ing

    Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answer- ing. InEuropean Conference on Computer Vision, pages 39–58. Springer, 2022. 2 11

  62. [62]

    Thinking in space: How multimodal large language models see, remember, and recall spaces

    Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 10632– 10643, 2025. 2, 5, 3, 7

  63. [63]

    Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025

    En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zin- ing Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xi- angyu Zhang, Jingyu Wang, et al. Unhackable tempo- ral rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025. 5

  64. [64]

    arXiv preprint arXiv:2504.07954 , year =

    En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering percep- tion policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025. 3

  65. [65]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gao- hong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025. 7

  66. [66]

    Perception matters: Detecting per- ception failures of vqa models using metamorphic test- ing

    Yuanyuan Yuan, Shuai Wang, Mingyue Jiang, and Tsong Yueh Chen. Perception matters: Detecting per- ception failures of vqa models using metamorphic test- ing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16908– 16917, 2021. 1, 2, 4

  67. [67]

    Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment

    Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi- turn reasoning in llm agents via turn-level credit assign- ment.arXiv preprint arXiv:2505.11821, 2025. 5

  68. [68]

    A simple llm framework for long-range video question-answering

    Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question- answering.arXiv preprint arXiv:2312.17235, 2023. 2

  69. [69]

    Long Context Transfer from Language to Vision

    Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Hao- ran Tan, Chunyuan Li, and Ziwei Liu. Long con- text transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024. 5

  70. [70]

    Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848,

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025. 7

  71. [71]

    Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video rea- soning.arXiv preprint arXiv:2504.09641, 2025. 3

  72. [72]

    Mmvu: Measuring expert- level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert- level multi-discipline video understanding. InProceed- ings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025. 2, 5, 3 12 VIDEOP2R: Video Understanding from Perception to Reas...

  73. [73]

    Prompt Used Figure 7 illustrates the prompt template for process- aware CoT generation

    Details of Process-Aware CoT Generation and Data Analysis 7.1. Prompt Used Figure 7 illustrates the prompt template for process- aware CoT generation. We employ Qwen2.5-VL-72B- Instruct with a temperature of 0 for the generation. Prompt Template for Process-aware CoT Generation {Question} You are required to answer the question using the visual content pr...

  74. [74]

    Carefully read the question and the correct answer

  75. [75]

    Briefly explain whether (and how) the observations support the correct answer

  76. [76]

    video”, “person

    Finally output your judgement, either <judgement>Yes</judgement> or <judgement>No</judgement>. Figure 8. Prompt Template for Observation Sufficiency Veri- fication. We use the same prompt for perception correctness judgment in RL stage. when applicable (e.g., for multiple-choice questions). In the subsequent CoT Verification stage, task-specific accuracy ...

  77. [77]

    Annotation Example of the Image QA Sample

    The calculation seems correct.\n</think> <answer>55</answer> Figure 12. Annotation Example of the Image QA Sample

  78. [78]

    Implementation Details The whole two-stage training is conducted on 8× NVIDIA A800 GPUs

    Experiment Set up 8.1. Implementation Details The whole two-stage training is conducted on 8× NVIDIA A800 GPUs. For efficiency, we limit the video input to 16 frames at a resolution of 128 × 28 × 28 dur- ing training, where 28×28 denotes the patch size and 2 Table 3. Distribution of question types across VIDEOP2R-CoT-162K. Question Type SumMultiple Choice...

  79. [79]

    We conduct the same two-stage training process, but only change the Claude3.7 to Llama3.1 [11] families for providing per- ception correctness judgment

    Ablation Study on Judge Model Table 4 presents the results of using different judge mod- els for perception correctness judgement. We conduct the same two-stage training process, but only change the Claude3.7 to Llama3.1 [11] families for providing per- ception correctness judgment. Compared with the base model, all VIDEOP2R vari- 3 ants using different j...

  80. [80]

    Prompt for Qwen Inference

    Details of the Perception Examination 10.1. Prompt Used and Detailed Set up The perception examination experiment involves three types of experiments on either text or video domains. We compare the zero-shot performance of Qwen2.5-VL- 7B across different input settings and examine how per- ception segments influence its answers: (i) performance on text-on...

Showing first 80 references.