VIDEOP2R: Video Understanding from Perception to Reasoning

Jayakrishnan Unnikrishnan; Rui Zhao; Toufiq Parag; Yifan Jiang; Yueying Wang; Zhenyu Liao; Zhimin Chen

arXiv: 2511.11113 · v2 · submitted 2025-11-14 · cs.CV · cs.AI · cs.LG

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

Pith reviewed 2026-05-17 22:21 UTC · model grok-4.3

classification: cs.CV · cs.AI · cs.LG

keywords: video reasoning · reinforcement fine-tuning · chain of thought · large video language models · perception · reasoning · policy optimization · video benchmarks

The pith

VideoP2R shows that separating perception and reasoning processes in training large video models leads to superior performance on reasoning benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VideoP2R, a framework that applies reinforcement fine-tuning (RFT) to large video language models (LVLMs) by explicitly modeling perception and reasoning as separate steps. It builds a process-aware chain-of-thought dataset that records both steps and trains with a modified policy optimization method that rewards each step separately. If the approach works as claimed, video models should better handle tasks that require first identifying what happens in a video and then drawing conclusions from it. This matters because current video models often entangle seeing details with logical inference, which can cause errors in applications such as surveillance or content analysis. The ablations also indicate that the model's perception output alone carries enough information for the downstream reasoning step.

Core claim

The central claim is that treating perception and reasoning as distinct processes in a two-stage reinforcement fine-tuning setup improves video understanding and reasoning in video language models. Specifically, a three-step pipeline creates VideoP2R-CoT-162K, a dataset of process-aware chain-of-thought examples, and a process-aware group relative policy optimization (PA-GRPO) algorithm assigns separate rewards to the perception and reasoning phases. The paper reports state-of-the-art results on six of seven video benchmarks, and its ablations indicate that perception outputs contain enough information for downstream reasoning.

What carries the argument

The key machinery is the process-aware group relative policy optimization (PA-GRPO) that provides separate rewards for perception and reasoning, supported by a generated process-aware chain-of-thought dataset from a three-step pipeline.
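To make the reward separation concrete, here is a minimal Python sketch of group-relative advantages computed once per process rather than once per rollout. It is an illustration under stated assumptions, not the authors' implementation: the function names, the `eps` constant, the binary example rewards, and the choice to broadcast each advantage over its own segment's tokens are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style normalization within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Normalize each process's rewards separately over the same group of rollouts."""
    if len(perception_rewards) != len(reasoning_rewards):
        raise ValueError("both reward lists must cover the same rollouts")
    return group_relative(perception_rewards), group_relative(reasoning_rewards)


def token_advantages(n_perception_tokens, n_reasoning_tokens, adv_p, adv_r):
    """Assumption: every token in a segment shares that segment's advantage."""
    return [adv_p] * n_perception_tokens + [adv_r] * n_reasoning_tokens


# Hypothetical group of four rollouts with binary rewards per process.
perception = [1.0, 1.0, 0.0, 0.0]
reasoning = [1.0, 0.0, 1.0, 0.0]

adv_p, adv_r = process_aware_advantages(perception, reasoning)
# A single summed reward, as a plain-GRPO-style baseline for comparison.
adv_single = group_relative([p + r for p, r in zip(perception, reasoning)])

# Rollout 1 perceived correctly but reasoned incorrectly.
print(round(adv_p[1], 3), round(adv_r[1], 3), round(adv_single[1], 3))
```

In this toy group, the rollout with correct perception and incorrect reasoning gets a zero advantage under the summed reward, but opposite-signed advantages on its two segments under the separated scheme. That is the kind of credit assignment the process-aware design appears to aim for.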

Load-bearing premise

That supplying separate rewards for perception and reasoning in PA-GRPO, combined with the generated process-aware CoT data, produces genuine improvements rather than artifacts of reward design or data generation choices.

What would settle it

A direct comparison in which the same model, trained with standard group relative policy optimization and no separate perception and reasoning rewards, matches or exceeds the reported performance would challenge the necessity of the process-aware approach.

Figures

Figures reproduced from arXiv: 2511.11113 by Jayakrishnan Unnikrishnan, Rui Zhao, Toufiq Parag, Yifan Jiang, Yueying Wang, Zhenyu Liao, Zhimin Chen.

**Figure 2.** Illustration of overall VIDEOP2R RFT framework (left) and the three-step CoT generation pipeline (right). Time-R1 uses timestamp-aware and template rewards [56]; Video-R1 and STAR-R1 reward sensitivity to correct temporal order [14, 28]; Videochat-R1 and VersaVid-R1 adopt task-specific rewards [7, 26]; VideoRFT adds a stage-aware semantic reward [52]. However, most prior efforts model video rea…

**Figure 3.** The illustration of the PA-GRPO algorithm.

**Figure 4.** Effect of perception on downstream reasoning

**Figure 5.** Success (Left) and Failure (Right) case of V…

**Figure 6.** Training Dynamics and Think-Answer Mismatch

**Figure 8.** Prompt Template for Observation Sufficiency Veri…

**Figure 7.** Prompt Template for Process-aware CoT Generation.

**Figure 10.** Word length (Left) and Word cloud (Right) Visual…

**Figure 11.** An Annotation Example of the Video QA Sample

**Figure 9.** Embeddings visualization of VIDEOP2R-CoT-162K. We provide annotation examples in Figs. 11 and 12 to illustrate how our annotations explicitly separate perception from reasoning.

**Figure 14.** Prompt Template for Answer Extraction. We present examples of Qwen's outputs under different configurations in our perception examination experiment in…

**Figure 13.** Prompt Template for Perception Examination Ex…

**Figure 15.** RL training Dynamics of VIDEOP2R. … model consistently adheres to the process-aware inference template and maintains stable format compliance. Since the length reward is conditioned on both accuracy and format rewards, we instead visualize the lengths of the perception and reasoning segments during RL. We observe an initial increase followed by a decrease in both segments, indicating that the model adaptive…

**Figure 16.** Example of Think-Answer Mismatch. … consistent with its original design, while VersaVid-R1 has too few available traces on VSI-Bench for meaningful statistics. All results are computed on the multiple-choice subsets of each benchmark. We provide two additional success cases of VIDEOP2R in…

**Figure 18.** Failure Cases of Overly detailed visual configura…

**Figure 19.** Success Cases of VIDEOP2R. <observation>Based on the question, I need to notice the location where the cat stays the longest in the video. Observing the video, the cat is seen initially on a stool, then it jumps down and walks around the room. It briefly interacts with the robot and the orange robot, but most of its time is spent on the stool. The cat does not stay on the carpet, in the cat's nest, or on t…

**Figure 20.** From base Qwen2.5-VL-7B to VIDEOP2R-SFT and VIDEOP2R: a representative example illustrating the stepwise improvement in the model's perception and reasoning.

**Figure 21.** Examples of Perception Examination: Top Left: Qwen with the text question only; Top Right: Qwen with the text…

Original abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning. Our project page is available at https://videop2r.github.io/videop2r/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VideoP2R adds a three-step CoT dataset pipeline and PA-GRPO with separate perception/reasoning rewards, delivering reported SOTA on six of seven benchmarks, but the reward split may not be fully independent.

The main thing here is a practical extension of reinforcement fine-tuning to video-language models that treats perception and reasoning as separate stages. The authors build a 162K process-aware CoT dataset through a three-step generation pipeline and then train with PA-GRPO, which supplies distinct rewards for each stage instead of a single combined signal. This produces the claimed gains on most video reasoning benchmarks and includes ablations that support keeping the stages distinct. They also report that the perception outputs carry enough information for the reasoning step to succeed without extra help. Those pieces are the clearest additions over standard RFT or plain GRPO work on LVLMs.

The empirical results look like a reasonable incremental step for anyone trying to make video models reason more reliably on real footage. The ablations give some evidence that the process-aware choices matter rather than just scaling data or compute. A reader working on multimodal RL or video understanding would find the dataset construction details and the reward separation idea worth looking at for their own setups.

The softer spot is exactly the one the stress test flags. If the perception reward ends up depending on the same video embeddings or CoT traces used for reasoning, the two signals are correlated and the separation does not add much beyond better data quality. The abstract does not spell out the exact reward formulas or whether perception uses independent frame-level annotations, so it is hard to tell how orthogonal the rewards really are. That leaves open the possibility that the SOTA numbers trace more to the dataset than to PA-GRPO itself.

This paper is aimed at people building or fine-tuning video-language models who want concrete recipes for process-aware training. It is coherent enough and grounded in existing RFT literature that it deserves a serious referee to check the implementation details, run controls on the reward independence, and verify the benchmark numbers with full tables and splits.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces VideoP2R, a process-aware reinforcement fine-tuning (RFT) framework for large video language models (LVLMs). It consists of an SFT stage that generates the VideoP2R-CoT-162K process-aware chain-of-thought dataset via a three-step pipeline, and an RL stage that applies a novel process-aware group relative policy optimization (PA-GRPO) algorithm supplying separate rewards for perception and reasoning. The central empirical claim is that this design achieves state-of-the-art performance on six out of seven video reasoning and understanding benchmarks, with ablations confirming the effectiveness of process-aware modeling and that perception outputs are information-sufficient for downstream reasoning.

Significance. If the reported gains hold under rigorous verification, the work would be a meaningful contribution to multimodal video reasoning by explicitly separating perception and reasoning processes in RFT, rather than treating them as a single signal. The release of a 162K-scale process-aware CoT dataset and the PA-GRPO variant could serve as reusable resources for the community. The ablation results on information sufficiency add a useful diagnostic dimension, though the overall significance remains tied to the reproducibility and orthogonality of the claimed improvements.

major comments (2)

[RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.
[Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.
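One simple diagnostic for the overlap concern in the first major comment is to measure how strongly the two reward signals move together across rollouts. The sketch below computes a Pearson correlation between per-rollout perception and reasoning rewards. The reward lists are hypothetical placeholders, not values from the paper, and the helper name `pearson` is chosen here for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / sqrt(vx * vy)


# Hypothetical per-rollout rewards, not values from the paper.
perception_rewards = [1.0, 1.0, 0.0, 0.0, 1.0, 0.0]
reasoning_rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0]

r = pearson(perception_rewards, reasoning_rewards)
print(f"reward correlation: {r:.3f}")
```

A correlation near 1 would suggest the two rewards carry largely the same signal. Some positive correlation is expected even with well-separated rewards, since good perception should help reasoning, so this check is a starting point rather than a test of independence.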

minor comments (1)

[Abstract / Method overview] The abstract and method sections use 'process-aware' repeatedly without a concise definition or diagram early in the paper; a single schematic showing the perception/reasoning split would improve readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The comments help clarify the presentation of the process-aware contributions and strengthen the empirical claims. We address each major comment below and have revised the manuscript accordingly to improve clarity, add missing details, and enhance experimental rigor.

Referee: [RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.

Authors: We agree that the explicit formulation and computation details are essential to substantiate the orthogonality of the process-aware rewards. In the revised manuscript, Section 3.2 now includes the full mathematical definition of PA-GRPO: the joint objective is optimized with separate scalar rewards R_perception (computed from independent frame-level ground-truth annotations generated during the three-step VideoP2R-CoT-162K pipeline) and R_reasoning (derived from final-answer correctness plus step-wise CoT consistency checks that operate on the reasoning trace rather than raw embeddings). We have added Algorithm 1 (pseudocode) in the appendix that shows the group-relative normalization is performed independently per reward type before the combined advantage is used for the policy update. Ablation Table 4 already demonstrates that ablating either reward produces non-overlapping performance drops, supporting that the signals are not redundant. These additions directly address the concern that gains could be explained by dataset quality alone.

Revision: yes

Referee: [Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.

Authors: We acknowledge that the original submission lacked sufficient statistical detail. The revised Experimental Results section (Section 4) now reports mean and standard deviation over three independent random seeds for all main-table entries. We have added paired t-tests (p < 0.05) confirming that the reported gains over the strongest baselines are statistically significant on the six benchmarks where VideoP2R is SOTA. Benchmark splits and evaluation protocols are now explicitly stated to match the official test sets and metrics released by each benchmark (e.g., Video-MME, MVBench, etc.). A new paragraph in the appendix discusses hyperparameter sensitivity and confirms that the process-aware gains remain stable across reasonable ranges of the GRPO hyperparameters. These changes allow readers to assess whether the improvements are attributable to the proposed design rather than tuning artifacts.

Revision: yes
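For readers who want to see what a three-seed paired comparison involves, the following Python sketch computes a paired t statistic and compares it against the standard two-sided critical value at the 0.05 level. It assumes runs are paired by seed. The accuracy values are placeholders invented for illustration and are not numbers from the paper; `paired_t` and `T_CRIT_05` are names introduced here.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided critical values of Student's t at alpha = 0.05, keyed by degrees of freedom.
T_CRIT_05 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def paired_t(method_scores, baseline_scores):
    """Return (t statistic, degrees of freedom) for a paired t-test."""
    if len(method_scores) != len(baseline_scores) or len(method_scores) < 2:
        raise ValueError("need at least two paired scores")
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("t is undefined when all differences are identical")
    return mean(diffs) / (sd / sqrt(len(diffs))), len(diffs) - 1


# Placeholder accuracies for three seeds; these are NOT numbers from the paper.
method = [61.0, 62.0, 63.0]
baseline = [60.0, 60.5, 61.0]

t_stat, dof = paired_t(method, baseline)
significant = abs(t_stat) > T_CRIT_05[dof]
print(f"t = {t_stat:.3f}, dof = {dof}, significant at 0.05: {significant}")
```

With only three seeds there are two degrees of freedom, so the critical value is 4.303. A modest mean gain therefore clears the threshold only when the seed-to-seed differences are very consistent, which is worth keeping in mind when reading a "p < 0.05" claim based on three runs.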

Circularity Check

0 steps flagged

No significant circularity in empirical framework

The paper describes an empirical two-stage RFT pipeline: a three-step data generation process yielding VideoP2R-CoT-162K followed by PA-GRPO that supplies separate perception and reasoning rewards. Central claims rest on benchmark accuracy gains and ablation studies rather than any closed mathematical derivation. No equations are presented that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation or prior ansatz by the same authors. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The framework rests on standard assumptions from LLM reinforcement fine-tuning literature plus the new empirical claim that separate perception/reasoning rewards improve outcomes; no explicit free parameters, axioms, or invented entities are introduced beyond the named algorithm and dataset.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "PA-GRPO supplies separate rewards for perception and reasoning... $R_{\mathrm{acc},P} = \mathbb{1}(\text{judged sufficient})$; $R_{\mathrm{acc},R} = \mathrm{Acc}_t(o_{i,R}, y_{\mathrm{true}})$"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "models perception and reasoning as distinct processes"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
cs.CV 2026-04 unverdicted novelty 5.0

OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 1 Pith paper · 17 internal anchors

[1] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[2] Kian Ahrabian, Zhivar Sourati, Kexuan Sun, Jiarui Zhang, Yifan Jiang, Fred Morstatter, and Jay Pujara. The curious case of nonverbal abstract reasoning with multi-modal large language models. arXiv preprint arXiv:2401.12117, 2024.

[3] Anthropic. Claude 3, 2024.

[4] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

[5] Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223, 2025.

[6] Mustafa Chasmai, Gauri Jagatap, Gouthaman KV, Grant Van Horn, Subhransu Maji, and Andrea Fanelli. Moment sampling in video llms for long-form video qa. arXiv preprint arXiv:2507.00033, 2025.

[7] Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. VersaVid-R1: A versatile video understanding and reasoning model from question answering to captioning tasks. arXiv preprint arXiv:2506.09079, 2025.

[8] Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. VideoLLaMA 2: Advancing spatial-temporal modeling and audio understanding in video-LLMs. arXiv preprint arXiv:2406.07476, 2024.

[9] Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. In Forty-first International Conference on Machine Learning, 2024.

[10] Paul F Christiano, Jan Leike, Tom Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. Advances in neural information processing systems, 30, 2017.

[11] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv e-prints, pages arXiv–2407, 2024.

[12] Yann Dubois, Balázs Galambosi, Percy Liang, and Tatsunori B Hashimoto. Length-controlled AlpacaEval: A simple way to debias automatic evaluators. arXiv preprint arXiv:2404.04475, 2024.

[13] Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong Li Lee, and Wynne Hsu. Video-of-thought: step-by-step video reasoning from perception to cognition. In Proceedings of the 41st International Conference on Machine Learning, pages 13109–13125, 2024.

[14] Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-R1: Reinforcing video reasoning in MLLMs. arXiv preprint arXiv:2503.21776, 2025.

[15] Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

[16] Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xuehao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on LLM-as-a-judge. arXiv preprint arXiv:2411.15594, 2024.

[17] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

[18] Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26181–26191,

[19] Wei Han, Hui Chen, Min-Yen Kan, and Soujanya Poria. Self-adaptive sampling for accurate video question answering on image text models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 2522–2534, 2024.

[20] Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-MMMU: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826,

[21] Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv preprint arXiv:2412.16720, 2024.

[22] Yifan Jiang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara, et al. Marvel: Multidimensional abstraction and reasoning through visual evaluation and learning. Advances in Neural Information Processing Systems, 37:46567–46592, 2024.

[23] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023.

[24] Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

[25] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024.

[26] Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

[27] Yiming Li, Xiaoshan Yang, Bing-Kun Bao, and Changsheng Xu. Graph prompts: Adapting video graph for video question answering. 2025.

[28] Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation reasoning by reinforcing multimodal llms. arXiv preprint arXiv:2505.15804, 2025.

[29] Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision-language model via reasoning decomposition. arXiv preprint arXiv:2508.19652, 2025.

[30] Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let's verify step by step. In The Twelfth International Conference on Learning Representations, 2023.

[31]

Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023. 6

work page 2023
[32] Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? In Findings of the Association for Computational Linguistics ACL 2024, pages 8731–8772, 2024.
[33] Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning. arXiv preprint arXiv:2503.13444,
[34] Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning. arXiv preprint arXiv:2401.08967,
[35] Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426, 2018.
[36] Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, et al. Videocap-r1: Enhancing mllms for video captioning via structured thinking. arXiv preprint arXiv:2506.01725, 2025.
[37] Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular reasoning models for video question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245, 2024.
[38] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
[39] Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyunwoo J Kim. Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo. arXiv preprint arXiv:2506.07464, 2025.
[40] Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning. arXiv preprint arXiv:2504.07956, 2025.
[41] Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. Step: Enhancing video-llms’ compositional reasoning by spatio-temporal graph-guided self-training. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3284–3294, 2025.
[42] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
[43] Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020.
[44] Si Shen, Peijun Shen, Wenhua Zhao, and Danhao Zhu. Mitigating think-answer mismatch in llm reasoning through noise-aware advantage reweighting. arXiv preprint arXiv:2508.05928, 2025.
[45] Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. Enhancing video-llm reasoning via agent-of-thoughts distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8523–8533, 2025.
[46] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.
[47] Zihan Song, Xin Wang, Zi Qian, Hong Chen, Longtao Huang, Hui Xue, and Wenwu Zhu. Modularized self-reflected video reasoner for multimodal llm with application to video question answering. In Forty-second International Conference on Machine Learning.
[48] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007.
[49] Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[50] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.
[51] Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025.
[52] Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025.
[53] Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76. Springer,
[54] Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Uncertainty in artificial intelligence, pages 113–122. PMLR, 2020.
[55] Yueqian Wang, Yuxuan Wang, Kai Chen, and Dongyan Zhao. Stair: spatial-temporal reasoning with auditable intermediate results for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19215–19223, 2024.
[56] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.
[57] Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497, 2022.
[58] Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025.
[59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
[60] Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning. arXiv preprint arXiv:2505.14677, 2025.
[61] Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer, 2022.
[62] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.
[63] En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms. arXiv preprint arXiv:2502.12081, 2025.
[64] En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954, 2025.
[65] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.
[66] Yuanyuan Yuan, Shuai Wang, Mingyue Jiang, and Tsong Yueh Chen. Perception matters: Detecting perception failures of vqa models using metamorphic testing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16908–16917, 2021.
[67] Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. arXiv preprint arXiv:2505.11821, 2025.
[68] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. arXiv preprint arXiv:2312.17235, 2023.
[69] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.
[70] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity. arXiv preprint arXiv:2507.21848, 2025.
[71] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025.
[72] Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025.
Details of Process-Aware CoT Generation and Data Analysis

7.1. Prompt Used

Figure 7 illustrates the prompt template for process-aware CoT generation. We employ Qwen2.5-VL-72B-Instruct with a temperature of 0 for the generation.

Prompt Template for Process-aware CoT Generation

{Question} You are required to answer the question using the visual content pr...
Carefully read the question and the correct answer.

Briefly explain whether (and how) the observations support the correct answer.

Finally output your judgement, either <judgement>Yes</judgement> or <judgement>No</judgement>.

Figure 8. Prompt Template for Observation Sufficiency Verification. We use the same prompt for perception correctness judgment in the RL stage.

when applicable (e.g., for multiple-choice questions). In the subsequent CoT Verification stage, task-specific accuracy ...
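The judge prompt in Figure 8 ends by asking for a verdict wrapped in `<judgement>` tags. A minimal sketch of how such a reply could be turned into a binary signal is given below. The helper name `parse_judgement`, the case-insensitive matching, the use of the last tag when several appear, and the `None` result for malformed replies are all assumptions made for illustration, not details taken from the paper.

```python
import re
from typing import Optional

# Matches the verdict format requested by the prompt:
# <judgement>Yes</judgement> or <judgement>No</judgement>.
_JUDGEMENT_RE = re.compile(r"<judgement>\s*(yes|no)\s*</judgement>", re.IGNORECASE)


def parse_judgement(reply: str) -> Optional[bool]:
    """Return True for Yes, False for No, None if no well-formed tag is found.

    Hypothetical helper: when several tags appear, the last one is used,
    assuming the final verdict comes at the end of the reply.
    """
    matches = _JUDGEMENT_RE.findall(reply)
    if not matches:
        return None
    return matches[-1].lower() == "yes"
```

For example, `parse_judgement("The observations mention the red car. <judgement>Yes</judgement>")` evaluates to `True`, while a reply with no tag yields `None`, which a caller could treat as an invalid judgment.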
The calculation seems correct.\n</think> <answer>55</answer>

Figure 12. Annotation Example of the Image QA Sample
Experiment Setup

8.1. Implementation Details

The whole two-stage training is conducted on 8× NVIDIA A800 GPUs. For efficiency, we limit the video input to 16 frames at a resolution of 128 × 28 × 28 during training, where 28×28 denotes the patch size and

Table 3. Distribution of question types across VIDEOP2R-CoT-162K. Question Type SumMultiple Choice...
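To make the training input budget quoted in the implementation details concrete, the arithmetic can be sketched as follows. The variable names are illustrative, and reading 128 × 28 × 28 as a per-frame budget of 128 patches of 28×28 pixels is an interpretation of that sentence rather than something the text spells out.

```python
# Values quoted in the implementation details.
NUM_FRAMES = 16          # frames per video during training
PATCHES_PER_FRAME = 128  # assumed reading of the leading factor in 128 x 28 x 28
PATCH_SIDE = 28          # 28 x 28 pixel patch

# Pixel budget for a single frame and for the whole 16-frame clip.
pixels_per_frame = PATCHES_PER_FRAME * PATCH_SIDE * PATCH_SIDE
pixels_per_video = NUM_FRAMES * pixels_per_frame

print(pixels_per_frame)  # 100352
print(pixels_per_video)  # 1605632
```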
Ablation Study on Judge Model

Table 4 presents the results of using different judge models for perception correctness judgement. We conduct the same two-stage training process, changing only the judge that provides the perception correctness judgment from Claude3.7 to models from the Llama3.1 [11] family. Compared with the base model, all VIDEOP2R variants using different j...
Details of the Perception Examination

10.1. Prompt Used and Detailed Setup

The perception examination experiment involves three types of experiments on either text or video domains. We compare the zero-shot performance of Qwen2.5-VL-7B across different input settings and examine how perception segments influence its answers: (i) performance on text-on...


[27] [27]

Graph prompts: Adapting video graph for video question answering

Yiming Li, Xiaoshan Yang, Bing-Kun Bao, and Chang- sheng Xu. Graph prompts: Adapting video graph for video question answering. 2025. 2

work page 2025

[28] [28]

Star-r1: Spatial transformation reasoning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025

Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation rea- soning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025. 3

work page arXiv 2025

[29] [29]

Self-Rewarding Vision-Language Model via Reasoning Decomposition

Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision- language model via reasoning decomposition.arXiv preprint arXiv:2508.19652, 2025. 3

work page internal anchor Pith review Pith/arXiv arXiv 2025

[30] [30]

Let’s verify step by step

Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri- son Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023. 4

work page 2023

[31] [31]

Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023. 6

work page 2023

[32] [32]

Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Lin- guistics ACL 2024, pages 8731–8772, 2024

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Lin- guistics ACL 2024, pages 8731–8772, 2024. 1, 2, 5, 3

work page 2024

[33] [33]

Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,

Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,

work page arXiv

[34] [34]

Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H

Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967,

work page arXiv

[35] [35]

UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018. 2

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

Videocap-r1: Enhancing mllms for video captioning via structured thinking.arXiv preprint arXiv:2506.01725, 2025

Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, et al. Videocap-r1: Enhancing mllms for video captioning via structured thinking.arXiv preprint arXiv:2506.01725, 2025. 2 10

work page arXiv 2025

[37] [37]

Morevqa: Exploring modular rea- soning models for video question answering

Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular rea- soning models for video question answering. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245, 2024. 2

work page 2024

[38] [38]

Training language models to follow instructions with hu- man feedback.Advances in neural information process- ing systems, 35:27730–27744, 2022

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with hu- man feedback.Advances in neural information process- ing systems, 35:27730–27744, 2022. 3

work page 2022

[39] [39]

Deepvideo-r1: Video reinforcement fine-tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025

Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2

work page arXiv 2025

[40] [40]

Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.arXiv preprint arXiv:2504.07956, 2025

Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.arXiv preprint arXiv:2504.07956, 2025. 2, 5, 3

work page arXiv 2025

[41] [41]

Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training

Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3284– 3294, 2025. 2

work page 2025

[42] [42]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms.arXiv preprint arXiv:1707.06347, 2017. 4, 7

work page internal anchor Pith review Pith/arXiv arXiv 2017

[43] [43]

Squinting at vqa models: Introspecting vqa models with sub-questions

Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020. 1, 2, 4

work page 2020

[44] Si Shen, Peijun Shen, Wenhua Zhao, and Danhao Zhu. Mitigating think-answer mismatch in llm reasoning through noise-aware advantage reweighting. arXiv preprint arXiv:2508.05928, 2025.

[45] Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. Enhancing video-llm reasoning via agent-of-thoughts distillation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8523–8533, 2025.

[46] Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming. Advances in Neural Information Processing Systems, 35:9460–9471, 2022.

[47] Zihan Song, Xin Wang, Zi Qian, Hong Chen, Longtao Huang, Hui Xue, and Wenwu Zhu. Modularized self-reflected video reasoner for multimodal llm with application to video question answering. In Forty-second International Conference on Machine Learning.

[48] Elizabeth S Spelke and Katherine D Kinzler. Core knowledge. Developmental science, 10(1):89–96, 2007.

[49] Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large language models: A survey. IEEE Transactions on Circuits and Systems for Video Technology, 2025.

[50] Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? exploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024.

[51] Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learning. arXiv preprint arXiv:2506.01713, 2025.

[52] Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025.

[53] Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understanding with large language model as agent. In European Conference on Computer Vision, pages 58–76. Springer,

[54] Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. In Uncertainty in artificial intelligence, pages 113–122. PMLR, 2020.

[55] Yueqian Wang, Yuxuan Wang, Kai Chen, and Dongyan Zhao. Stair: spatial-temporal reasoning with auditable intermediate results for video question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 19215–19223, 2024.

[56] Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025.

[57] Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. Advances in Neural Information Processing Systems, 35:8483–8497, 2022.

[58] Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025.

[59] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.

[60] Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning. arXiv preprint arXiv:2505.14677, 2025.

[61] Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answering. In European Conference on Computer Vision, pages 39–58. Springer, 2022.

[62] Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10632–10643, 2025.

[63] En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms. arXiv preprint arXiv:2502.12081, 2025.

[64] En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering perception policy with reinforcement learning. arXiv preprint arXiv:2504.07954, 2025.

[65] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025.

[66] Yuanyuan Yuan, Shuai Wang, Mingyue Jiang, and Tsong Yueh Chen. Perception matters: Detecting perception failures of vqa models using metamorphic testing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16908–16917, 2021.

[67] Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment. arXiv preprint arXiv:2505.11821, 2025.

[68] Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering. arXiv preprint arXiv:2312.17235, 2023.

[69] Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Haoran Tan, Chunyuan Li, and Ziwei Liu. Long context transfer from language to vision. arXiv preprint arXiv:2406.16852, 2024.

[70] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity. arXiv preprint arXiv:2507.21848, 2025.

[71] Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025.

[72] Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025.

Details of Process-Aware CoT Generation and Data Analysis

7.1. Prompt Used

Figure 7 illustrates the prompt template for process-aware CoT generation. We employ Qwen2.5-VL-72B-Instruct with a temperature of 0 for the generation.

Prompt Template for Process-aware CoT Generation

{Question} You are required to answer the question using the visual content pr...

Carefully read the question and the correct answer.

Briefly explain whether (and how) the observations support the correct answer.

Finally output your judgement, either <judgement>Yes</judgement> or <judgement>No</judgement>.

Figure 8. Prompt Template for Observation Sufficiency Verification. We use the same prompt for perception correctness judgment in the RL stage.

when applicable (e.g., for multiple-choice questions). In the subsequent CoT Verification stage, task-specific accuracy ...
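The verdict format requested by the Figure 8 prompt lends itself to simple tag parsing. The following is a minimal sketch, not the authors' implementation: the function names `parse_judgement` and `perception_reward`, the case-insensitive matching, and the mapping of the verdict to a 0/1 score are all assumptions made for illustration.

```python
import re
from typing import Optional

# Matches the verdict tag requested by the prompt. Case-insensitive and
# whitespace-tolerant matching is an assumption, not something the paper states.
_JUDGEMENT_RE = re.compile(r"<judgement>\s*(yes|no)\s*</judgement>", re.IGNORECASE)


def parse_judgement(judge_output: str) -> Optional[bool]:
    """Return True for Yes, False for No, None if no well-formed tag is found."""
    verdicts = _JUDGEMENT_RE.findall(judge_output)
    if not verdicts:
        return None
    # The prompt asks for the judgement last, so use the final tag if several appear.
    return verdicts[-1].lower() == "yes"


def perception_reward(judge_output: str) -> float:
    """Hypothetical scalar mapping: 1.0 for Yes, 0.0 for No or unparseable output."""
    return 1.0 if parse_judgement(judge_output) is True else 0.0
```

Treating unparseable judge output as 0.0 is a conservative choice in this sketch; a real pipeline might instead retry the judge or drop the sample.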

The calculation seems correct.\n</think> <answer>55</answer>

Figure 12. Annotation Example of the Image QA Sample.

Experiment Setup

8.1. Implementation Details

The whole two-stage training is conducted on 8× NVIDIA A800 GPUs. For efficiency, we limit the video input to 16 frames at a resolution of 128 × 28 × 28 during training, where 28×28 denotes the patch size and

Table 3. Distribution of question types across VIDEOP2R-CoT-162K. Question Type Sum Multiple Choice...
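The training resolution of 128 × 28 × 28 can be read as a per-frame pixel budget. The arithmetic below is a sketch under one assumption that the truncated text does not confirm: that 128 is the number of 28×28 patches allowed per frame. The constant and function names are hypothetical.

```python
PATCH_SIDE = 28          # from the text: 28x28 is the patch size
PATCHES_PER_FRAME = 128  # assumption: 128 counts the patches budgeted per frame
NUM_FRAMES = 16          # from the text: 16 frames during training


def pixel_budget(patches: int = PATCHES_PER_FRAME, patch_side: int = PATCH_SIDE) -> int:
    """Pixels available per frame under the assumed patch-count interpretation."""
    return patches * patch_side * patch_side


per_frame = pixel_budget()          # 128 * 28 * 28 = 100352 pixels
per_clip = per_frame * NUM_FRAMES   # 16 frames -> 1605632 pixels
print(per_frame, per_clip)
```

Under this reading each frame is capped at roughly 0.1 megapixels, which is consistent with the stated goal of limiting input size for efficiency.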

Ablation Study on Judge Model

Table 4 presents the results of using different judge models for perception correctness judgement. We conduct the same two-stage training process, changing only the judge that provides the perception correctness judgment from Claude3.7 to the Llama3.1 [11] family. Compared with the base model, all VIDEOP2R variants using different j...

Details of the Perception Examination

10.1. Prompt Used and Detailed Setup

The perception examination experiment involves three types of experiments on either text or video domains. We compare the zero-shot performance of Qwen2.5-VL-7B across different input settings and examine how perception segments influence its answers: (i) performance on text-on...