VIDEOP2R: Video Understanding from Perception to Reasoning
Pith reviewed 2026-05-17 22:21 UTC · model grok-4.3
The pith
VideoP2R shows that separating perception and reasoning processes in training large video models leads to superior performance on reasoning benchmarks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that by treating perception and reasoning as distinct processes in a two-stage reinforcement fine-tuning setup, video language models can achieve better understanding and reasoning. Specifically, a three-step pipeline creates a 162K dataset of process-aware chain-of-thought examples, and a process-aware group relative policy optimization algorithm assigns separate rewards to perception and reasoning phases. This yields state-of-the-art results on six of seven video benchmarks and confirms that perception outputs contain enough information for downstream reasoning.
What carries the argument
The key machinery is the process-aware group relative policy optimization (PA-GRPO) that provides separate rewards for perception and reasoning, supported by a generated process-aware chain-of-thought dataset from a three-step pipeline.
Load-bearing premise
That supplying separate rewards for perception and reasoning in PA-GRPO, combined with the generated process-aware CoT data, produces genuine improvements rather than artifacts of reward design or data generation choices.
What would settle it
A direct comparison where the same model is trained with standard group relative policy optimization without separate perception and reasoning rewards, and it matches or exceeds the reported performance, would challenge the necessity of the process-aware approach.
Figures
read the original abstract
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL) has shown promising results on improving reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that model's perception output is information-sufficient for downstream reasoning. Our project page is available at https://videop2r.github.io/videop2r/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces VideoP2R, a process-aware reinforcement fine-tuning (RFT) framework for large video language models (LVLMs). It consists of an SFT stage that generates the VideoP2R-CoT-162K process-aware chain-of-thought dataset via a three-step pipeline, and an RL stage that applies a novel process-aware group relative policy optimization (PA-GRPO) algorithm supplying separate rewards for perception and reasoning. The central empirical claim is that this design achieves state-of-the-art performance on six out of seven video reasoning and understanding benchmarks, with ablations confirming the effectiveness of process-aware modeling and that perception outputs are information-sufficient for downstream reasoning.
Significance. If the reported gains hold under rigorous verification, the work would be a meaningful contribution to multimodal video reasoning by explicitly separating perception and reasoning processes in RFT, rather than treating them as a single signal. The release of a 162K-scale process-aware CoT dataset and the PA-GRPO variant could serve as reusable resources for the community. The ablation results on information sufficiency add a useful diagnostic dimension, though the overall significance remains tied to the reproducibility and orthogonality of the claimed improvements.
major comments (2)
- [RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.
- [Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.
minor comments (1)
- [Abstract / Method overview] The abstract and method sections use 'process-aware' repeatedly without a concise definition or diagram early in the paper; a single schematic showing the perception/reasoning split would improve readability.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The comments help clarify the presentation of the process-aware contributions and strengthen the empirical claims. We address each major comment below and have revised the manuscript accordingly to improve clarity, add missing details, and enhance experimental rigor.
read point-by-point responses
-
Referee: [RL stage / PA-GRPO] RL stage / PA-GRPO description: The central claim that separate perception and reasoning rewards in PA-GRPO drive distinct, non-correlated improvements is load-bearing for the process-aware contribution. The manuscript states that PA-GRPO 'supplies separate rewards' but provides no explicit formulation, pseudocode, or computation details (e.g., whether the perception reward uses independent frame-level ground-truth annotations, model-generated consistency checks, or quantities derived from the same video embeddings/CoT traces used for reasoning). If overlap exists, the separation collapses and SotA gains may be attributable to dataset quality or standard GRPO rather than the proposed distinction. This directly tests the orthogonality assumption.
Authors: We agree that the explicit formulation and computation details are essential to substantiate the orthogonality of the process-aware rewards. In the revised manuscript, Section 3.2 now includes the full mathematical definition of PA-GRPO: the joint objective is optimized with separate scalar rewards R_perception (computed from independent frame-level ground-truth annotations generated during the three-step VideoP2R-CoT-162K pipeline) and R_reasoning (derived from final-answer correctness plus step-wise CoT consistency checks that operate on the reasoning trace rather than raw embeddings). We have added Algorithm 1 (pseudocode) in the appendix that shows the group-relative normalization is performed independently per reward type before the combined advantage is used for the policy update. Ablation Table 4 already demonstrates that ablating either reward produces non-overlapping performance drops, supporting that the signals are not redundant. These additions directly address the concern that gains could be explained by dataset quality alone. revision: yes
-
Referee: [Experimental results] Experimental results section: The claim of SotA on six out of seven benchmarks is presented without reported error bars, statistical significance tests, or explicit confirmation of benchmark splits and evaluation protocols. Given that the improvements rest on both the new dataset and the modified RL algorithm, the absence of these details makes it difficult to rule out that gains are artifacts of data generation choices or hyperparameter tuning rather than the process-aware design.
Authors: We acknowledge that the original submission lacked sufficient statistical detail. The revised Experimental Results section (Section 4) now reports mean and standard deviation over three independent random seeds for all main-table entries. We have added paired t-tests (p < 0.05) confirming that the reported gains over the strongest baselines are statistically significant on the six benchmarks where VideoP2R is SOTA. Benchmark splits and evaluation protocols are now explicitly stated to match the official test sets and metrics released by each benchmark (e.g., Video-MME, MVBench, etc.). A new paragraph in the appendix discusses hyperparameter sensitivity and confirms that the process-aware gains remain stable across reasonable ranges of the GRPO hyperparameters. These changes allow readers to assess whether the improvements are attributable to the proposed design rather than tuning artifacts. revision: yes
Circularity Check
No significant circularity in empirical framework
full rationale
The paper describes an empirical two-stage RFT pipeline: a three-step data generation process yielding VideoP2R-CoT-162K followed by PA-GRPO that supplies separate perception and reasoning rewards. Central claims rest on benchmark accuracy gains and ablation studies rather than any closed mathematical derivation. No equations are presented that define a quantity in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing premise reduces to a self-citation or prior ansatz by the same authors. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
PA-GRPO supplies separate rewards for perception and reasoning... R_acc,P = 1(judged sufficient); R_acc,R = Acc_t(oi,R, y_true)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
models perception and reasoning as distinct processes
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering
OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoni...
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[2]
Kian Ahrabian, Zhivar Sourati, Kexuan Sun, Jiarui Zhang, Yifan Jiang, Fred Morstatter, and Jay Pujara. The curious case of nonverbal abstract reasoning with multi-modal large language models.arXiv preprint arXiv:2401.12117, 2024. 1, 2
- [3]
-
[4]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wen- bin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[5]
Reasoning language models: A blueprint
Maciej Besta, Julia Barth, Eric Schreiber, Ales Kubicek, Afonso Catarino, Robert Gerstenberger, Piotr Nyczyk, Patrick Iff, Yueling Li, Sam Houliston, et al. Rea- soning language models: A blueprint.arXiv preprint arXiv:2501.11223, 2025. 2
-
[6]
Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025
Mustafa Chasmai, Gauri Jagatap, Gouthaman KV , Grant Van Horn, Subhransu Maji, and Andrea Fanelli. Moment sampling in video llms for long-form video qa.arXiv preprint arXiv:2507.00033, 2025. 6
-
[7]
Xinlong Chen, Yuanxing Zhang, Yushuo Guan, Bohan Zeng, Yang Shi, Sihan Yang, Pengfei Wan, Qiang Liu, Liang Wang, and Tieniu Tan. Versavid-r1: A ver- satile video understanding and reasoning model from question answering to captioning tasks.arXiv preprint arXiv:2506.09079, 2025. 1, 2, 3, 5
-
[8]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advanc- ing spatial-temporal modeling and audio understanding in video-llms.arXiv preprint arXiv:2406.07476, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[9]
Chatbot arena: An open platform for evaluating llms by human preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anas- tasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Banghua Zhu, Hao Zhang, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. InForty-first In- ternational Conference on Machine Learning, 2024. 4
work page 2024
-
[10]
Paul F Christiano, Jan Leike, Tom Brown, Miljan Mar- tic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences.Advances in neural information processing systems, 30, 2017. 8
work page 2017
-
[11]
The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407, 2024. 3
work page 2024
-
[12]
Length-Controlled AlpacaEval: A Simple Way to Debias Automatic Evaluators
Yann Dubois, Bal ´azs Galambosi, Percy Liang, and Tat- sunori B Hashimoto. Length-controlled alpacaeval: A simple way to debias automatic evaluators.arXiv preprint arXiv:2404.04475, 2024. 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[13]
Video- 9 of-thought: step-by-step video reasoning from percep- tion to cognition
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong Li Lee, and Wynne Hsu. Video- 9 of-thought: step-by-step video reasoning from percep- tion to cognition. InProceedings of the 41st Interna- tional Conference on Machine Learning, pages 13109– 13125, 2024. 2
work page 2024
-
[14]
Video-R1: Reinforcing Video Reasoning in MLLMs
Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms.arXiv preprint arXiv:2503.21776, 2025. 1, 2, 3, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[15]
Video-mme: The first-ever comprehensive evaluation benchmark of multi- modal llms in video analysis
Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yun- hang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi- modal llms in video analysis. InProceedings of the Com- puter Vision and Pattern Recognition Conference, pages 24108–24118, 2025. 1, 2, 5, 3
work page 2025
-
[16]
Jiawei Gu, Xuhui Jiang, Zhichao Shi, Hexiang Tan, Xue- hao Zhai, Chengjin Xu, Wei Li, Yinghan Shen, Shengjie Ma, Honghao Liu, et al. A survey on llm-as-a-judge. arXiv preprint arXiv:2411.15594, 2024. 2, 4
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[17]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reason- ing capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 1, 4, 7, 8
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
Songhao Han, Wei Huang, Hairong Shi, Le Zhuo, Xiu Su, Shifeng Zhang, Xu Zhou, Xiaojuan Qi, Yue Liao, and Si Liu. Videoespresso: A large-scale chain-of-thought dataset for fine-grained video reasoning via core frame selection. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26181–26191,
-
[19]
Self-adaptive sampling for accurate video question answering on image text models
Wei Han, Hui Chen, Min-Yen Kan, and Soujanya Po- ria. Self-adaptive sampling for accurate video question answering on image text models. InFindings of the As- sociation for Computational Linguistics: NAACL 2024, pages 2522–2534, 2024. 6
work page 2024
-
[20]
Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos
Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos.arXiv preprint arXiv:2501.13826,
work page internal anchor Pith review Pith/arXiv arXiv
-
[21]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richard- son, Ahmed El-Kishky, Aiden Low, Alec Helyar, Alek- sander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card.arXiv preprint arXiv:2412.16720, 2024. 1
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[22]
Yifan Jiang, Kexuan Sun, Zhivar Sourati, Kian Ahrabian, Kaixin Ma, Filip Ilievski, Jay Pujara, et al. Marvel: Mul- tidimensional abstraction and reasoning through visual evaluation and learning.Advances in Neural Information Processing Systems, 37:46567–46592, 2024. 1, 2
work page 2024
-
[23]
Gonza- lez, Hao Zhang, and Ion Stoica
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonza- lez, Hao Zhang, and Ion Stoica. Efficient memory man- agement for large language model serving with pagedat- tention. InProceedings of the ACM SIGOPS 29th Sym- posium on Operating Systems Principles, 2023. 3
work page 2023
-
[24]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[25]
Mvbench: A comprehensive multi-modal video under- standing benchmark
Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video under- standing benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, pages 22195–22206, 2024. 2, 5, 3
work page 2024
-
[26]
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning.arXiv preprint arXiv:2504.06958, 2025. 1, 2, 3, 4, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Graph prompts: Adapting video graph for video question answering
Yiming Li, Xiaoshan Yang, Bing-Kun Bao, and Chang- sheng Xu. Graph prompts: Adapting video graph for video question answering. 2025. 2
work page 2025
-
[28]
Zongzhao Li, Zongyang Ma, Mingze Li, Songyou Li, Yu Rong, Tingyang Xu, Ziqi Zhang, Deli Zhao, and Wenbing Huang. Star-r1: Spatial transformation rea- soning by reinforcing multimodal llms.arXiv preprint arXiv:2505.15804, 2025. 3
-
[29]
Self-Rewarding Vision-Language Model via Reasoning Decomposition
Zongxia Li, Wenhao Yu, Chengsong Huang, Rui Liu, Zhenwen Liang, Fuxiao Liu, Jingxi Che, Dian Yu, Jordan Boyd-Graber, Haitao Mi, et al. Self-rewarding vision- language model via reasoning decomposition.arXiv preprint arXiv:2508.19652, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[30]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harri- son Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. InThe Twelfth International Conference on Learning Representations, 2023. 4
work page 2023
-
[31]
Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural infor- mation processing systems, 36:34892–34916, 2023. 6
work page 2023
-
[32]
Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Lin- guistics ACL 2024, pages 8731–8772, 2024. 1, 2, 5, 3
work page 2024
-
[33]
Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, and Mike Zheng Shou. Videomind: A chain-of-lora agent for long video reasoning.arXiv preprint arXiv:2503.13444,
-
[34]
Q., Zhang, X., Jie, Z., Sun, P., Jin, X., and Li, H
Trung Quoc Luong, Xinbo Zhang, Zhanming Jie, Peng Sun, Xiaoran Jin, and Hang Li. Reft: Reasoning with reinforced fine-tuning.arXiv preprint arXiv:2401.08967,
-
[35]
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction
Leland McInnes, John Healy, and James Melville. Umap: Uniform manifold approximation and pro- jection for dimension reduction.arXiv preprint arXiv:1802.03426, 2018. 2
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Desen Meng, Rui Huang, Zhilin Dai, Xinhao Li, Yifan Xu, Jun Zhang, Zhenpeng Huang, Meng Zhang, Lingshu Zhang, Yi Liu, et al. Videocap-r1: Enhancing mllms for video captioning via structured thinking.arXiv preprint arXiv:2506.01725, 2025. 2 10
-
[37]
Morevqa: Exploring modular rea- soning models for video question answering
Juhong Min, Shyamal Buch, Arsha Nagrani, Minsu Cho, and Cordelia Schmid. Morevqa: Exploring modular rea- soning models for video question answering. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13235–13245, 2024. 2
work page 2024
-
[38]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with hu- man feedback.Advances in neural information process- ing systems, 35:27730–27744, 2022. 3
work page 2022
-
[39]
Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyun- woo J Kim. Deepvideo-r1: Video reinforcement fine- tuning via difficulty-aware regressive grpo.arXiv preprint arXiv:2506.07464, 2025. 2
-
[40]
Yukun Qi, Yiming Zhao, Yu Zeng, Xikun Bao, Wenxuan Huang, Lin Chen, Zehui Chen, Jie Zhao, Zhongang Qi, and Feng Zhao. Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning.arXiv preprint arXiv:2504.07956, 2025. 2, 5, 3
-
[41]
Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training
Haiyi Qiu, Minghe Gao, Long Qian, Kaihang Pan, Qifan Yu, Juncheng Li, Wenjie Wang, Siliang Tang, Yueting Zhuang, and Tat-Seng Chua. Step: Enhancing video- llms’ compositional reasoning by spatio-temporal graph- guided self-training. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 3284– 3294, 2025. 2
work page 2025
-
[42]
Proximal Policy Optimization Algorithms
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimiza- tion algorithms.arXiv preprint arXiv:1707.06347, 2017. 4, 7
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[43]
Squinting at vqa models: Introspecting vqa models with sub-questions
Ramprasaath R Selvaraju, Purva Tendulkar, Devi Parikh, Eric Horvitz, Marco Tulio Ribeiro, Besmira Nushi, and Ece Kamar. Squinting at vqa models: Introspecting vqa models with sub-questions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10003–10011, 2020. 1, 2, 4
work page 2020
-
[44]
Si Shen, Peijun Shen, Wenhua Zhao, and Danhao Zhu. Mitigating think-answer mismatch in llm reason- ing through noise-aware advantage reweighting.arXiv preprint arXiv:2508.05928, 2025. 8
-
[45]
En- hancing video-llm reasoning via agent-of-thoughts distil- lation
Yudi Shi, Shangzhe Di, Qirui Chen, and Weidi Xie. En- hancing video-llm reasoning via agent-of-thoughts distil- lation. InProceedings of the Computer Vision and Pat- tern Recognition Conference, pages 8523–8533, 2025. 2
work page 2025
-
[46]
Joar Skalse, Nikolaus Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward gaming.Advances in Neural Information Processing Systems, 35:9460–9471, 2022. 8
work page 2022
-
[47]
Zihan Song, Xin Wang, Zi Qian, Hong Chen, Longtao Huang, Hui Xue, and Wenwu Zhu. Modularized self- reflected video reasoner for multimodal llm with appli- cation to video question answering. InForty-second In- ternational Conference on Machine Learning. 2
-
[48]
Core knowledge.Developmental science, 10(1):89–96, 2007
Elizabeth S Spelke and Katherine D Kinzler. Core knowledge.Developmental science, 10(1):89–96, 2007. 1, 2
work page 2007
-
[49]
Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, et al. Video understanding with large lan- guage models: A survey.IEEE Transactions on Circuits and Systems for Video Technology, 2025. 1
work page 2025
-
[50]
Eyes wide shut? ex- ploring the visual shortcomings of multimodal llms
Shengbang Tong, Zhuang Liu, Yuexiang Zhai, Yi Ma, Yann LeCun, and Saining Xie. Eyes wide shut? ex- ploring the visual shortcomings of multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9568–9578, 2024. 1, 2
work page 2024
-
[51]
Zhongwei Wan, Zhihao Dou, Che Liu, Yu Zhang, Dongfei Cui, Qinjian Zhao, Hui Shen, Jing Xiong, Yi Xin, Yifan Jiang, et al. Srpo: Enhancing multimodal llm reasoning via reflection-aware reinforcement learn- ing.arXiv preprint arXiv:2506.01713, 2025. 7
-
[52]
Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capabil- ity in mllms via reinforced fine-tuning.arXiv preprint arXiv:2505.12434, 2025. 1, 2, 3, 4, 5
-
[53]
Videoagent: Long-form video understand- ing with large language model as agent
Xiaohan Wang, Yuhui Zhang, Orr Zohar, and Serena Yeung-Levy. Videoagent: Long-form video understand- ing with large language model as agent. InEuropean Conference on Computer Vision, pages 58–76. Springer,
-
[54]
Truly proximal policy optimization
Yuhui Wang, Hao He, and Xiaoyang Tan. Truly proximal policy optimization. InUncertainty in artificial intelli- gence, pages 113–122. PMLR, 2020. 7
work page 2020
-
[55]
Stair: spatial-temporal reasoning with auditable intermediate results for video question answering
Yueqian Wang, Yuxuan Wang, Kai Chen, and Dongyan Zhao. Stair: spatial-temporal reasoning with auditable intermediate results for video question answering. In Proceedings of the AAAI Conference on Artificial Intelli- gence, pages 19215–19223, 2024. 2
work page 2024
-
[56]
Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 1, 3, 5
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[57]
Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language mod- els with image descriptors are strong few-shot video- language learners.Advances in Neural Information Pro- cessing Systems, 35:8483–8497, 2022. 2
work page 2022
-
[58]
Zikang Wang, Boyu Chen, Zhengrong Yue, Yi Wang, Yu Qiao, Limin Wang, and Yali Wang. Videochat-a1: Thinking with long videos by chain-of-shot reasoning. arXiv preprint arXiv:2506.06097, 2025. 2
-
[59]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models.Advances in neural information processing systems, 35:24824–24837, 2022. 1
work page 2022
-
[60]
Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in visual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025. 3
-
[61]
Video graph transformer for video question answer- ing
Junbin Xiao, Pan Zhou, Tat-Seng Chua, and Shuicheng Yan. Video graph transformer for video question answer- ing. InEuropean Conference on Computer Vision, pages 39–58. Springer, 2022. 2 11
work page 2022
-
[62]
Thinking in space: How multimodal large language models see, remember, and recall spaces
Jihan Yang, Shusheng Yang, Anjali W Gupta, Rilyn Han, Li Fei-Fei, and Saining Xie. Thinking in space: How multimodal large language models see, remember, and recall spaces. InProceedings of the Computer Vi- sion and Pattern Recognition Conference, pages 10632– 10643, 2025. 2, 5, 3, 7
work page 2025
-
[63]
Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025
En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zin- ing Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xi- angyu Zhang, Jingyu Wang, et al. Unhackable tempo- ral rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025. 5
-
[64]
arXiv preprint arXiv:2504.07954 , year =
En Yu, Kangheng Lin, Liang Zhao, Jisheng Yin, Yana Wei, Yuang Peng, Haoran Wei, Jianjian Sun, Chunrui Han, Zheng Ge, et al. Perception-r1: Pioneering percep- tion policy with reinforcement learning.arXiv preprint arXiv:2504.07954, 2025. 3
-
[65]
DAPO: An Open-Source LLM Reinforcement Learning System at Scale
Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gao- hong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025. 7
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[66]
Perception matters: Detecting per- ception failures of vqa models using metamorphic test- ing
Yuanyuan Yuan, Shuai Wang, Mingyue Jiang, and Tsong Yueh Chen. Perception matters: Detecting per- ception failures of vqa models using metamorphic test- ing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16908– 16917, 2021. 1, 2, 4
work page 2021
-
[67]
Reinforcing multi-turn reasoning in llm agents via turn-level credit assignment
Siliang Zeng, Quan Wei, William Brown, Oana Frunza, Yuriy Nevmyvaka, and Mingyi Hong. Reinforcing multi- turn reasoning in llm agents via turn-level credit assign- ment.arXiv preprint arXiv:2505.11821, 2025. 5
-
[68]
A simple llm framework for long-range video question-answering
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question- answering.arXiv preprint arXiv:2312.17235, 2023. 2
-
[69]
Long Context Transfer from Language to Vision
Peiyuan Zhang, Kaichen Zhang, Bo Li, Guangtao Zeng, Jingkang Yang, Yuanhan Zhang, Ziyue Wang, Hao- ran Tan, Chunyuan Li, and Ziwei Liu. Long con- text transfer from language to vision.arXiv preprint arXiv:2406.16852, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[70]
Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Edge-grpo: Entropy-driven grpo with guided error correction for advantage diversity.arXiv preprint arXiv:2507.21848, 2025. 7
-
[71]
Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025
Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video rea- soning.arXiv preprint arXiv:2504.09641, 2025. 3
-
[72]
Mmvu: Measuring expert- level multi-discipline video understanding
Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert- level multi-discipline video understanding. InProceed- ings of the Computer Vision and Pattern Recognition Conference, pages 8475–8489, 2025. 2, 5, 3 12 VIDEOP2R: Video Understanding from Perception to Reas...
work page 2025
-
[73]
Prompt Used Figure 7 illustrates the prompt template for process- aware CoT generation
Details of Process-Aware CoT Generation and Data Analysis 7.1. Prompt Used Figure 7 illustrates the prompt template for process- aware CoT generation. We employ Qwen2.5-VL-72B- Instruct with a temperature of 0 for the generation. Prompt Template for Process-aware CoT Generation {Question} You are required to answer the question using the visual content pr...
-
[74]
Carefully read the question and the correct answer
-
[75]
Briefly explain whether (and how) the observations support the correct answer
-
[76]
Finally output your judgement, either <judgement>Yes</judgement> or <judgement>No</judgement>. Figure 8. Prompt Template for Observation Sufficiency Veri- fication. We use the same prompt for perception correctness judgment in RL stage. when applicable (e.g., for multiple-choice questions). In the subsequent CoT Verification stage, task-specific accuracy ...
-
[77]
Annotation Example of the Image QA Sample
The calculation seems correct.\n</think> <answer>55</answer> Figure 12. Annotation Example of the Image QA Sample
-
[78]
Implementation Details The whole two-stage training is conducted on 8× NVIDIA A800 GPUs
Experiment Set up 8.1. Implementation Details The whole two-stage training is conducted on 8× NVIDIA A800 GPUs. For efficiency, we limit the video input to 16 frames at a resolution of 128 × 28 × 28 dur- ing training, where 28×28 denotes the patch size and 2 Table 3. Distribution of question types across VIDEOP2R-CoT-162K. Question Type SumMultiple Choice...
-
[79]
Ablation Study on Judge Model Table 4 presents the results of using different judge mod- els for perception correctness judgement. We conduct the same two-stage training process, but only change the Claude3.7 to Llama3.1 [11] families for providing per- ception correctness judgment. Compared with the base model, all VIDEOP2R vari- 3 ants using different j...
-
[80]
Details of the Perception Examination 10.1. Prompt Used and Detailed Set up The perception examination experiment involves three types of experiments on either text or video domains. We compare the zero-shot performance of Qwen2.5-VL- 7B across different input settings and examine how per- ception segments influence its answers: (i) performance on text-on...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.