pith. machine review for the scientific record.

arxiv: 2605.06094 · v3 · submitted 2026-05-07 · 💻 cs.CV · cs.AI

Recognition: no theorem link

VISD: Enhancing Video Reasoning via Structured Self-Distillation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:31 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video reasoning · self-distillation · VideoLLMs · reinforcement learning · token-level supervision · spatio-temporal grounding · privileged information · judge model

The pith

Structured self-distillation with a video-aware judge improves VideoLLM reasoning accuracy and training efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that VideoLLMs can learn complex video reasoning more effectively when sparse sequence-level rewards are supplemented by dense, structured token-level signals. It introduces a judge model that evaluates reasoning along separate dimensions of answer correctness, logical consistency, and spatio-temporal grounding, then feeds this information to a teacher policy. A direction-magnitude decoupling step keeps the signals compatible with reinforcement learning by letting rollout advantages control update direction while the judge signals control magnitude. If the approach works, training becomes both more accurate on grounding tasks and roughly twice as fast in optimization steps. Readers care because current VideoLLM training struggles with long sequences where credit assignment is difficult.

Core claim

VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, it introduces a direction-magnitude decoupling mechanism where rollout-level advantages computed from rewards determine update direction while structured privileged signals modulate token-level update magnitudes. Curriculum scheduling and EMA-based teacher stabilization further support robust optimization over long video sequences.
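
Of these stabilizers, the EMA teacher is the most standard piece. A minimal sketch of the usual exponential-moving-average update, assuming a PyTorch-style pair of modules; the decay value is illustrative, not taken from the paper:

```python
import torch

@torch.no_grad()
def ema_update(teacher, student, decay=0.999):
    # Exponential moving average of student weights: the teacher used for
    # feedback-conditioned replay drifts slowly behind the student, which
    # Figure 3(b) credits with the most stable optimization and best final
    # reward. The decay value here is illustrative.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```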

What carries the argument

The direction-magnitude decoupling mechanism, which lets rollout advantages set the direction of policy updates while multi-dimensional privileged signals from the video-aware judge set the magnitude of token-level updates.
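
The available text gives no equations, so the following is only a minimal sketch of what such a decoupled token-level objective could look like, assuming GRPO-style group-normalized advantages and a per-token magnitude signal distilled from the judge. The names `decoupled_policy_loss` and `judge_magnitude` are illustrative, not the paper's.

```python
import torch

def decoupled_policy_loss(logp_tokens, rewards, judge_magnitude):
    """
    logp_tokens:     (G, T) log-probs of each token across G rollouts of length T
    rewards:         (G,)   sequence-level verifiable rewards R(x, y)
    judge_magnitude: (G, T) per-token weights distilled from structured
                            judge feedback (the privileged information)
    """
    # Direction: sign and scale come only from the rollout-level advantage,
    # computed group-relative as in GRPO.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-6)   # (G,)
    direction = adv.unsqueeze(1)                                # (G, 1)

    # Magnitude: judge signals rescale each token's update but are detached,
    # so they can never flip the sign set by the verifiable reward.
    magnitude = judge_magnitude.detach().clamp(0.0, 1.0)        # (G, T)

    # Token-level policy-gradient loss.
    return -(direction * magnitude * logp_tokens).mean()
```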

If this is right

  • VISD consistently outperforms strong baselines on diverse benchmarks in answer accuracy.
  • VISD improves the quality of spatio-temporal grounding in generated reasoning trajectories.
  • VISD reaches these performance gains with nearly 2x faster convergence measured in optimization steps.
  • The framework produces more semantically aligned and fine-grained credit assignment than either RL or unstructured self-distillation alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same judge-based decomposition of reasoning quality could be tested on non-video sequential tasks such as long-document reasoning to check whether the efficiency gains generalize.
  • Curriculum scheduling paired with the decoupling mechanism may allow training on progressively longer video clips without proportional increases in compute.
  • The separation of direction and magnitude offers a template for other hybrid RL-plus-distillation pipelines where dense signals risk destabilizing verifiable reward training.

Load-bearing premise

The video-aware judge model produces diagnostically meaningful, unbiased privileged information that can be safely used for token-level supervision without introducing new failure modes or reward hacking.
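
For concreteness, one hypothetical shape that privileged information could take, with fields following the judge dimensions described in this review (answer diagnosis, reasoning-versus-answer consistency, spatio-temporal grounding, and a high-level error cause); the schema and field names are illustrative, not the paper's actual format:

```python
# Hypothetical schema for the judge's structured feedback on one student
# rollout. Dimensions follow the paper's description; field names and the
# example values are illustrative only.
from dataclasses import dataclass

@dataclass
class JudgeFeedback:
    answer_diagnosis: str  # fully / partly correct / wrong, plus the problematic part
    consistency: str       # does the reasoning actually support the final answer?
    grounding: str         # cited events, time ranges, or regions vs. ground truth
    error_cause: str       # single most likely high-level cause when not fully correct

fb = JudgeFeedback(
    answer_diagnosis="partly correct: object identified, time span wrong",
    consistency="reasoning points to an earlier event than the answer states",
    grounding="cited segment 12.4s-18.0s; ground truth is 19.7s-24.1s",
    error_cause="focused on the wrong time span",
)
```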

What would settle it

An ablation in which the judge's structured multi-dimensional feedback is replaced with unstructured or random signals would settle the question: if the reported gains in accuracy and convergence speed disappear, the diagnostic decomposition is doing the work; if they survive, it is not.
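
One way to run that control, phrased against the illustrative sketch above: substitute random, shuffled, or uniform magnitudes for the judge-derived ones and check whether the accuracy and convergence curves fall back to the RL-only baseline. Names are again illustrative, not the paper's.

```python
import torch

def ablate_magnitude(judge_magnitude, mode="random"):
    # Control conditions for the settling experiment: if gains survive with
    # random or permuted magnitudes, the structured decomposition is not
    # doing the work. Follows the illustrative names used above.
    if mode == "random":
        return torch.rand_like(judge_magnitude)        # uninformative signal
    if mode == "shuffled":
        idx = torch.randperm(judge_magnitude.numel())
        return judge_magnitude.flatten()[idx].view_as(judge_magnitude)
    if mode == "uniform":
        return torch.ones_like(judge_magnitude)        # reduces to plain RL
    raise ValueError(mode)
```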

Figures

Figures reproduced from arXiv: 2605.06094 by Hao Lin, Hongbo Jin, Jiayu Ding, Jingqi Tian, Kunyang Lv, Qiaoman Zhang, Xu Jiang, Zhongjing Du.

Figure 1. Comparison between the RLVR method and VISD. (a) The RLVR method, e.g., GRPO, provides sparse sequence-level signals and fails to capture fine-grained spatial and temporal evidence. (b) VISD uses dense token-level updates, thereby improving convergence speed and overall performance. To effectively integrate structured supervision with reinforcement learning, we extend the direction–magnitude decoupling paradi… view at source ↗
Figure 2. Overview of VISD. VISD first uses a video-aware judge to generate structured privileged feedback for each student rollout, then replays the same completion with a feedback-conditioned teacher. The rollout-level reward advantage determines the update direction, while the teacher-student discrepancy modulates token-level update magnitudes for fine-grained credit assignment. R(x, y), which reflects task-level… view at source ↗
Figure 3. Training ablation curves. (a) Feedback conditioning compares VISD with and without judge feedback, showing that structured feedback leads to a stronger final reward trajectory. (b) Teacher update strategy compares current-policy, Sync-10, and EMA teachers, where EMA provides the most stable optimization and best final reward. (c) Top-K support versus sampled token compares two teacher-student credit formul… view at source ↗
Figure 4. Top-K support versus sampled-token training curves. We compare (a) total reward, (b) group mean reward, (c) answer accuracy, (d) temporal IoU, (e) spatial IoU, (f) temporal point, (g) temporal segment, (h) spatial grounding, and (i) format reward. view at source ↗
Figure 5. Component-level reward curves for feedback ablation. Each panel compares VISD with and without judge feedback for one reward component: (a) total reward, (b) group mean reward, (c) answer accuracy, (d) temporal IoU, (e) spatial IoU, (f) temporal point, (g) temporal segment, (h) spatial grounding, and (i) format. Consistent with… view at source ↗
Figure 6. Teacher update stability diagnostics. We show (a) gradient norm and (b) response entropy for different teacher parameterizations. view at source ↗
Figure 7. Trajectory-dependent judge feedback. For the same video question, the judge provides different evaluations and diagnoses for different student rollouts. The feedback is used as privileged information for teacher replay rather than as a replacement for the reinforcement reward. view at source ↗
Figure 8. Answer-only versus feedback-conditioned token credit. (a) and (b) visualize the same fixed student rollout; only the teacher context changes. Warmer and cooler token backgrounds indicate positive or negative teacher-student token evidence used to modulate policy-gradient magnitude. In the answer-only replay, the teacher is conditioned on verified answer-side information, so the token-credit pattern mainly… view at source ↗
Figure 9. Visualization. For spatial relation reasoning, VISD accurately localizes the queried child and identifies the object positioned above him, providing precise visual evidence while avoiding confusion with nearby objects. In contrast, related video reasoning models either give incorrect answers or rely on incomplete spatial grounding. view at source ↗
Figure 10. Visualization. For temporal action reasoning, VISD grounds the panda and bucket across relevant frames and correctly infers that the panda is putting itself in the bucket. Competing models miss this action transition and produce incorrect answers. view at source ↗
Figure 11. Visualization. For outdoor spatial interaction reasoning, VISD accurately localizes both the baby and the baby walker, identifies the baby as the one leaning on the walker, and provides precise visual evidence grounded in the relevant temporal window. Other models instead confuse the person with the baby and produce an incorrect answer. view at source ↗
Figure 12. Visualization. For temporal localization reasoning, VISD accurately grounds the person’s gaze direction across frames and identifies the correct time window when the person looks out of the window. Other models focus on earlier or incomplete head movements and produce incorrect intervals, whereas VISD captures the sustained window-looking action and matches the ground-truth period. view at source ↗
Figure 13. Visualization. In object disappearance reasoning, VISD explicitly grounds the apple before and after the camera movement, correctly identifying that the apple has been removed from the table. Although related models partially recognize the same visual change in their reasoning process, they still produce an inconsistent final answer, highlighting VISD’s stronger alignment between visual grounding, reasoni… view at source ↗
Figure 14. Visualization. VISD not only answers correctly but also localizes the action within the correct temporal window, identifying the moment when the orange kitten kicks the white kitten. Although related video reasoning models produce the correct final answer, they rely on inaccurate temporal evidence. view at source ↗
Figure 15. Visualization. For temporal action localization reasoning, VISD accurately grounds the key transition from the lit room to the dark room and identifies the correct time interval when the person gets up to turn off the light. In contrast, related video reasoning models focus on earlier incomplete movements, such as sitting up or approaching the switch, resulting in incorrect temporal predictions. view at source ↗
Figure 16. Visualization. VISD correctly answers the question and localizes the lorry within the ground-truth temporal window. Although other models also produce the correct answer, they ground their reasoning in an incorrect time interval, showing less precise temporal localization. view at source ↗
Figure 17. Visualization. In fine-grained activity reasoning, VISD accurately grounds the two ghost cowboys during horse riding and identifies their smiling, friendly interaction, correctly inferring that they enjoy singing rather than dancing, fighting, or quarrelling. In contrast, competing models are distracted by later or irrelevant visual cues and produce incorrect answers. view at source ↗
read the original abstract

Training VideoLLMs for complex reasoning remains challenging due to sparse sequence-level rewards and the lack of fine-grained credit assignment over long, temporally grounded reasoning trajectories. While reinforcement learning with verifiable rewards (RLVR) provides reliable supervision, it fails to capture token-level contributions, leading to inefficient learning. Conversely, existing self-distillation methods offer dense supervision but lack structure and diagnostic specificity, and often interact unstably with reinforcement learning. In this work, we propose VISD, a structured self-distillation framework that introduces diagnostically meaningful privileged information for video reasoning. VISD employs a video-aware judge model to decompose reasoning quality into multiple dimensions, including answer correctness, logical consistency, and spatio-temporal grounding, and uses this structured feedback to guide a teacher policy for token-level supervision. To stably integrate dense supervision with RL, we introduce a direction-magnitude decoupling mechanism, where rollout-level advantages computed from rewards determine update direction, while structured privileged signals modulate token-level update magnitudes. This design enables semantically aligned and fine-grained credit assignment, improving both reasoning faithfulness and training efficiency. Additionally, VISD incorporates curriculum scheduling and EMA-based teacher stabilization to support robust optimization over long video sequences. Experiments on diverse benchmarks show that VISD consistently outperforms strong baselines, improving answer accuracy and spatio-temporal grounding quality. Notably, VISD reaches these gains with nearly 2x faster convergence in optimization steps, highlighting the effectiveness of structured self-supervision in improving both performance and sample efficiency for VideoLLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes VISD, a structured self-distillation framework for VideoLLMs that uses a video-aware judge model to decompose reasoning quality into answer correctness, logical consistency, and spatio-temporal grounding dimensions. This structured feedback provides token-level supervision to a teacher policy, which is integrated with RL via a direction-magnitude decoupling mechanism (rollout advantages set update direction while privileged signals modulate magnitudes), plus curriculum scheduling and EMA-based teacher stabilization. Experiments on diverse benchmarks are claimed to show consistent outperformance over baselines in answer accuracy and grounding quality, with nearly 2x faster convergence.

Significance. If the experimental claims hold and the judge signals prove reliable, VISD could meaningfully improve sample efficiency and reasoning quality in VideoLLM training by supplying diagnostically structured dense supervision that complements sparse RL rewards without destabilizing optimization. This addresses a practical bottleneck in long-horizon video reasoning and could influence hybrid RL/self-distillation designs more broadly.

major comments (2)
  1. [Abstract] Abstract: The central claims of consistent outperformance, improved grounding quality, and nearly 2x faster convergence are stated without any quantitative tables, ablation results, error bars, baseline details, or description of judge-model training/calibration. This is load-bearing because the abstract supplies the only experimental evidence; without numbers or controls it is impossible to verify whether gains are attributable to the structured self-distillation or to incidental factors.
  2. [Method] Method (direction-magnitude decoupling): The design assumes the video-aware judge supplies diagnostically meaningful, unbiased privileged information for token-level magnitude modulation. No verification against human judgments, inter-annotator agreement, or ablation that removes the judge while retaining other components is described; this directly undermines attribution of the reported accuracy and convergence gains to the proposed mechanism rather than reward hacking or other artifacts.
minor comments (1)
  1. [Abstract] Abstract: 'spatio temporal' should be hyphenated as 'spatio-temporal' for consistency with earlier usage.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the thoughtful and constructive feedback on our manuscript. We address each major comment point by point below, clarifying our approach and outlining revisions to strengthen the presentation of results and methodological details.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The central claims of consistent outperformance, improved grounding quality, and nearly 2x faster convergence are stated without any quantitative tables, ablation results, error bars, baseline details, or description of judge-model training/calibration. This is load-bearing because the abstract supplies the only experimental evidence; without numbers or controls it is impossible to verify whether gains are attributable to the structured self-distillation or to incidental factors.

    Authors: We agree that the abstract would be strengthened by including key quantitative highlights to support the central claims. In the revised manuscript, we will update the abstract to report specific metrics drawn from the Experiments section, including accuracy improvements on the primary benchmarks, the observed convergence speedup factor, and brief references to the main baselines and judge-model training procedure. Full tables, ablations with error bars, and detailed controls will remain in the body of the paper, consistent with standard abstract length constraints. This change directly addresses the concern about verifiability while preserving the abstract's role as a high-level summary. revision: yes

  2. Referee: [Method] Method (direction-magnitude decoupling): The design assumes the video-aware judge supplies diagnostically meaningful, unbiased privileged information for token-level magnitude modulation. No verification against human judgments, inter-annotator agreement, or ablation that removes the judge while retaining other components is described; this directly undermines attribution of the reported accuracy and convergence gains to the proposed mechanism rather than reward hacking or other artifacts.

    Authors: We acknowledge the importance of establishing the reliability of the judge signals for attributing gains to the direction-magnitude decoupling. The manuscript already contains ablation studies that isolate the contribution of the structured privileged feedback by comparing full VISD against variants that remove or ablate the judge-derived magnitude modulation while retaining rollout advantages and other components; these results show that the performance and convergence improvements are tied to the proposed mechanism. We will expand the Method and Experiments sections to provide a clearer description of these ablations, the judge model's training and calibration process, and how the dimensional signals align with verifiable rewards. However, the current work does not include direct human validation or inter-annotator agreement studies for the judge outputs. revision: partial

standing simulated objections (unresolved)
  • Direct verification of the video-aware judge model's dimensional assessments against human judgments and inter-annotator agreement metrics

Circularity Check

0 steps flagged

No circularity: VISD is an engineering framework with no derivation chain that reduces to its own fitted inputs or self-citations.

full rationale

The paper presents VISD as a composite method combining a video-aware judge for multi-dimensional feedback, a direction-magnitude decoupling mechanism for RL integration, curriculum scheduling, and EMA stabilization. No equations, fitted parameters, or predictions are described that would reduce by construction to the same data or prior self-citations. The abstract and available text contain no load-bearing self-citations, uniqueness theorems, or ansatzes smuggled via prior work; the central claims rest on empirical improvements rather than any self-referential derivation. This is the expected non-finding for a methods paper whose contributions are algorithmic combinations rather than mathematical reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Because only the abstract is available, the ledger is populated from the high-level description. The central claim rests on the existence of a reliable multi-dimensional judge and on the stability of the decoupling mechanism; both are introduced without independent verification in the given text.

pith-pipeline@v0.9.0 · 5580 in / 1317 out tokens · 33739 ms · 2026-05-12T03:31:46.483302+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 17 internal anchors

  1. [1]

    On-policy distillation of language models: Learning from self-generated mistakes

    Rishabh Agarwal, Nino Vieillard, Yongchao Zhou, Piotr Stanczyk, Sabela Ramos Garea, Matthieu Geist, and Olivier Bachem. On-policy distillation of language models: Learning from self-generated mistakes. In The Twelfth International Conference on Learning Representations, 2024

  2. [2]

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025

  3. [3]

    Datasets and recipes for video temporal grounding via reinforcement learning

    Ruizhe Chen, Tianze Luo, Zhiting Fan, Heqing Zou, Zhaopeng Feng, Guiyang Xie, Hansheng Zhang, Zhuochen Wang, Zuozhu Liu, and Zhang Huaijian. Datasets and recipes for video temporal grounding via reinforcement learning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 983–992, 2025

  4. [4]

    Longvila: Scaling long-context visual language models for long videos

    Yukang Chen, Fuzhao Xue, Dacheng Li, Qinghao Hu, Ligeng Zhu, Xiuyu Li, Yunhao Fang, Haotian Tang, Shang Yang, Zhijian Liu, et al. Longvila: Scaling long-context visual language models for long videos. In The Thirteenth International Conference on Learning Representations, 2024

  5. [5]

    Scaling RL to long videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling RL to long videos. arXiv preprint arXiv:2507.07966, 2025

  6. [6]

    Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271, 2024

  7. [7]

    V-Star: Benchmarking video-LLMs on video spatio-temporal reasoning

    Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, and Shaogang Gong. V-Star: Benchmarking video-LLMs on video spatio-temporal reasoning. arXiv preprint arXiv:2503.11495, 2025

  8. [8]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

  9. [9]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776, 2025

  10. [10]

    Video-MME-v2: Towards the Next Stage in Benchmarks for Comprehensive Video Understanding

    Chaoyou Fu, Haozhi Yuan, Yuhao Dong, Yi-Fan Zhang, Yunhang Shen, Xiaoxing Hu, Xueying Li, Jinsen Su, Chengwu Long, Xiaoyao Xie, et al. Video-mme-v2: Towards the next stage in benchmarks for comprehensive video understanding.arXiv preprint arXiv:2604.05015, 2026

  11. [11]

    Born again neural networks

    Tommaso Furlanello, Zachary Lipton, Michael Tschannen, Laurent Itti, and Anima Anandkumar. Born again neural networks. In International Conference on Machine Learning, pages 1607–1616. PMLR, 2018

  12. [12]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE International Conference on Computer Vision, pages 5267–5275, 2017

  13. [13]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  14. [14]

    Trace: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024

  15. [15]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. 2015

  16. [16]

    Worldsense: Evaluating real-world omnimodal understanding for multimodal LLMs

    Jack Hong, Shilin Yan, Jiayin Cai, Xiaolong Jiang, Yao Hu, and Weidi Xie. Worldsense: Evaluating real-world omnimodal understanding for multimodal llms. arXiv preprint arXiv:2502.04326, 2025

  17. [17]

    Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    Kairui Hu, Penghao Wu, Fanyi Pu, Wang Xiao, Yuanhan Zhang, Xiang Yue, Bo Li, and Ziwei Liu. Video-mmmu: Evaluating knowledge acquisition from multi-discipline professional videos. arXiv preprint arXiv:2501.13826, 2025

  18. [18]

    Reinforcement Learning via Self-Distillation

    Jonas Hübotter, Frederike Lübeck, Lejs Behric, Anton Baumann, Marco Bagatella, Daniel Marta, Ido Hakimi, Idan Shenfeld, Thomas Kleine Buening, Carlos Guestrin, et al. Reinforcement learning via self-distillation. arXiv preprint arXiv:2601.20802, 2026

  19. [19]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv preprint arXiv:2410.21276, 2024

  20. [20]

    Visioncoach: Reinforcing grounded video reasoning via visual-perception prompting

    Daeun Lee, Shoubin Yu, Yue Zhang, and Mohit Bansal. Visioncoach: Reinforcing grounded video reasoning via visual-perception prompting. arXiv preprint arXiv:2603.14659, 2026

  21. [21]

    Unifying group-relative and self-distillation policy optimization via sample routing

    Gengsheng Li, Tianyu Yang, Junfeng Fang, Mingyang Song, Mao Zheng, Haiyun Guo, Dan Zhang, Jinqiao Wang, and Tat-Seng Chua. Unifying group-relative and self-distillation policy optimization via sample routing. arXiv preprint arXiv:2604.02288, 2026

  22. [22]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024

  23. [23]

    Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025

  24. [24]

    Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe

    Yaxuan Li, Yuxin Zuo, Bingxiang He, Jinqian Zhang, Chaojun Xiao, Cheng Qian, Tianyu Yu, Huan-ang Gao, Wenkai Yang, Zhiyuan Liu, et al. Rethinking on-policy distillation of large language models: Phenomenology, mechanism, and recipe. arXiv preprint arXiv:2604.13016, 2026

  25. [25]

    Self-hinting language models enhance reinforcement learning

    Baohao Liao, Hanze Dong, Xinxing Xu, Christof Monz, and Jiang Bian. Self-hinting language models enhance reinforcement learning. arXiv preprint arXiv:2602.03143, 2026

  26. [26]

    St-llm: Large language models are effective temporal learners

    Ruyang Liu, Chen Li, Haoran Tang, Yixiao Ge, Ying Shan, and Ge Li. St-llm: Large language models are effective temporal learners. In European Conference on Computer Vision, pages 1–18. Springer, 2024

  27. [27]

    Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution

    Zuyan Liu, Yuhao Dong, Ziwei Liu, Winston Hu, Jiwen Lu, and Yongming Rao. Oryx mllm: On-demand spatial-temporal understanding at arbitrary resolution. arXiv preprint arXiv:2409.12961, 2024

  28. [28]

    Fipo: Eliciting deep reasoning with future-kl influenced policy optimization

    Chiyu Ma, Shuo Yang, Kexin Huang, Jinda Lu, Haoming Meng, Shangshang Wang, Bolin Ding, Soroush Vosoughi, Guoyin Wang, and Jingren Zhou. Fipo: Eliciting deep reasoning with future-kl influenced policy optimization. arXiv preprint arXiv:2603.19835, 2026

  29. [29]

    Video-chatgpt: Towards detailed video understanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024

  30. [30]

    Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence

    Jiahao Meng, Xiangtai Li, Haochen Wang, Yue Tan, Tao Zhang, Lingdong Kong, Yunhai Tong, Anran Wang, Zhiyang Teng, Yujing Wang, et al. Open-o3 Video: Grounded video reasoning with explicit spatio-temporal evidence. arXiv preprint arXiv:2510.20579, 2025

  31. [31]

    Spacer: Reinforcing MLLMs in video spatial reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing MLLMs in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025

  32. [32]

    Simko: Simple pass@k policy optimization

    Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, and Yandong Wen. Simko: Simple pass@k policy optimization. arXiv preprint arXiv:2510.14807, 2025

  33. [33]

    Policy Distillation

    Andrei A Rusu, Sergio Gomez Colmenarejo, Caglar Gulcehre, Guillaume Desjardins, James Kirkpatrick, Razvan Pascanu, Volodymyr Mnih, Koray Kavukcuoglu, and Raia Hadsell. Policy distillation. arXiv preprint arXiv:1511.06295, 2015

  34. [34]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

  35. [35]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas Hübotter, and Pulkit Agrawal. Self-distillation enables continual learning. arXiv preprint arXiv:2601.19897, 2026

  36. [36]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

  37. [37]

    PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

    Shangkun Sun, Ruyang Liu, Haoran Tang, Yixiao Ge, Haibo Lu, Jiankun Yang, and Chen Li. Ppllava: Varied video sequence understanding with prompt guidance. arXiv preprint arXiv:2411.02327, 2024

  38. [38]

    Mvp: Enhancing video large language models via self-supervised masked video prediction

    Xiaokun Sun, Zezhong Wu, Zewen Ding, and Linli Xu. Mvp: Enhancing video large language models via self-supervised masked video prediction. arXiv preprint arXiv:2601.03781, 2026

  39. [39]

    Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology

    Haochen Wang, Xiangtai Li, Zilong Huang, Anran Wang, Jiacong Wang, Tao Zhang, Jiani Zheng, Sule Bai, Zijian Kang, Jiashi Feng, et al. Traceable evidence enhanced visual grounded reasoning: Evaluation and methodology. arXiv preprint arXiv:2507.07999, 2025

  40. [40]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434, 2025

  41. [41]

    Time-r1: Post-training large vision language model for temporal video grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision language model for temporal video grounding. arXiv preprint arXiv:2503.13377, 2025

  42. [42]

    InternVideo2.5: Empowering video MLLMs with long and rich context modeling

    Yi Wang, Xinhao Li, Ziang Yan, Yinan He, Jiashuo Yu, Xiangyu Zeng, Chenting Wang, Changlian Ma, Haian Huang, Jianfei Gao, et al. InternVideo2.5: Empowering video MLLMs with long and rich context modeling. arXiv preprint arXiv:2501.12386, 2025

  43. [43]

    Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning

    Ziyang Wang, Jaehong Yoon, Shoubin Yu, Md Mohaiminul Islam, Gedas Bertasius, and Mohit Bansal. Video-rts: Rethinking reinforcement learning and test-time scaling for efficient and enhanced video reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 28114–28128, 2025

  44. [44]

    Video-ktr: Reinforcing video reasoning via key token attribution

    Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, and Xudong Jiang. Video-ktr: Reinforcing video reasoning via key token attribution. arXiv preprint arXiv:2601.19686, 2026

  45. [45]

    Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1.5: Visual test-time scaling to reinforce multimodal reasoning by iterative perception. arXiv preprint arXiv:2509.21100, 2025

  46. [46]

    Self-Distilled RLVR

    Chenxu Yang, Chuanyu Qin, Qingyi Si, Minghui Chen, Naibin Gu, Dingyu Yao, Zheng Lin, Weiping Wang, Jiaqi Wang, and Nan Duan. Self-distilled rlvr. arXiv preprint arXiv:2604.03128, 2026

  47. [47]

    Longvt: Incentivizing "thinking with long videos" via native tool calling

    Zuhao Yang, Sudong Wang, Kaichen Zhang, Keming Wu, Sicong Leng, Yifan Zhang, Bo Li, Chengwei Qin, Shijian Lu, Xingxuan Li, et al. Longvt: Incentivizing "thinking with long videos" via native tool calling. arXiv preprint arXiv:2511.20785, 2025

  48. [48]

    Video-o3: Native interleaved clue seeking for long video multi-hop reasoning

    Xiangyu Zeng, Zhiqiu Zhang, Yuhan Zhu, Xinhao Li, Zikang Wang, Changlian Ma, Qingyu Zhang, Zizheng Huang, Kun Ouyang, Tianxiang Jiang, et al. Video-o3: Native interleaved clue seeking for long video multi-hop reasoning. arXiv preprint arXiv:2601.23224, 2026

  49. [49]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106, 2025

  50. [50]

    Video-llama: An instruction-tuned audio-visual language model for video understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 543–553, 2023

  51. [51]

    Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning

    Haoji Zhang, Xin Gu, Jiawen Li, Chixiang Ma, Sule Bai, Chubin Zhang, Bowen Zhang, Zhichao Zhou, Dongliang He, and Yansong Tang. Thinking with videos: Multimodal tool-augmented reinforcement learning for long video reasoning. arXiv preprint arXiv:2508.04416, 2025

  52. [52]

    Llava-video: Video instruction tuning with synthetic data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava-video: Video instruction tuning with synthetic data. Transactions on Machine Learning Research, 2025

  53. [53]

    Self-Distilled Reasoner: On-Policy Self-Distillation for Large Language Models

    Siyan Zhao, Zhihui Xie, Mengchen Liu, Jing Huang, Guan Pang, Feiyu Chen, and Aditya Grover. Self-distilled reasoner: On-policy self-distillation for large language models. arXiv preprint arXiv:2601.18734, 2026

  54. [54]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, et al. Group sequence policy optimization. arXiv preprint arXiv:2507.18071, 2025
