
arxiv: 2605.21931 · v1 · pith:5BFESIVRnew · submitted 2026-05-21 · 💻 cs.CV

EvoVid: Temporal-Centric Self-Evolution for Video Large Language Models

Pith reviewed 2026-05-22 07:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords: Video-LLMs, self-evolution, temporal reasoning, video understanding, self-supervised learning, reinforcement learning, unannotated videos, Questioner-Solver

The pith

EvoVid enables Video-LLMs to improve temporal reasoning directly from raw unannotated videos through self-evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EvoVid as a framework that lets video large language models get better at video reasoning without relying on costly human-annotated tasks or solutions. It uses a Questioner-Solver self-play setup where rewards focus on time-dependent elements: one reward favors questions that change meaningfully when video timing is altered, and the other uses the model's ability to locate relevant segments within videos to create automatic training signals. This targets the core limitation of prior reinforcement learning methods for Video-LLMs, which are constrained by human expertise and cannot easily scale to large amounts of raw video data. If the approach holds, video models could continue advancing by processing more unlabeled footage on their own, bypassing annotation bottlenecks. A sympathetic reader would value this because temporal dynamics sit at the heart of what makes video understanding distinct from static image or text tasks.

Core claim

EvoVid is a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. It introduces two complementary temporal-centric rewards in a Questioner-Solver self-play loop: a temporal-aware Questioner reward that encourages generation of questions sensitive to temporal perturbations, and a temporal-grounded Solver reward that supplies automatic supervision through inherent video segment localization. Experiments across four base models and six benchmarks show consistent gains over both base models and prior self-evolving methods, reaching performance levels competitive with supervised approaches.

What carries the argument

The temporal-centric self-evolution loop with a temporal-aware Questioner reward based on perturbation sensitivity and a temporal-grounded Solver reward based on segment localization, which together generate automatic temporal supervision from raw video.
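
The perturbation-sensitivity idea behind the Questioner reward can be sketched as follows. This is a minimal illustration, not the paper's reward: the `Solver` interface, the `temporal_sensitivity` score (the fraction of shuffled trials in which the answer changes), and both toy solvers are assumptions introduced here. Frames are represented by integer indices so the sketch runs without any video library.

```python
import random
from typing import Callable, List, Sequence

# Hypothetical interface (an assumption): a solver maps (frames, question) to an answer.
Solver = Callable[[Sequence[int], str], str]


def shuffle_frames(frames: Sequence[int], rng: random.Random) -> List[int]:
    """Return the frames in a random order that differs from the input order."""
    if len(frames) < 2:
        return list(frames)
    order = list(range(len(frames)))
    identity = list(order)
    while order == identity:  # reject the no-op permutation
        rng.shuffle(order)
    return [frames[i] for i in order]


def temporal_sensitivity(
    solver: Solver,
    frames: Sequence[int],
    question: str,
    n_trials: int = 8,
    seed: int = 0,
) -> float:
    """Fraction of shuffled trials whose answer differs from the original-order answer."""
    rng = random.Random(seed)
    reference = solver(frames, question)
    changed = sum(
        solver(shuffle_frames(frames, rng), question) != reference
        for _ in range(n_trials)
    )
    return changed / n_trials


# Toy solvers (illustrative only): one depends on frame order, one does not.
def order_solver(frames: Sequence[int], question: str) -> str:
    return "ascending" if list(frames) == sorted(frames) else "scrambled"


def static_solver(frames: Sequence[int], question: str) -> str:
    return str(len(frames))


frames = list(range(16))  # 16 frame indices standing in for decoded frames
print(temporal_sensitivity(order_solver, frames, "Did the events occur in order?"))  # 1.0
print(temporal_sensitivity(static_solver, frames, "How many frames are there?"))     # 0.0
```

A question whose answer survives shuffling scores 0 and would earn no temporal credit under this sketch; how the paper combines such a signal with other reward terms is not stated in the text above.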

If this is right

  • Video-LLMs achieve measurable gains on reasoning benchmarks without any new human annotations.
  • Self-evolution extends successfully to video, a dynamic modality where prior static-focused methods fell short.
  • Performance approaches that of fully supervised training pipelines across multiple base models.
  • The method scales with larger collections of raw video because it requires no task labels.
  • Temporal focus in rewards addresses a gap left by existing self-evolving frameworks designed for text or images.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward design might adapt to other sequential data such as audio tracks or time-series sensor inputs.
  • Over repeated self-evolution cycles on massive video corpora, models could develop stronger long-horizon temporal coherence than current supervised training allows.
  • Lower annotation costs could accelerate development of video-based agents in robotics or surveillance applications.

Load-bearing premise

The rewards derived from temporal perturbation sensitivity and video segment localization supply reliable automatic signals that actually advance temporal understanding in the models.

What would settle it

If the self-evolution process were run on videos where correct temporal ordering is essential and produced no gain, or a drop, on sequence-sensitive reasoning tasks, the claim would be undermined.

Figures

Figures reproduced from arXiv: 2605.21931 by Bihan Wen, Han Qiu, Qi She, Shiqi Huang, Zhongrong Zuo, Ziyue Wang.

Figure 1: Comparison between supervised RL, VANILLA self-evolving frameworks, and EvoVid. Left: Supervised RL relies on human-annotated tasks and solutions to construct reward signals, making training costly and inherently bounded by human expertise. Middle: VANILLA self-evolving frameworks, primarily designed for static modalities, i.e., images, generate generic and often single-frame answerable questions, leading …
Figure 2: Overview of EvoVid. Questioner $\pi_Q$ and Solver $\pi_S$ co-evolve through two temporal-centric rewards. Questioner Training: with $\pi_S$ frozen, $\pi_Q$ generates questions while $\pi_S$ responds using original and shuffled frames to derive the temporal-aware Questioner reward $r^{Q}_{\mathrm{temp}}$. Solver Training: with $\pi_Q$ frozen, the Questioner generates questions from a sampled K-frame window, and $\pi_S$ predicts both the answer and temporal…
Figure 3: Effect of different shuffle strategies in Questioner reward on performance. We use random shuffle by default.
[Adjacent plot residue: average score on video benchmarks versus window size in Solver reward, for K = 4, 8, 12, 16; extracted labels, in order, 49.1, 49.2, 48.9, 48.2.]
Figure 5: Qualitative analysis of generated questions. The word cloud visualization shows that EvoVid generates questions with substantially richer temporal and reasoning-oriented vocabulary compared to the VANILLA Self-evolving baseline.
[Adjacent plot residue from §4.6 Iteration Scaling: average score on video benchmarks for the base model and iterations 1 to 6; values truncated in extraction.]
Figure 6: Iteration scaling. Performance under extended training iterations and model scales.
Figure 7: Temporal keyword frequency comparison.
Adjacent text (Appendix E, Limitations and Broader Impacts, E.1 Limitations): Despite the promising performance of our video-based self-evolving framework, several limitations remain. First, the current framework is trained with only 16 video frames, which may restrict its ability to model long-range temporal dependencies. Extending the paradigm to handle longer videos and more complex real-worl…
Original abstract

Recent Video Large Language Models (Video-LLMs) have demonstrated strong capabilities in video reasoning through reinforcement learning (RL). However, existing RL pipelines rely heavily on human-annotated tasks and solutions, making them costly to scale and fundamentally constrained by human expertise. Self-evolving frameworks have recently emerged as a promising alternative through autonomous Questioner-Solver self-play. Unfortunately, these approaches are primarily designed for static modalities such as text and images, fundamentally failing to capture the temporal dynamics that are central to video reasoning. In this work, we propose $\textbf{EvoVid}$, a temporal-centric self-evolving framework that enables Video-LLMs to improve directly from raw, unannotated videos. Specifically, we introduce two complementary temporal-centric rewards: a temporal-aware Questioner reward that encourages temporally dependent question generation through temporal perturbation sensitivity, and a temporal-grounded Solver reward that provides automatic temporal supervision via inherent video segment localization. Extensive experiments across four base models and six benchmarks demonstrate consistent improvements over both base models and existing self-evolving baselines, achieving competitive performance with supervised methods. These results highlight temporal-centric self-evolution as an effective and scalable paradigm for video understanding and reasoning.
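
One way to picture the "inherent video segment localization" signal is sketched below, under stated assumptions. The Figure 2 caption says questions are generated from a sampled K-frame window and that the Solver predicts a temporal output alongside its answer; scoring that prediction with temporal intersection-over-union against the sampled window is an assumption made here, not a formula quoted from the paper. The names `sample_window` and `temporal_iou` are hypothetical.

```python
import random
from typing import Tuple

# Inclusive [start, end] frame indices (a representation chosen for this sketch).
Segment = Tuple[int, int]


def sample_window(num_frames: int, k: int, rng: random.Random) -> Segment:
    """Sample a contiguous K-frame window; its position is known without annotation."""
    if not 1 <= k <= num_frames:
        raise ValueError("k must be between 1 and num_frames")
    start = rng.randrange(num_frames - k + 1)
    return (start, start + k - 1)


def temporal_iou(pred: Segment, target: Segment) -> float:
    """Intersection-over-union of two inclusive frame-index segments."""
    (ps, pe), (ts, te) = pred, target
    if ps > pe or ts > te:
        return 0.0  # malformed segment earns no credit
    inter = max(0, min(pe, te) - max(ps, ts) + 1)
    union = (pe - ps + 1) + (te - ts + 1) - inter
    return inter / union


rng = random.Random(0)
window = sample_window(num_frames=16, k=8, rng=rng)
print(temporal_iou(window, window))      # 1.0 for a perfect localization
print(temporal_iou((0, 3), (8, 11)))     # 0.0 for disjoint segments
```

The point of the sketch is that the target segment comes for free from the sampling step, which is what allows supervision without human labels.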

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes EvoVid, a temporal-centric self-evolving framework for Video-LLMs that enables improvement directly from raw, unannotated videos. It introduces a temporal-aware Questioner reward based on sensitivity to temporal perturbations (e.g., frame shuffling or speed changes) and a temporal-grounded Solver reward based on inherent video segment localization. The central claim is that this setup supplies effective automatic temporal supervision, yielding consistent improvements over base models and existing self-evolving baselines while achieving competitive performance with supervised methods, demonstrated across four base models and six benchmarks.

Significance. If the results hold and the rewards are shown to enforce genuine temporal reasoning, the work would be significant for offering a scalable, annotation-free paradigm that extends self-play methods to video by explicitly targeting temporal dynamics, which prior text/image self-evolving frameworks overlook. This could meaningfully reduce dependence on costly human-annotated RL data for video understanding.

major comments (2)
  1. [§3.1 (Temporal-aware Questioner reward)] The definition of the temporal-aware Questioner reward (via sensitivity to temporal perturbations) does not include a direct measurement or control experiment confirming that high-reward questions require temporal reasoning rather than static appearance cues disrupted incidentally by the chosen perturbations (e.g., frame shuffling). This validation is absent from the self-play loop description and is load-bearing for the claim of effective automatic temporal supervision from raw video alone.
  2. [§4 (Experiments and ablations)] The temporal-grounded Solver reward relies on video segment localization for automatic supervision, yet no ablation isolates whether this component (versus generic self-play) drives the reported gains in temporal reasoning tasks; without such isolation, it remains unclear if the combined rewards outperform non-temporal self-evolution baselines for the intended reason.
minor comments (2)
  1. [Abstract] The abstract asserts 'consistent improvements' and 'competitive performance' without any numerical values, confidence intervals, or specific benchmark scores; including at least one key quantitative result would strengthen the summary.
  2. [§3] Notation for the two rewards could be clarified with explicit equations or pseudocode early in §3 to make the perturbation sensitivity and segment localization mechanisms easier to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript accordingly to strengthen the validation of our temporal-centric rewards.

Point-by-point responses
  1. Referee: [§3.1 (Temporal-aware Questioner reward)] The definition of the temporal-aware Questioner reward (via sensitivity to temporal perturbations) does not include a direct measurement or control experiment confirming that high-reward questions require temporal reasoning rather than static appearance cues disrupted incidentally by the chosen perturbations (e.g., frame shuffling). This validation is absent from the self-play loop description and is load-bearing for the claim of effective automatic temporal supervision from raw video alone.

    Authors: We appreciate the referee's emphasis on rigorous validation for the temporal-aware Questioner reward. The perturbations were chosen to specifically target temporal structure: frame shuffling preserves per-frame appearance while breaking sequential order, and speed changes modify event timing and motion dynamics without altering static visual features. To directly address the concern, we have added a control experiment in the revised manuscript comparing reward sensitivity under temporal perturbations versus appearance-focused perturbations (e.g., color jitter and brightness shifts). High-reward questions show markedly higher sensitivity to temporal changes, indicating reliance on temporal reasoning. This analysis is now included in §3.1 with supporting figures in the appendix.
    Revision: yes

  2. Referee: [§4 (Experiments and ablations)] The temporal-grounded Solver reward relies on video segment localization for automatic supervision, yet no ablation isolates whether this component (versus generic self-play) drives the reported gains in temporal reasoning tasks; without such isolation, it remains unclear if the combined rewards outperform non-temporal self-evolution baselines for the intended reason.

    Authors: We agree that isolating the contribution of the temporal-grounded Solver reward strengthens the interpretation of results. Our original comparisons already include non-temporal self-evolving baselines, but to explicitly isolate the localization component we have added an ablation replacing it with a generic self-play reward based on answer consistency alone (without segment localization). The updated experiments in §4 demonstrate that the temporal-grounded component yields further gains on temporal reasoning tasks beyond generic self-play. These results are reported in the revised Section 4 and accompanying table.
    Revision: yes
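
The control comparison described in the first response (answer sensitivity under temporal perturbations versus appearance perturbations) could be tabulated along these lines. This is a sketch under assumptions: the helper names `change_rate` and `sensitivity_gap` are hypothetical, and the answer lists are made-up toy values for illustration, not results from the paper.

```python
from typing import Sequence


def change_rate(original: Sequence[str], perturbed: Sequence[str]) -> float:
    """Fraction of questions whose answer changes after a perturbation."""
    if len(original) != len(perturbed) or not original:
        raise ValueError("answer lists must be non-empty and of equal length")
    changed = sum(a != b for a, b in zip(original, perturbed))
    return changed / len(original)


def sensitivity_gap(
    original: Sequence[str],
    temporal: Sequence[str],
    appearance: Sequence[str],
) -> float:
    """Temporal change rate minus appearance change rate.

    A clearly positive gap suggests the questions depend on frame order
    rather than on static appearance cues.
    """
    return change_rate(original, temporal) - change_rate(original, appearance)


# Toy answers for four questions (illustrative values only).
original = ["A", "B", "C", "D"]
after_shuffle = ["B", "B", "A", "C"]       # answers after frame shuffling
after_color_jitter = ["A", "B", "C", "A"]  # answers after color jitter

print(sensitivity_gap(original, after_shuffle, after_color_jitter))  # 0.5
```

A gap near zero would support the referee's concern that the perturbation reward tracks incidental appearance disruption rather than temporal reasoning.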

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained with independent temporal reward definitions.

full rationale

The paper introduces EvoVid as a new framework extending self-play ideas to video via explicitly defined temporal-aware Questioner reward (based on perturbation sensitivity) and temporal-grounded Solver reward (based on segment localization). These are presented as novel components rather than reductions of prior results or fitted parameters renamed as predictions. No equations or steps reduce the central claims to self-referential definitions, self-citation chains, or ansatzes smuggled from prior author work. The derivation chain relies on the proposed rewards providing automatic supervision from raw video, which is an independent modeling choice open to empirical validation rather than a tautology. This is the common case of an honest non-finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract provides limited detail; relies on domain assumption that self-play RL extends to video via temporal rewards. Introduces two new reward mechanisms without independent evidence beyond the proposed framework.

axioms (1)
  • [domain assumption] Self-evolving Questioner-Solver frameworks can be adapted to video by adding temporal dynamics and rewards
    Core premise stated in the abstract for extending prior self-evolving work to video modality.
invented entities (2)
  • temporal-aware Questioner reward (no independent evidence)
    purpose: Encourages generation of temporally dependent questions via temporal perturbation sensitivity
    New reward component introduced to address temporal aspects in video self-evolution.
  • temporal-grounded Solver reward (no independent evidence)
    purpose: Provides automatic temporal supervision through video segment localization
    New reward component introduced to enable self-supervision without annotations.


