pith. sign in

arxiv: 2605.21988 · v1 · pith:S24J2ZRYnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

Learning Spatiotemporal Sensitivity in Video LLMs via Counterfactual Reinforcement Learning

Pith reviewed 2026-05-22 07:42 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video large language modelsspatiotemporal sensitivitycounterfactual reinforcement learningshortcut mitigationdynamic video understandingDyBench benchmark
0
0 comments X

The pith

Counterfactual RL trains video LLMs to change answers only when video dynamics actually change.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Video LLMs frequently solve questions using single frames or language patterns instead of tracking motion and timing. Standard reinforcement learning on final correctness can strengthen these shortcuts because they still yield reward. The paper introduces a dual-branch method that generates counterfactual videos via horizontal flips and time reversal, then adds a reward that checks whether the model's answer correctly stays the same or changes between the original and counterfactual versions. This relational constraint makes it harder for static shortcuts to receive consistent reward across both branches, pushing the model toward genuine spatiotemporal reasoning.

Core claim

CRPO is a dual-branch reinforcement learning procedure in which each training example is paired with a horizontally flipped or temporally reversed counterpart. A Counterfactual Relation Reward is computed between the two branches so that the policy is rewarded only when its answers differ on dynamic questions and match on static questions, thereby reducing the viability of shortcut policies that ignore video dynamics.

What carries the argument

Counterfactual Relation Reward (CRR) applied across original and transformed video branches inside a policy optimization loop.

If this is right

  • DyBench pair-accuracy rises because fixed-answer shortcuts are penalized across branches.
  • TimeBlind independent accuracy improves as models must use temporal order rather than static cues.
  • General video QA performance remains competitive because the method does not discard useful static information.
  • Shortcut policies become inconsistent across branches and therefore receive lower average reward.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same relational-reward idea could be tested on audio or 3D scene models that currently exploit static background features.
  • Replacing flips and reversals with learned counterfactual generators might produce harder negatives and stronger sensitivity gains.
  • The pair-accuracy metric on DyBench could be adapted to measure shortcut robustness in other multimodal benchmarks.

Load-bearing premise

Horizontal flips and temporal reversals cleanly separate dynamic from static content while preserving question meaning and without creating new exploitable artifacts.

What would settle it

Run the trained model on DyBench pairs where the counterfactual video is replaced by an unrelated clip or by a version containing visible editing artifacts; if pair-accuracy collapses to baseline levels the claim is supported.

Figures

Figures reproduced from arXiv: 2605.21988 by Bohai Gu, Dazhao Du, Eric Liu, Fangqi Zhu, Jialong Qin, Jian Liu, Song Guo, Tao Han, Xi Chen, Yujia Zhang.

Figure 1
Figure 1. Figure 1: Current Video LLMs remain insensitive to spatiotemporal changes. Left: On the same scene, the model (a) answers a static question correctly, but (b) fails on a spatiotemporal question; (c) it gives the same prediction to a video and its temporal reversal. Right: Across MVBench and TempCompass sub-tasks, accuracy drops as the fraction of spatiotemporal questions increases. fraction of spatiotemporal questio… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of CRPO. Left: Given a video question, the Task Router selects a counterfactual transformation T (horizontal flip or temporal reversal). The original branch and the counterfactual branch each generate G rollouts. These rollouts are scored by branch-specific correctness or behav￾ioral rewards, format rewards, and the Counterfactual Relation Reward (CRR), and are then used for advantage estimation a… view at source ↗
Figure 3
Figure 3. Figure 3: Overview of DyBench. 3,014 paired counterfactual videos across three sub-tasks: reversible dynamics, moving direction, and event sequence. rewards (2G values per prompt). This naturally places the two branches on a common scale (see Appendix F for a detailed analysis). Let Aˆ i and AˆT j denote the normalized advantages for the original and counterfactual rollouts, and let ρi , ρ T j be the corresponding i… view at source ↗
Figure 4
Figure 4. Figure 4: Qualitative example. For a temporally reversed video pair, the baseline Qwen3-VL predicts the same action label for both videos, whereas CRPO changes its answer and matches both ground truths. More examples are in Appendix G. introduced by a controlled transformation, and consequently have different correct answers. We then report pair accuracy (P-Acc), which counts a pair as correct only when the model an… view at source ↗
Figure 5
Figure 5. Figure 5: Training dynamics of CRPO vs. RL baselines. (a) correctness reward, (b) fraction of zero-advantage rollout groups, and (c) auxiliary reward [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Task Router output examples after applying the prompt above to randomly sampled videos. The router outputs a category label together with a short text-only reasoning chain. 2. Time-Reversal Test (Temporal Sensitivity). If we play the video in reverse, “putting down” the clothes would look like the person is picking them up or “taking” them (Option B). Because the direction of time fundamentally changes the… view at source ↗
Figure 7
Figure 7. Figure 7: The four reward components introduced by CRPO. D.2 Training dynamics on Qwen3-VL-8B [PITH_FULL_IMAGE:figures/full_fig_p022_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Training dynamics of CRPO vs. RL baselines on Qwen3-VL-8B. Same layout as [PITH_FULL_IMAGE:figures/full_fig_p023_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: plots the results. 0 0.1 0.3 0.5 50 60 70 CRR weight λ (waug=0.5) Accuracy (%) (A) Vary λ DyBench VideoMME 0 0.3 0.5 0.7 50 60 70 Aug. branch weight waug (λ=0.3) Accuracy (%) (B) Vary waug DyBench VideoMME [PITH_FULL_IMAGE:figures/full_fig_p023_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative example from DyBench (moving direction): identifying which way the yellow cube moves on opposite paired videos. 25 [PITH_FULL_IMAGE:figures/full_fig_p025_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative example from DyBench (moving direction): identifying whether the camera pans left or right while filming the scissors. Which of the following actions appeared in the video? (A) moving away from shovel with your camera. (B) approaching shovel with your camera. (C) neither of the actions happened. Qwen3-VL: The camera is steadily moving closer to the shovel, indicating an approach towards it. Th… view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative example from DyBench (moving direction): identifying whether the camera is approaching or moving away from the shovel. This is a time-lapse video of a flower. What is the flower doing? (A) Blooming. (B) Furling. (C) Staying still. Qwen3-VL: The flower starts as a closed bud and gradually opens up, with petals unfurling and expanding outward, revealing more layers and colors. This process is ch… view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative example from DyBench (reversible dynamics): identifying whether a time￾lapse of a flower shows blooming or furling. 26 [PITH_FULL_IMAGE:figures/full_fig_p026_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative example from TimeBlind: identifying whether the clay changes from a lump [PITH_FULL_IMAGE:figures/full_fig_p027_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Qualitative example from TimeBlind: identifying whether the bubbles near the diver’s [PITH_FULL_IMAGE:figures/full_fig_p027_15.png] view at source ↗
read the original abstract

Video large language models (Video LLMs) achieve strong benchmark accuracy, yet often answer video questions through shortcuts such as single-frame cues and language priors rather than by tracking spatiotemporal dynamics. This issue is exacerbated in RL post-training, where correctness-only rewards can further reinforce shortcut policies that obtain high reward without tracking video dynamics. We address this by asking a controlled counterfactual question: if the visual world changed while the question remained fixed, should the answer change or stay the same? Based on this view, we propose \textbf{Counterfactual Relational Policy Optimization (CRPO)}, a dual-branch RL framework for improving \emph{spatiotemporal sensitivity}. CRPO constructs counterfactual videos through horizontal flips and temporal reversals, trains on both original and counterfactual branches, and introduces a \textbf{Counterfactual Relation Reward (CRR)} between their answers. CRR encourages answers to change for dynamic questions and remain unchanged for static questions. This cross-branch constraint makes it difficult for shortcut policies to be consistently rewarded across both branches. To evaluate this property, we introduce \textbf{DyBench}, a paired counterfactual video benchmark with 3,014 videos covering reversible dynamics, moving direction, and event sequence, together with a strict pair-accuracy metric that prevents fixed-answer shortcuts from inflating scores. Experiments show that CRPO outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench P-Acc by +7.7 and TimeBlind I-Acc by +8.2 over the base model, indicating improved spatiotemporal sensitivity rather than stronger reliance on static shortcuts. The project website can be found at https://ddz16.github.io/crpo.github.io/ .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Video LLMs often rely on static shortcuts rather than spatiotemporal dynamics, and that standard RL post-training exacerbates this. It proposes Counterfactual Relational Policy Optimization (CRPO), a dual-branch RL method that generates counterfactual videos via horizontal flips and temporal reversals, applies a Counterfactual Relation Reward (CRR) to encourage answer changes only on dynamic questions, and introduces DyBench (3,014 paired videos) with a strict pair-accuracy (P-Acc) metric. On Qwen3-VL-8B, CRPO yields +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc over the base model, which the authors interpret as evidence of genuine sensitivity gains.

Significance. If the central empirical claims hold after validation, the work would be a meaningful contribution to mitigating shortcut learning in video understanding. The CRR formulation and the paired DyBench benchmark with its P-Acc metric provide a concrete, falsifiable way to measure cross-branch consistency, and the project website offers a route to reproducibility. These elements strengthen the paper beyond typical RL fine-tuning results.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the reported gains of +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc are presented without error bars, statistical significance tests, ablation controls on the reward weighting, or details on how the 3,014 DyBench videos were selected and validated; these omissions make it impossible to assess whether the improvements are robust or attributable to the proposed CRR mechanism.
  2. [§3.2] §3.2 (Counterfactual Construction and CRR): the claim that CRR enforces genuine dynamic tracking rather than branch-specific shortcuts rests on the unverified assumption that horizontal flips and temporal reversals preserve question semantics while altering only the targeted dynamics and introduce no new exploitable artifacts (e.g., mirrored identities or detectable transformation signatures); no ablation or filtering procedure is described that rules out models learning to detect the transformation itself.
  3. [§4.2] §4.2 (DyBench and TimeBlind): the pair-accuracy metric is designed to penalize fixed-answer shortcuts, yet the paper provides no evidence that the benchmark questions were explicitly filtered for clean separability between dynamic and static cases or that transformation-detection shortcuts were ablated; without this, the +7.7 / +8.2 gains remain compatible with an alternative explanation of asymmetric shortcut exploitation across branches.
minor comments (2)
  1. [§3.3] Clarify the exact reward weighting hyper-parameter between original and counterfactual branches and report its sensitivity in an ablation; it is listed as a free parameter but its impact on the final numbers is not quantified.
  2. [Figure 2] Ensure all figures showing example counterfactual pairs include both the original and transformed frames side-by-side with the corresponding question and model answers to allow readers to visually inspect potential artifacts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of robustness, verification of assumptions, and potential alternative explanations, which we address point by point below. We have revised the manuscript to incorporate additional analyses, ablations, and clarifications where feasible.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported gains of +7.7 DyBench P-Acc and +8.2 TimeBlind I-Acc are presented without error bars, statistical significance tests, ablation controls on the reward weighting, or details on how the 3,014 DyBench videos were selected and validated; these omissions make it impossible to assess whether the improvements are robust or attributable to the proposed CRR mechanism.

    Authors: We agree that the original presentation omitted key elements for assessing robustness. In the revised manuscript, we now report error bars from three independent runs with different random seeds for all main results in §4 and the abstract. We added paired t-tests confirming statistical significance (p < 0.01) for the +7.7 and +8.2 gains. A new ablation table in the appendix varies the CRR weighting hyperparameter and shows that performance degrades without the cross-branch term, supporting attribution to CRR. For DyBench construction, we expanded §4.2 with details on sourcing from public video datasets, manual review by three annotators for dynamic content, and inter-annotator agreement statistics (Cohen's kappa = 0.82). These changes directly address the concerns. revision: yes

  2. Referee: [§3.2] §3.2 (Counterfactual Construction and CRR): the claim that CRR enforces genuine dynamic tracking rather than branch-specific shortcuts rests on the unverified assumption that horizontal flips and temporal reversals preserve question semantics while altering only the targeted dynamics and introduce no new exploitable artifacts (e.g., mirrored identities or detectable transformation signatures); no ablation or filtering procedure is described that rules out models learning to detect the transformation itself.

    Authors: This concern is well-taken, as unverified assumptions can weaken causal claims. We maintain that horizontal flips and temporal reversals are appropriate for the question types in DyBench because they alter direction or sequence without changing object identities or event categories, but we acknowledge the need for explicit checks. In the revision, we added an ablation in §3.2 training a lightweight video classifier to predict transformation type (original vs. flipped vs. reversed) from frame features alone, which achieves accuracy near chance (34%). We also describe a lightweight filtering step applied during data preparation that excludes pairs where the transformation visibly affects semantics (e.g., readable text or asymmetric objects). These additions provide supporting evidence, though we note that exhaustive artifact exclusion remains an open challenge for any synthetic counterfactual method. revision: partial

  3. Referee: [§4.2] §4.2 (DyBench and TimeBlind): the pair-accuracy metric is designed to penalize fixed-answer shortcuts, yet the paper provides no evidence that the benchmark questions were explicitly filtered for clean separability between dynamic and static cases or that transformation-detection shortcuts were ablated; without this, the +7.7 / +8.2 gains remain compatible with an alternative explanation of asymmetric shortcut exploitation across branches.

    Authors: We agree that demonstrating clean separability and ruling out transformation shortcuts strengthens the interpretation. In the revised §4.2, we now explicitly describe how questions were labeled as dynamic or static according to whether the ground-truth answer differs under the counterfactual transformation, with examples provided. We added an analysis comparing answer consistency across branches for the base model versus CRPO, showing that CRPO increases appropriate changes on dynamic pairs while preserving consistency on static ones. The transformation-detection ablation referenced in our response to the §3.2 comment further mitigates the asymmetric shortcut concern. While these revisions make the alternative explanation less plausible, we recognize that no benchmark can eliminate every conceivable shortcut in large models. revision: yes

Circularity Check

0 steps flagged

No circularity: CRR and DyBench defined independently of reported gains

full rationale

The derivation introduces CRPO as a dual-branch RL method that applies horizontal flips and temporal reversals to create counterfactual videos, then defines Counterfactual Relation Reward (CRR) to penalize inconsistent answers across branches for dynamic questions while rewarding consistency for static ones. DyBench is separately constructed as a paired benchmark with its own pair-accuracy metric. Neither the reward formulation nor the benchmark metric is defined in terms of the final performance numbers on Qwen3-VL-8B; the +7.7 P-Acc and +8.2 I-Acc improvements are presented as empirical outcomes rather than quantities that reduce to the training loop by construction. No self-citation chains, ansatz smuggling, or uniqueness theorems from prior author work are invoked to force the central result. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Approach rests on the untested premise that the chosen counterfactual transformations isolate spatiotemporal dynamics without side effects; standard RL assumptions about reward shaping are also invoked but not detailed.

free parameters (1)
  • Reward weighting between original and counterfactual branches
    Typical RL hyperparameter whose specific value is not reported in abstract but required for the dual-branch training to function.
axioms (1)
  • domain assumption Counterfactual transformations (horizontal flip, temporal reversal) preserve question semantics while altering only the targeted dynamics.
    Invoked when constructing the two branches and defining when answers should change or stay the same.

pith-pipeline@v0.9.0 · 5873 in / 1285 out tokens · 44793 ms · 2026-05-22T07:42:22.554688+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

68 extracted references · 68 canonical work pages · 16 internal anchors

  1. [1]

    STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    Emad Bahrami, Olga Zatsarynna, Parth Pathak, Sunando Sengupta, Juergen Gall, and Mohsen Fayyaz. Strive: Structured spatiotemporal exploration for reinforcement learning in video question answering.arXiv preprint arXiv:2604.01824, 2026

  2. [2]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  4. [4]

    Temporalbench: Benchmarking fine- grained temporal understanding for multimodal video mod- els.arXiv preprint arXiv:2410.10818, 2024

    Mu Cai, Reuben Tan, Jianrui Zhang, Bocheng Zou, Kai Zhang, Feng Yao, Fangrui Zhu, Jing Gu, Yiwu Zhong, Yuzhang Shang, et al. Temporalbench: Benchmarking fine-grained temporal understanding for multimodal video models.arXiv preprint arXiv:2410.10818, 2024

  5. [5]

    Perceptionlm: Open-access data and models for detailed visual understanding.arXiv:2504.13180, 2025

    Jang Hyun Cho, Andrea Madotto, Effrosyni Mavroudi, Triantafyllos Afouras, Tushar Nagarajan, Muhammad Maaz, Yale Song, Tengyu Ma, Shuming Hu, Suyog Jain, et al. Perceptionlm: Open- access data and models for detailed visual understanding.arXiv preprint arXiv:2504.13180, 2025

  6. [6]

    Molmo2: Open Weights and Data for Vision-Language Models with Video Understanding and Grounding

    Christopher Clark, Jieyu Zhang, Zixian Ma, Jae Sung Park, Mohammadreza Salehi, Rohun Tripathi, Sangho Lee, Zhongzheng Ren, Chris Dongjoo Kim, Yinuo Yang, et al. Molmo2: Open weights and data for vision-language models with video understanding and grounding.arXiv preprint arXiv:2601.10611, 2026

  7. [7]

    Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024

    Daniel Cores, Michael Dorkenwald, Manuel Mucientes, Cees GM Snoek, and Yuki M Asano. Lost in time: A new temporal benchmark for videollms.arXiv preprint arXiv:2410.07752, 2024

  8. [8]

    Vlmevalkit: An open-source toolkit for evaluating large multi-modality models

    Haodong Duan, Junming Yang, Yuxuan Qiao, Xinyu Fang, Lin Chen, Yuan Liu, Xiaoyi Dong, Yuhang Zang, Pan Zhang, Jiaqi Wang, et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. InProceedings of the 32nd ACM international conference on multimedia, pages 11198–11201, 2024

  9. [9]

    Video-r1: Reinforcing video reasoning in mllms

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. InNeurIPS, 2025. 10

  10. [10]

    OneThinker: All-in-one Reasoning Model for Image and Video

    Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, et al. Onethinker: All-in-one reasoning model for image and video.arXiv preprint arXiv:2512.03043, 2025

  11. [11]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 24108–24118, 2025

  12. [12]

    something something

    Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The" something something" video database for learning and evaluating visual common sense. InProceedings of the IEEE international conference on computer vision, pages 5842– 5850, 2017

  13. [13]

    Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1 incentivizes reasoning in llms through reinforcement learning.Nature, 645(8081):633–638, 2025

  14. [14]

    Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models

    Wenyi Hong, Yean Cheng, Zhuoyi Yang, Weihan Wang, Lefan Wang, Xiaotao Gu, Shiyu Huang, Yuxiao Dong, and Jie Tang. Motionbench: Benchmarking and improving fine-grained video motion understanding for vision language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 8450–8460, 2025

  15. [15]

    REINFORCE++: Stabilizing Critic-Free Policy Optimization with Global Advantage Normalization

    Jian Hu, Jason Klein Liu, Haotian Xu, and Wei Shen. Reinforce++: Stabilizing critic-free policy optimization with global normalization.arXiv preprint arXiv:2501.03262, 2025

  16. [16]

    Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

    Lianghua Huang, Xin Zhao, and Kaiqi Huang. Got-10k: A large high-diversity benchmark for generic object tracking in the wild.IEEE transactions on pattern analysis and machine intelligence, 43(5):1562–1577, 2019

  17. [17]

    Mhbench: Demystifying motion hallucination in videollms

    Ming Kong, Xianzhou Zeng, Luyuan Chen, Yadong Li, Bo Yan, and Qiang Zhu. Mhbench: Demystifying motion hallucination in videollms. InProceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4401–4409, 2025

  18. [18]

    Ku, M., Chong, T., Leung, J., Shah, K., Yu, A., and Chen, W

    Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-qa benchmark for physical understanding via minimal video pairs.arXiv preprint arXiv:2506.09987, 2025

  19. [19]

    The language of actions: Recovering the syntax and semantics of goal-directed human activities

    Hilde Kuehne, Ali Arslan, and Thomas Serre. The language of actions: Recovering the syntax and semantics of goal-directed human activities. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 780–787, 2014

  20. [20]

    Revealing single frame bias for video-and-language learning

    Jie Lei, Tamara Berg, and Mohit Bansal. Revealing single frame bias for video-and-language learning. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 487–507, 2023

  21. [21]

    Timeblind: A spatio-temporal compositionality benchmark for video llms.arXiv preprint arXiv:2602.00288, 2026

    Baiqi Li, Kangyi Zhao, Ce Zhang, Chancharik Mitra, Jean de Dieu Nyandwi, and Gedas Bertasius. Timeblind: A spatio-temporal compositionality benchmark for video llms.arXiv preprint arXiv:2602.00288, 2026

  22. [22]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  23. [23]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13723–13733, 2025

  24. [24]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 11

  25. [25]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforce- ment fine-tuning.arXiv preprint arXiv:2504.06958, 2025

  26. [26]

    Sti- bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

    Yun Li, Yiming Zhang, Tao Lin, XiangRui Liu, Wenxiao Cai, Zheng Liu, and Bo Zhao. Sti- bench: Are mllms ready for precise spatial-temporal world understanding? InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 5622–5632, 2025

  27. [27]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  28. [28]

    Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

    Shuming Liu, Mingchen Zhuge, Changsheng Zhao, Jun Chen, Lemeng Wu, Zechun Liu, Chenchen Zhu, Zhipeng Cai, Chong Zhou, Haozhe Liu, et al. Videoauto-r1: Video auto reasoning via thinking once, answering twice.arXiv preprint arXiv:2601.05175, 2026

  29. [29]

    Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? InFindings of the Association for Computational Linguistics: ACL 2024, pages 8731–8772, 2024

  30. [30]

    Part i: Tricks or traps? a deep dive into rl for llm reasoning

    Zihe Liu, Jiashun Liu, Yancheng He, Weixun Wang, Jiaheng Liu, Ling Pan, Xinyu Hu, Shaopan Xiong, Ju Huang, Jian Hu, et al. Part i: Tricks or traps? a deep dive into rl for llm reasoning. arXiv preprint arXiv:2508.08221, 2025

  31. [31]

    Deepvideo-r1: Video reinforce- ment fine-tuning via difficulty-aware regressive grpo

    Jinyoung Park, Jeehye Na, Jinyoung Kim, and Hyunwoo J Kim. Deepvideo-r1: Video reinforce- ment fine-tuning via difficulty-aware regressive grpo. InNeurIPS, 2025

  32. [32]

    A new era of intelligence with gemini 3, 2025

    Sundar Pichai, Demis Hassabis, and Koray Kavukcuoglu. A new era of intelligence with gemini 3, 2025. URL https://blog.google/intl/en-africa/company-news/ outreach-and-initiatives/a-new-era-of-intelligence-with-gemini-3/

  33. [33]

    LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding

    Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, et al. Longvu: Spa- tiotemporal adaptive compression for long video-language understanding.arXiv preprint arXiv:2410.17434, 2024

  34. [34]

    Videoloom: A video large language model for joint spatial-temporal understanding.arXiv preprint arXiv:2601.07290, 2026

    Jiapeng Shi, Junke Wang, Zuyao You, Bo He, and Zuxuan Wu. Videoloom: A video large language model for joint spatial-temporal understanding.arXiv preprint arXiv:2601.07290, 2026

  35. [35]

    Video-xl: Extra-long vision language model for hour-scale video understanding

    Yan Shu, Zheng Liu, Peitian Zhang, Minghao Qin, Junjie Zhou, Zhengyang Liang, Tiejun Huang, and Bo Zhao. Video-xl: Extra-long vision language model for hour-scale video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 26160–26169, 2025

  36. [36]

    OpenAI GPT-5 System Card

    Aaditya Singh, Adam Fry, Adam Perelman, Adam Tart, Adi Ganesh, Ahmed El-Kishky, Aidan McLaughlin, Aiden Low, AJ Ostrow, Akhila Ananthram, et al. Openai gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025

  37. [37]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024

  38. [38]

    Combining embedded accelerometers with computer vision for recognizing food preparation activities

    Sebastian Stein and Stephen J McKenna. Combining embedded accelerometers with computer vision for recognizing food preparation activities. InProceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing, pages 729–738, 2013

  39. [39]

    VL-Rethinker: Incentivizing Self-Reflection of Vision-Language Models with Reinforcement Learning

    Haozhe Wang, Chao Qu, Zuming Huang, Wei Chu, Fangzhen Lin, and Wenhu Chen. Vl- rethinker: Incentivizing self-reflection of vision-language models with reinforcement learning. arXiv preprint arXiv:2504.08837, 2025

  40. [40]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. InNeurIPS, 2025. 12

  41. [41]

    Lvbench: An extreme long video understanding benchmark

    Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Ming Ding, Xiaotao Gu, Shiyu Huang, Bin Xu, et al. Lvbench: An extreme long video understanding benchmark. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 22958– 22967, 2025

  42. [42]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency.arXiv preprint arXiv:2508.18265, 2025

  43. [43]

    Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

    Xiyao Wang, Zhengyuan Yang, Chao Feng, Hongjin Lu, Linjie Li, Chung-Ching Lin, Kevin Lin, Furong Huang, and Lijuan Wang. Sota with less: Mcts-guided sample selection for data-efficient visual reasoning self-improvement.arXiv preprint arXiv:2504.07934, 2025

  44. [44]

    Video-ktr: Reinforcing video reasoning via key token attribution.arXiv preprint arXiv:2601.19686, 2026

    Ziyue Wang, Sheng Jin, Zhongrong Zuo, Jiawei Wu, Han Qiu, Qi She, Hao Zhang, and Xudong Jiang. Video-ktr: Reinforcing video reasoning via key token attribution.arXiv preprint arXiv:2601.19686, 2026

  45. [45]

    Seeing the arrow of time in large multimodal models

    Zihui Xue, Mi Luo, and Kristen Grauman. Seeing the arrow of time in large multimodal models. InNeurIPS, 2025

  46. [46]

    CLEVRER: CoLlision Events for Video REpresentation and Reasoning

    Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B Tenenbaum. Clevrer: Collision events for video representation and reasoning.arXiv preprint arXiv:1910.01442, 2019

  47. [47]

    Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025

    En Yu, Kangheng Lin, Liang Zhao, Yana Wei, Zining Zhu, Haoran Wei, Jianjian Sun, Zheng Ge, Xiangyu Zhang, Jingyu Wang, et al. Unhackable temporal rewarding for scalable video mllms.arXiv preprint arXiv:2502.12081, 2025

  48. [48]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, et al. Dapo: An open-source llm reinforcement learning system at scale.arXiv preprint arXiv:2503.14476, 2025

  49. [49]

    Videorefer suite: Advancing spatial-temporal object understanding with video llm

    Yuqian Yuan, Hang Zhang, Wentong Li, Zesen Cheng, Boqiang Zhang, Long Li, Xin Li, Deli Zhao, Wenqiao Zhang, Yueting Zhuang, et al. Videorefer suite: Advancing spatial-temporal object understanding with video llm. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18970–18980, 2025

  50. [50]

    VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. Videollama 3: Frontier multimodal foundation models for image and video understanding.arXiv preprint arXiv:2501.13106, 2025

  51. [51]

    LLaVA-Video: Video Instruction Tuning With Synthetic Data

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. Llava- video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713, 2024

  52. [52]

    Mmvu: Measuring expert-level multi-discipline video understanding

    Yilun Zhao, Haowei Zhang, Lujing Xie, Tongyan Hu, Guo Gan, Yitao Long, Zhiyuan Hu, Weiyuan Chen, Chuhan Li, Zhijian Xu, et al. Mmvu: Measuring expert-level multi-discipline video understanding. InProceedings of the Computer Vision and Pattern Recognition Confer- ence, pages 8475–8489, 2025

  53. [53]

    Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos.arXiv preprint arXiv:2506.04141, 2025

    Kejian Zhu, Zhuoran Jin, Hongbang Yuan, Jiachun Li, Shangqing Tu, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Mmr-v: What’s left unsaid? a benchmark for multimodal deep reasoning in videos.arXiv preprint arXiv:2506.04141, 2025

  54. [54]

    Which of the following best describes the action in the video?

    Orr Zohar, Xiaohan Wang, Yann Dubois, Nikhil Mehta, Tong Xiao, Philippe Hansen-Estruch, Licheng Yu, Xiaofang Wang, Felix Juefei-Xu, Ning Zhang, et al. Apollo: An exploration of video understanding in large multimodal models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 18891–18901, 2025. 13 A DyBench: Construction, Compar...

  55. [55]

    • NO (No Change): queries about verticality (top/bottom, high/low, tall/short), color, identity, size, or count

    Horizontal Flip Test (Spatial Sensitivity).If the video is mirrored horizontally (Left↔Right): • YES (Change): queries about left/right, clockwise/counter-clockwise, or horizontal orientation. • NO (No Change): queries about verticality (top/bottom, high/low, tall/short), color, identity, size, or count

  56. [56]

    • NO (No Change): counting, identity/color/material/shape/size, persistent states or attributes, motion direction or speed that are symmetric under reversal

    Time-Reversal Test (Temporal Sensitivity).If the video is played in REVERSE (frames played backwards, time flows backward): 18 • YES (Change): directional actions where forward ̸= backward (opening→closing, picking up→putting down, entering →exiting); questions about FIRST/LAST or BEFORE/AFTER; questions about cause/result of a sequence. • NO (No Change):...

  57. [57]

    The question does not rely on left/right orientation or horizontal positioning.NO (No Change)

    Horizontal Flip Test (Spatial Sensitivity).If we mirror the video horizontally (left becomes right), the action of the person putting down the clothes remains exactly the same. The question does not rely on left/right orientation or horizontal positioning.NO (No Change). 19 Q: This is a clip from a movie. What scene changes occur in this video? A: From th...

  58. [58]

    putting down

    Time-Reversal Test (Temporal Sensitivity).If we play the video in reverse, “putting down” the clothes would look like the person is picking them up or “taking” them (Option B). Because the direction of time fundamentally changes the nature of this action, the correct answer would change.YES (Change). Fails Flip Test, Passes Time-Reversal Test⇒Temporal. Ex...

  59. [59]

    Flipping horizontally does not affect the state of the device or the actions performed on it.NO (No Change)

    Horizontal Flip Test.The question asks about the reason an electric device did not work, specifically whether it was unplugged at the beginning or during the video. Flipping horizontally does not affect the state of the device or the actions performed on it.NO (No Change)

  60. [60]

    at the beginning of the video

    Time-Reversal Test.The question relies on the chronological order of events — distinguishing the state of the device “at the beginning of the video” versus an action occurring “during the video.” Under reversal, the beginning becomes the end, and an action like “unplugging” would look like “plugging in,” which would alter the correct answer.YES (Change). ...

  61. [61]

    Horizontal Flip Test.If the video is mirrored horizontally, the action of closing the refrigerator remains the same; it does not depend on left/right orientation.NO (No Change)

  62. [62]

    Closed” to “Opened

    Time-Reversal Test.If the video is played in reverse, the action of closing the refrigerator would appear as opening the refrigerator. The answer would change from “Closed” to “Opened.”YES (Change). Fails Flip Test, Passes Time-Reversal Test⇒Temporal. 20 Example 4•Source: PerceptionTest•Decision:Spatial Q:In which hand does the person conceal a small obje...

  63. [63]

    In which hand

    Horizontal Flip Test.If the video is mirrored horizontally, the person’s right hand will appear as their left hand. The answer to “In which hand. . . ” would change from “right” to “left.”YES (Change)

  64. [64]

    The flow of time does not change the identity of the hand holding the object.NO (No Change)

    Time-Reversal Test.If the video is played in reverse, the hand in which the object is concealed remains the same hand. The flow of time does not change the identity of the hand holding the object.NO (No Change). Passes Flip Test, Fails Time-Reversal Test⇒Spatial. Example 5•Source: PerceptionTest•Decision:Spatiotemporal Q:The person uses multiple similar o...

  65. [65]

    first object from the left,

    Horizontal Flip Test.The options explicitly rely on left/right spatial orientation (“first object from the left,” “second object from the left”). If the video is mirrored horizontally, the left-to-right order of the objects is reversed, changing the correct answer.YES (Change)

  66. [66]

    If the video is played in reverse, the original end of the game becomes the new beginning

    Time-Reversal Test.The question specifically asks about the location of the hidden objectat the beginning of the game. If the video is played in reverse, the original end of the game becomes the new beginning. Since the object is moved around during the occlusion game, its location at the end is generally different from its location at the start, so the a...

  67. [67]

    The answer remains the same.NO (No Change)

    Horizontal Flip Test.If the video is mirrored horizontally (left becomes right), the type of room the person is in (kitchen, living room/bedroom, bathroom) does not change. The answer remains the same.NO (No Change)

  68. [68]

    randomize on transformed videos

    Time-Reversal Test.If the video is played in reverse, the setting or room where the person is located does not change. The person stays in the same room regardless of the direction of time. NO (No Change). Fails both tests⇒Static. D Reward Analysis D.1 The four CRR reward components Figure 7 visualizes the four reward terms introduced by CRPO. The origina...