pith. machine review for the scientific record.

arxiv: 2512.03963 · v3 · submitted 2025-12-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal understanding · multimodal large language models · reinforcement learning · multi-task learning · video analysis · temporal localization · GRPO

The pith

TempR1 strengthens multimodal large language models' grasp of time in videos and questions through a multi-task reinforcement learning framework that trains on diverse temporal patterns at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that exposing MLLMs to a broad set of temporal tasks during reinforcement learning produces better timing skills than training on isolated tasks. Current approaches are limited by narrow data and task types, which restricts how well models handle long videos or time-based questions. The authors address this by building a shared corpus and using a policy optimization method with rewards that differ based on how closely a model's predicted time interval matches the ground truth. When the central claim holds, models become more reliable at localizing events, detecting actions, and answering timing questions across new scenarios. A sympathetic reader would care because accurate temporal reasoning matters for any AI system that must interpret real-world video sequences where order and duration carry meaning.
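
The "rewards that differ based on how closely a model's predicted time interval matches the ground truth" are conventionally built on temporal intersection-over-union, though the paper's exact formulas are not given in the text available here. A minimal sketch under that assumption:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# a prediction overlapping half of a 4-second ground-truth interval
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # → 0.3333333333333333
```

A reward shaped this way is dense near a correct localization and zero for disjoint intervals, which is what makes it usable as an RL training signal.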

Core claim

TempR1 is a temporal-aware multi-task reinforcement learning framework. It curates a multi-task corpus that exposes the model to diverse temporal structures and semantics, builds on the Group Relative Policy Optimization algorithm for stable cross-task optimization, categorizes temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and designs a tailored localization reward for each type. The result is state-of-the-art performance across multiple benchmarks, with a strong synergistic effect from joint optimization that enhances both generalization and single-task performance.

What carries the argument

The three-category reward design for predicted-versus-ground-truth interval correspondence inside a multi-task GRPO optimization loop.
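
The per-type reward functions are unstated in the available text; the sketch below only illustrates how a localization reward could be routed by correspondence type (one-to-one, many-to-one, many-to-many), with every functional form an assumption rather than the paper's definition:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def localization_reward(preds, gts, corr_type):
    """Hypothetical per-type rewards over lists of (start, end) intervals."""
    if corr_type == 1:   # one-to-one: single prediction vs. single instance
        return iou(preds[0], gts[0])
    if corr_type == 2:   # many-to-one: best prediction against one instance
        return max(iou(p, gts[0]) for p in preds)
    # many-to-many: mean IoU under the best pairing, penalizing count mismatch
    # (brute force here; a real implementation would use Hungarian matching)
    n = min(len(preds), len(gts))
    best = 0.0
    for perm in permutations(range(len(gts)), n):
        total = sum(iou(preds[i], gts[j]) for i, j in enumerate(perm))
        best = max(best, total / max(len(preds), len(gts)))
    return best
```

The point of routing is that a single scalar reward head can serve tasks whose prediction-to-instance structure differs, which is the mechanism the synergy claim leans on.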

If this is right

  • State-of-the-art results on temporal localization, action detection, and time-sensitive question answering benchmarks.
  • Synergistic gains that improve both generalization to new temporal patterns and performance on any single task.
  • A scalable training paradigm that reduces the need for separate models per temporal skill.
  • More robust handling of fine-grained temporal dependencies in long-form video analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward categorization could be adapted to improve spatial or causal reasoning tasks in multimodal models.
  • Joint training might lower the volume of task-specific labels needed if the interval rewards transfer across domains.
  • Real-world deployment on noisy or uncurated video streams would test whether the observed synergies persist outside benchmark conditions.

Load-bearing premise

The curated multi-task corpus and three-category reward design will produce stable cross-task gains without negative transfer or overfitting to the chosen temporal patterns.

What would settle it

A controlled ablation comparing joint multi-task training against single-task baselines on each temporal benchmark: consistent per-task gains would confirm the synergy claim, while performance drops on one or more benchmarks would refute it.
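
That settling experiment reduces to a per-benchmark comparison of joint-training scores against single-task baselines. A trivial check, with all task names and numbers hypothetical:

```python
def negative_transfer(joint_scores, single_scores, tol=0.0):
    """Return benchmarks where joint training scores below the
    single-task baseline (hypothetical score dictionaries)."""
    return {task: round(joint_scores[task] - single_scores[task], 2)
            for task in single_scores
            if joint_scores[task] < single_scores[task] - tol}

# illustrative per-benchmark accuracies (not the paper's numbers)
joint = {"TG": 55.1, "TAL": 38.2, "GVQA": 61.0}
single = {"TG": 53.0, "TAL": 39.0, "GVQA": 60.5}
print(negative_transfer(joint, single))  # → {'TAL': -0.8}
```

An empty result across all benchmarks is what the synergy claim needs to survive its strongest test.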

Figures

Figures reproduced from arXiv: 2512.03963 by Deliang Fu, Gen Zhan, Junlin Li, Limin Wang, Li Yang, Li Zhang, Tao Wu, Yabin Zhang, Yiting Liao.

Figure 1
Figure 1: Performance comparison across five temporal understanding … view at source ↗
Figure 2
Figure 2: Overview of the TempR1 framework. We finetune the MLLM on a multi-task training corpus covering five temporal understanding tasks. Reinforcement learning is applied with rule-based rewards, including format and accuracy rewards, as well as localization rewards for three correspondence types: Type 1 (one-to-one, TG/DTG), Type 2 (many-to-one, VHD/GVQA), and Type 3 (many-to-many, TAL). These rewards jointly … view at source ↗
Figure 3
Figure 3: Comparison with the Qwen2.5-VL-7B base model and … view at source ↗
Figure 4
Figure 4: Qualitative result comparisons. (a) Comparison of two matching strategies for localization reward in the TAL task, showing … view at source ↗
read the original abstract

Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
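
The optimizer named in the abstract, GRPO, replaces a learned value baseline with group-relative normalization: several responses are sampled per prompt and each reward is standardized against the group's statistics. A minimal sketch of that advantage step (the exact variant TempR1 uses is not specified in the text above):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each rollout's reward
    by the sampled group's mean and (population) standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# four rollouts for one prompt, scored by format + localization rewards
print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))  # extremes get ∓√2 ≈ ∓1.414
```

In multi-task training, this per-group normalization also keeps tasks with different reward scales from dominating the gradient, which is one plausible reason for building cross-task optimization on GRPO.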

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TempR1, a temporal-aware multi-task reinforcement learning framework for Multimodal Large Language Models (MLLMs). It curates a multi-task corpus exposing the model to diverse temporal structures and employs Group Relative Policy Optimization (GRPO) with tailored localization rewards for three categories of predicted-versus-ground-truth interval correspondences. The central claims are that TempR1 achieves state-of-the-art performance across multiple benchmarks and that joint optimization over complementary tasks produces a synergistic effect improving both generalization and single-task performance.

Significance. If the empirical results and absence of negative transfer are substantiated, the work would offer a scalable paradigm for temporal reasoning in MLLMs by moving beyond single-task RL limitations, with potential benefits for long-form video analysis tasks such as localization and time-sensitive QA.

major comments (2)
  1. Abstract: the assertion of SOTA performance and a 'strong synergistic effect' from joint optimization lacks any reference to quantitative tables, ablation results, or cross-task performance metrics, rendering the central empirical claims unverifiable from the provided text and undermining assessment of whether the curated corpus and three-category rewards actually deliver stable gains without negative transfer under GRPO.
  2. Abstract: the three-category reward design for interval correspondence is described only at a high level ('tailored localization rewards for each'); without details on reward magnitude normalization, per-category coverage, or monitoring for gradient interference, the claim that this design avoids negative transfer or overfitting to chosen temporal patterns cannot be evaluated and is load-bearing for the synergistic-effect result.
minor comments (1)
  1. Abstract: the phrasing 'establishing a scalable and principled paradigm' is forward-looking and should be tempered to reflect that the manuscript demonstrates an approach rather than a fully established paradigm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and indicate the specific revisions we will make to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: Abstract: the assertion of SOTA performance and a 'strong synergistic effect' from joint optimization lacks any reference to quantitative tables, ablation results, or cross-task performance metrics, rendering the central empirical claims unverifiable from the provided text and undermining assessment of whether the curated corpus and three-category rewards actually deliver stable gains without negative transfer under GRPO.

    Authors: We agree that the abstract would be strengthened by explicit cross-references to the supporting empirical evidence. In the revised version we will update the abstract to include concise pointers such as 'as demonstrated in Tables 1–3 and Section 4.3' for the SOTA results and 'detailed ablation in Section 4.4 showing cross-task gains without negative transfer' for the synergistic effect. These additions will make the central claims directly verifiable while preserving the abstract’s brevity. revision: yes

  2. Referee: Abstract: the three-category reward design for interval correspondence is described only at a high level ('tailored localization rewards for each'); without details on reward magnitude normalization, per-category coverage, or monitoring for gradient interference, the claim that this design avoids negative transfer or overfitting to chosen temporal patterns cannot be evaluated and is load-bearing for the synergistic-effect result.

    Authors: The abstract necessarily summarizes the approach at a high level; the full reward formulations for the three correspondence categories are already specified in Section 3.2. To directly address the concern, we will expand Section 3.2 (and add a short paragraph in the abstract if space allows) with explicit details on reward magnitude normalization, per-category coverage statistics from the multi-task corpus, and the monitoring protocol used during GRPO training to detect and mitigate gradient interference. These additions will allow readers to evaluate the design’s contribution to stable joint optimization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL method with external benchmarks

full rationale

The paper describes an empirical framework that curates a multi-task corpus, defines three-category localization rewards, and applies GRPO for joint optimization. Performance claims rest on experimental results across standard benchmarks rather than any closed mathematical derivation. No equations are presented that reduce a claimed prediction or synergistic effect back to fitted reward parameters or self-referential definitions. The claims are checked against external evaluation rather than self-validation, and the paper does not invoke load-bearing self-citations or uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Information limited to abstract; key unstated elements include exact functional forms of the three localization rewards and the composition of the multi-task corpus.

free parameters (1)
  • Tailored localization rewards per correspondence type
    Specific reward functions for the three predicted-versus-ground-truth interval categories are central to the method but not specified.
axioms (1)
  • domain assumption: Joint optimization over complementary temporal tasks produces synergistic generalization gains
    Invoked when claiming that multi-task training enhances both overall and single-task performance.

pith-pipeline@v0.9.0 · 5541 in / 1290 out tokens · 61091 ms · 2026-05-17T02:15:26.013806+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017.

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    Univg-r1: Reasoning guided universal visual grounding with reinforcement learning

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231, 2025.

  5. [5]

    Dense events grounding in video

    Peijun Bao, Qian Zheng, and Yadong Mu. Dense events grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 920–928, 2021.

  6. [6]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970, 2015.

  7. [7]

    Flashvtg: Feature layering and adaptive score handling network for video temporal grounding

    Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. Flashvtg: Feature layering and adaptive score handling network for video temporal grounding. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 9226–9236. IEEE, 2025.

  8. [8]

    Scaling RL to Long Videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025.

  9. [9]

    Visrl: Intention-driven visual perception via reinforced reasoning

    Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025

  10. [10]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776,

  11. [11]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

  12. [12]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.

  13. [13]

    Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding

    Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, and Chengjiang Long. Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding. arXiv preprint arXiv:2508.07683, 2025.

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  15. [15]

    Trace: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. TRACE: Temporal Grounding Video LLM via Causal Event Modeling. arXiv preprint arXiv:2410.05643, 2024.

  16. [16]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14271–14280, 2024.
  17. [17]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  18. [18]

    Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method. arXiv preprint arXiv:2501.00584, 2024.

  19. [19]

    The THUMOS challenge on action recognition for videos “in the wild”

    Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.

  20. [20]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  21. [21]

    Knowing where to focus: Event-aware transformer for video grounding

    Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13846–13856, 2023.

  22. [22]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.

  23. [23]

    Detecting moments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021.

  24. [24]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  25. [25]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19948–19960, 2023.

  26. [26]

    Momentdiff: Generative video moment retrieval from random to real

    Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, and Yongdong Zhang. Momentdiff: Generative video moment retrieval from random to real. Advances in neural information processing systems, 36:65948–65966, 2023.

  27. [27]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

  28. [28]

    Groundinggpt: Language enhanced multi-modal grounding model

    Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Vu Tu, et al. Groundinggpt: Language enhanced multi-modal grounding model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6657–6678, 2024.

  29. [29]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.

  30. [30]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023.

  31. [31]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

  32. [32]

    Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection

    Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3042–3051, 2022.

  33. [33]

    R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding

    Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen. R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding. In European Conference on Computer Vision, pages 421–438. Springer, 2024.

  34. [34]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024.

  35. [35]

    E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

    Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang W Chen. E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding. Advances in Neural Information Processing Systems, 37:32076–32110,

  36. [36]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

  37. [37]

    MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

    Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, et al. Museg: Reinforcing video temporal understanding via timestamp-aware multi-segment grounding. arXiv preprint arXiv:2505.20715, 2025.

  38. [38]

    Correlation-guided query-dependency calibration in video representation learning for temporal grounding

    WonJun Moon, Sangeek Hyun, Su Been Lee, and Jae-Pil Heo. Correlation-guided query-dependency calibration in video representation learning for temporal grounding. CoRR,

  39. [39]

    Query-dependent video representation for moment retrieval and highlight detection

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23023–23033, 2023.

  40. [40]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.

  41. [41]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  42. [42]

    Chatvtg: Video temporal grounding via chat with video dialogue large language models

    Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.

  43. [43]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  44. [44]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.

  45. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300, 2024.

  46. [46]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

  47. [47]

    End-to-end dense video grounding via parallel regression

    Fengyuan Shi, Weilin Huang, and Limin Wang. End-to-end dense video grounding via parallel regression. Computer Vision and Image Understanding, 242:103980, 2024.

  48. [48]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

  49. [49]

    Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection

    Hao Sun, Mingyao Zhou, Wenjing Chen, and Wei Xie. Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4998–5007, 2024.

  50. [51]

    Hierarchical semantic correspondence networks for video paragraph grounding

    Chaolei Tan, Zihang Lin, Jian-Fang Hu, Wei-Shi Zheng, and Jianhuang Lai. Hierarchical semantic correspondence networks for video paragraph grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18973–18982, 2023.

  51. [52]

    Hierarchical semantic correspondence networks for video paragraph grounding

    Chaolei Tan, Zihang Lin, Jian-Fang Hu, Wei-Shi Zheng, and Jianhuang Lai. Hierarchical semantic correspondence networks for video paragraph grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18973–18982, 2023.

  52. [53]

    Tspo: Temporal sampling policy optimization for long-form video language understanding

    Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, and Hao Sun. Tspo: Temporal sampling policy optimization for long-form video language understanding. arXiv preprint arXiv:2508.04369, 2025.

  53. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  54. [55]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434,

  55. [56]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.

  56. [57]

    Internvideo2: Scaling foundation models for mul- timodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for mul- timodal video understanding. InEuropean Conference on Computer Vision, pages 396–416. Springer, 2024. 2

  57. [58]

    Internvideo2: Scaling video foundation models for multimodal video understanding

    Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video- text llms for grounding text in videos.arXiv preprint arXiv:2403.10228, 2024. 2, 5

  58. [59]

    Effi- cient temporal extrapolation of multimodal large language models with temporal grounding bridge.arXiv preprint arXiv:2402.16050, 2024

    Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, and Zilong Zheng. Effi- cient temporal extrapolation of multimodal large language models with temporal grounding bridge.arXiv preprint arXiv:2402.16050, 2024. 2

  59. [60]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2, 5, 6, 7

  60. [61]

    Visionary-r1: Mitigating shortcuts in vi- sual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

    Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in vi- sual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025. 3

  61. [62]

    Can i trust your answer? visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 1, 2, 6, 7, 8

  62. [63]

    Bridging the gap: A unified video comprehension framework for mo- ment retrieval and highlight detection

    Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Heng- wei Bian, Yatai Ji, Yujiu Yang, and Xiu Li. Bridging the gap: A unified video comprehension framework for mo- ment retrieval and highlight detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18709–18719, 2024. 2

  63. [64]

    arXiv preprint arXiv:2109.14084 , year=

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021. 2

  64. [65]

    Videochat-r1

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1. 5: Visual test-time scaling to reinforce mul- timodal reasoning by iterative perception.arXiv preprint arXiv:2509.21100, 2025. 2, 5, 6

  65. [66]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xi- angpeng Wei, Hao Zhou, Jingjing Li...

  66. [67]

    Timesuite: Improving mllms for long video understanding via grounded tuning.arXiv preprint arXiv:2410.19702, 2024

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhen- grong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning.arXiv preprint arXiv:2410.19702, 2024. 2, 5

  67. [68]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding.arXiv preprint arXiv:2306.02858, 2023. 2

  68. [69]

    Sc-captioner: Improving image captioning with self- correction by reinforcement learning

    Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self- correction by reinforcement learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23145–23155, 2025. 3

  69. [70]

    Tinyllava-video-r1: Towards smaller lmms for video reasoning

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025. 3

  70. [71]

    Hacs: Human action clips and segments dataset for recognition and temporal localization

    Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 8668–8678, 2019. 6

  71. [72]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 7 11

  72. [73]

    Rethinking the video sampling and reasoning strategies for temporal sentence grounding

    Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, et al. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 590–600, 2022. 5 12