pith. machine review for the scientific record.

arxiv: 2512.03963 · v3 · submitted 2025-12-03 · 💻 cs.CV

Recognition: 2 Lean theorem links

TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-17 02:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords temporal understanding · multimodal large language models · reinforcement learning · multi-task learning · video analysis · temporal localization · GRPO

The pith

TempR1 strengthens multimodal large language models' grasp of time in videos and questions through a multi-task reinforcement learning framework that trains on diverse temporal patterns at once.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that exposing MLLMs to a broad set of temporal tasks during reinforcement learning produces better timing skills than training on isolated tasks. Current approaches are limited by narrow data and task types, which restricts how well models handle long videos or time-based questions. The authors address this by building a shared corpus and using a policy optimization method with rewards that differ based on how closely a model's predicted time interval matches the ground truth. When the central claim holds, models become more reliable at localizing events, detecting actions, and answering timing questions across new scenarios. A sympathetic reader would care because accurate temporal reasoning matters for any AI system that must interpret real-world video sequences where order and duration carry meaning.
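
The "rewards that differ based on how closely a model's predicted time interval matches the ground truth" are conventionally built on temporal intersection-over-union, though the paper's exact formulas are not given in the text available here. A minimal sketch under that assumption:

```python
def temporal_iou(pred, gt):
    """Intersection-over-union of two (start, end) intervals in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

# a prediction overlapping half of a 4-second ground-truth interval
print(temporal_iou((2.0, 6.0), (4.0, 8.0)))  # → 0.3333333333333333
```

A reward shaped this way is dense near a correct localization and zero for disjoint intervals, which is what makes it usable as an RL training signal.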

Core claim

TempR1 is a temporal-aware multi-task reinforcement learning framework. It curates a multi-task corpus that exposes the model to diverse temporal structures and semantics, builds on the Group Relative Policy Optimization algorithm for stable cross-task optimization, categorizes temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and designs a tailored localization reward for each type. The result is state-of-the-art performance across multiple benchmarks, with a strong synergistic effect from joint optimization that enhances both generalization and single-task performance.

What carries the argument

The three-category reward design for predicted-versus-ground-truth interval correspondence inside a multi-task GRPO optimization loop.
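
The per-type reward functions are unstated in the available text; the sketch below only illustrates how a localization reward could be routed by correspondence type (one-to-one, many-to-one, many-to-many), with every functional form an assumption rather than the paper's definition:

```python
from itertools import permutations

def iou(a, b):
    """IoU of two (start, end) intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = max(a[1], b[1]) - min(a[0], b[0])
    return inter / union if union > 0 else 0.0

def localization_reward(preds, gts, corr_type):
    """Hypothetical per-type rewards over lists of (start, end) intervals."""
    if corr_type == 1:   # one-to-one: single prediction vs. single instance
        return iou(preds[0], gts[0])
    if corr_type == 2:   # many-to-one: best prediction against one instance
        return max(iou(p, gts[0]) for p in preds)
    # many-to-many: mean IoU under the best pairing, penalizing count mismatch
    # (brute force here; a real implementation would use Hungarian matching)
    n = min(len(preds), len(gts))
    best = 0.0
    for perm in permutations(range(len(gts)), n):
        total = sum(iou(preds[i], gts[j]) for i, j in enumerate(perm))
        best = max(best, total / max(len(preds), len(gts)))
    return best
```

The point of routing is that a single scalar reward head can serve tasks whose prediction-to-instance structure differs, which is the mechanism the synergy claim leans on.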

If this is right

  • State-of-the-art results on temporal localization, action detection, and time-sensitive question answering benchmarks.
  • Synergistic gains that improve both generalization to new temporal patterns and performance on any single task.
  • A scalable training paradigm that reduces the need for separate models per temporal skill.
  • More robust handling of fine-grained temporal dependencies in long-form video analysis.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same reward categorization could be adapted to improve spatial or causal reasoning tasks in multimodal models.
  • Joint training might lower the volume of task-specific labels needed if the interval rewards transfer across domains.
  • Real-world deployment on noisy or uncurated video streams would test whether the observed synergies persist outside benchmark conditions.

Load-bearing premise

The curated multi-task corpus and three-category reward design will produce stable cross-task gains without negative transfer or overfitting to the chosen temporal patterns.

What would settle it

A controlled ablation comparing joint multi-task training against single-task baselines on each temporal benchmark: consistent per-task gains would confirm the synergy claim, while performance drops on one or more benchmarks would refute it.
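
That settling experiment reduces to a per-benchmark comparison of joint-training scores against single-task baselines. A trivial check, with all task names and numbers hypothetical:

```python
def negative_transfer(joint_scores, single_scores, tol=0.0):
    """Return benchmarks where joint training scores below the
    single-task baseline (hypothetical score dictionaries)."""
    return {task: round(joint_scores[task] - single_scores[task], 2)
            for task in single_scores
            if joint_scores[task] < single_scores[task] - tol}

# illustrative per-benchmark accuracies (not the paper's numbers)
joint = {"TG": 55.1, "TAL": 38.2, "GVQA": 61.0}
single = {"TG": 53.0, "TAL": 39.0, "GVQA": 60.5}
print(negative_transfer(joint, single))  # → {'TAL': -0.8}
```

An empty result across all benchmarks is what the synergy claim needs to survive its strongest test.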

Figures

Figures reproduced from arXiv: 2512.03963 by Deliang Fu, Gen Zhan, Junlin Li, Limin Wang, Li Yang, Li Zhang, Tao Wu, Yabin Zhang, Yiting Liao.

Figure 1
Figure 1: Performance comparison across five temporal understanding … view at source ↗
Figure 2
Figure 2: Overview of the TempR1 framework. We finetune the MLLM on a multi-task training corpus covering five temporal understanding tasks. Reinforcement learning is applied with rule-based rewards, including format and accuracy rewards, as well as localization rewards for three correspondence types: Type 1 (one-to-one, TG/DTG), Type 2 (many-to-one, VHD/GVQA), and Type 3 (many-to-many, TAL). These rewards jointly … view at source ↗
Figure 3
Figure 3: Comparison with the Qwen2.5-VL-7B base model and … view at source ↗
Figure 4
Figure 4: Qualitative result comparisons. (a) Comparison of two matching strategies for localization reward in the TAL task, showing … view at source ↗
read the original abstract

Enhancing the temporal understanding of Multimodal Large Language Models (MLLMs) is essential for advancing long-form video analysis, enabling tasks such as temporal localization, action detection, and time-sensitive question answering. While reinforcement learning (RL) has recently been explored for improving temporal reasoning, existing approaches are often confined to limited task types and data, restricting their generalization across diverse temporal understanding scenarios. To address this challenge, we present TempR1, a temporal-aware multi-task reinforcement learning framework that systematically strengthens MLLMs' temporal comprehension. We curate a multi-task corpus that exposes the model to diverse temporal structures and semantics, and build upon the Group Relative Policy Optimization (GRPO) algorithm to achieve stable and effective cross-task optimization. Specifically, we categorize temporal tasks into three correspondence types between predicted intervals and ground-truth instances, and design tailored localization rewards for each, enabling TempR1 to capture fine-grained temporal dependencies and adapt to different temporal patterns. Extensive experiments demonstrate that TempR1 attains state-of-the-art performance across multiple benchmarks. Moreover, its joint optimization over complementary tasks yields a strong synergistic effect, enhancing both generalization and single-task performance, establishing a scalable and principled paradigm for temporal reasoning in MLLMs.
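
The optimizer named in the abstract, GRPO, replaces a learned value baseline with group-relative normalization: several responses are sampled per prompt and each reward is standardized against the group's statistics. A minimal sketch of that advantage step (the exact variant TempR1 uses is not specified in the text above):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: standardize each rollout's reward
    by the sampled group's mean and (population) standard deviation."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)
    return [(r - mu) / sigma for r in rewards]

# four rollouts for one prompt, scored by format + localization rewards
print(grpo_advantages([0.2, 0.8, 0.5, 0.5]))  # extremes get ∓√2 ≈ ∓1.414
```

In multi-task training, this per-group normalization also keeps tasks with different reward scales from dominating the gradient, which is one plausible reason for building cross-task optimization on GRPO.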

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces TempR1, a temporal-aware multi-task reinforcement learning framework for Multimodal Large Language Models (MLLMs). It curates a multi-task corpus exposing the model to diverse temporal structures and employs Group Relative Policy Optimization (GRPO) with tailored localization rewards for three categories of predicted-versus-ground-truth interval correspondences. The central claims are that TempR1 achieves state-of-the-art performance across multiple benchmarks and that joint optimization over complementary tasks produces a synergistic effect improving both generalization and single-task performance.

Significance. If the empirical results and absence of negative transfer are substantiated, the work would offer a scalable paradigm for temporal reasoning in MLLMs by moving beyond single-task RL limitations, with potential benefits for long-form video analysis tasks such as localization and time-sensitive QA.

major comments (2)
  1. Abstract: the assertion of SOTA performance and a 'strong synergistic effect' from joint optimization lacks any reference to quantitative tables, ablation results, or cross-task performance metrics, rendering the central empirical claims unverifiable from the provided text and undermining assessment of whether the curated corpus and three-category rewards actually deliver stable gains without negative transfer under GRPO.
  2. Abstract: the three-category reward design for interval correspondence is described only at a high level ('tailored localization rewards for each'); without details on reward magnitude normalization, per-category coverage, or monitoring for gradient interference, the claim that this design avoids negative transfer or overfitting to chosen temporal patterns cannot be evaluated and is load-bearing for the synergistic-effect result.
minor comments (1)
  1. Abstract: the phrasing 'establishing a scalable and principled paradigm' is forward-looking and should be tempered to reflect that the manuscript demonstrates an approach rather than a fully established paradigm.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments on our manuscript. We address each major comment below and indicate the specific revisions we will make to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: Abstract: the assertion of SOTA performance and a 'strong synergistic effect' from joint optimization lacks any reference to quantitative tables, ablation results, or cross-task performance metrics, rendering the central empirical claims unverifiable from the provided text and undermining assessment of whether the curated corpus and three-category rewards actually deliver stable gains without negative transfer under GRPO.

    Authors: We agree that the abstract would be strengthened by explicit cross-references to the supporting empirical evidence. In the revised version we will update the abstract to include concise pointers such as 'as demonstrated in Tables 1–3 and Section 4.3' for the SOTA results and 'detailed ablation in Section 4.4 showing cross-task gains without negative transfer' for the synergistic effect. These additions will make the central claims directly verifiable while preserving the abstract’s brevity. revision: yes

  2. Referee: Abstract: the three-category reward design for interval correspondence is described only at a high level ('tailored localization rewards for each'); without details on reward magnitude normalization, per-category coverage, or monitoring for gradient interference, the claim that this design avoids negative transfer or overfitting to chosen temporal patterns cannot be evaluated and is load-bearing for the synergistic-effect result.

    Authors: The abstract necessarily summarizes the approach at a high level; the full reward formulations for the three correspondence categories are already specified in Section 3.2. To directly address the concern, we will expand Section 3.2 (and add a short paragraph in the abstract if space allows) with explicit details on reward magnitude normalization, per-category coverage statistics from the multi-task corpus, and the monitoring protocol used during GRPO training to detect and mitigate gradient interference. These additions will allow readers to evaluate the design’s contribution to stable joint optimization. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical RL method with external benchmarks

full rationale

The paper describes an empirical framework that curates a multi-task corpus, defines three-category localization rewards, and applies GRPO for joint optimization. Performance claims rest on experimental results across standard benchmarks rather than any closed mathematical derivation. No equations are presented that reduce a claimed prediction or synergistic effect back to fitted reward parameters or self-referential definitions. The claims are checked against external evaluation rather than self-validation, and the paper does not invoke load-bearing self-citations or uniqueness theorems from the authors' prior work.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Information limited to abstract; key unstated elements include exact functional forms of the three localization rewards and the composition of the multi-task corpus.

free parameters (1)
  • Tailored localization rewards per correspondence type
    Specific reward functions for the three predicted-versus-ground-truth interval categories are central to the method but not specified.
axioms (1)
  • domain assumption: Joint optimization over complementary temporal tasks produces synergistic generalization gains
    Invoked when claiming that multi-task training enhances both overall and single-task performance.

pith-pipeline@v0.9.0 · 5541 in / 1290 out tokens · 61091 ms · 2026-05-17T02:15:26.013806+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

72 extracted references · 72 canonical work pages · 20 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017.

  3. [3]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.

  4. [4]

    Univg-r1: Reasoning guided universal visual grounding with reinforcement learning

    Sule Bai, Mingxing Li, Yong Liu, Jing Tang, Haoji Zhang, Lei Sun, Xiangxiang Chu, and Yansong Tang. Univg-r1: Reasoning guided universal visual grounding with reinforcement learning. arXiv preprint arXiv:2505.14231, 2025.

  5. [5]

    Dense events grounding in video

    Peijun Bao, Qian Zheng, and Yadong Mu. Dense events grounding in video. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 920–928, 2021.

  6. [6]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970, 2015.

  7. [7]

    Flashvtg: Feature layering and adaptive score handling network for video temporal grounding

    Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. Flashvtg: Feature layering and adaptive score handling network for video temporal grounding. In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 9226–9236. IEEE, 2025.

  8. [8]

    Scaling RL to Long Videos

    Yukang Chen, Wei Huang, Baifeng Shi, Qinghao Hu, Hanrong Ye, Ligeng Zhu, Zhijian Liu, Pavlo Molchanov, Jan Kautz, Xiaojuan Qi, et al. Scaling rl to long videos. arXiv preprint arXiv:2507.07966, 2025.

  9. [9]

    Visrl: Intention-driven visual perception via reinforced reasoning

    Zhangquan Chen, Xufang Luo, and Dongsheng Li. Visrl: Intention-driven visual perception via reinforced reasoning. arXiv preprint arXiv:2503.07523, 2025

  10. [10]

    Video-R1: Reinforcing Video Reasoning in MLLMs

    Kaituo Feng, Kaixiong Gong, Bohao Li, Zonghao Guo, Yibing Wang, Tianshuo Peng, Junfei Wu, Xiaoying Zhang, Benyou Wang, and Xiangyu Yue. Video-r1: Reinforcing video reasoning in mllms. arXiv preprint arXiv:2503.21776,

  11. [11]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24108–24118, 2025.

  12. [12]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017.

  13. [13]

    Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding

    Chaohong Guo, Xun Mo, Yongwei Nie, Xuemiao Xu, Chao Xu, Fei Yu, and Chengjiang Long. Tar-tvg: Enhancing vlms with timestamp anchor-constrained reasoning for temporal video grounding. arXiv preprint arXiv:2508.07683, 2025.

  14. [14]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.

  15. [15]

    Trace: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. TRACE: Temporal Grounding Video LLM via Causal Event Modeling. arXiv preprint arXiv:2410.05643, 2024.

  16. [16]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14271–14280, 2024.
  17. [17]

    Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models

    Wenxuan Huang, Bohan Jia, Zijie Zhai, Shaosheng Cao, Zheyu Ye, Fei Zhao, Zhe Xu, Yao Hu, and Shaohui Lin. Vision-r1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749,

  18. [18]

    Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method

    Zhenpeng Huang, Xinhao Li, Jiaqi Li, Jing Wang, Xiangyu Zeng, Cheng Liang, Tao Wu, Xi Chen, Liang Li, and Limin Wang. Online Video Understanding: A Comprehensive Benchmark and Memory-Augmented Method. arXiv preprint arXiv:2501.00584, 2024.

  19. [19]

    The THUMOS challenge on action recognition for videos “in the wild”

    Haroon Idrees, Amir R Zamir, Yu-Gang Jiang, Alex Gorban, Ivan Laptev, Rahul Sukthankar, and Mubarak Shah. The thumos challenge on action recognition for videos “in the wild”. Computer Vision and Image Understanding, 155:1–23, 2017.

  20. [20]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. Openai o1 system card. arXiv preprint arXiv:2412.16720, 2024.

  21. [21]

    Knowing where to focus: Event-aware transformer for video grounding

    Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13846–13856, 2023.

  22. [22]

    Dense-captioning events in videos

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos. In Proceedings of the IEEE international conference on computer vision, pages 706–715, 2017.

  23. [23]

    Detecting moments and highlights in videos via natural language queries

    Jie Lei, Tamara L Berg, and Mohit Bansal. Detecting moments and highlights in videos via natural language queries. Advances in Neural Information Processing Systems, 34:11846–11858, 2021.

  24. [24]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.

  25. [25]

    Unmasked teacher: Towards training-efficient video foundation models

    Kunchang Li, Yali Wang, Yizhuo Li, Yi Wang, Yinan He, Limin Wang, and Yu Qiao. Unmasked teacher: Towards training-efficient video foundation models. In Proceedings of the IEEE/CVF international conference on computer vision, pages 19948–19960, 2023.

  26. [26]

    Momentdiff: Generative video moment retrieval from random to real

    Pandeng Li, Chen-Wei Xie, Hongtao Xie, Liming Zhao, Lei Zhang, Yun Zheng, Deli Zhao, and Yongdong Zhang. Momentdiff: Generative video moment retrieval from random to real. Advances in neural information processing systems, 36:65948–65966, 2023.

  27. [27]

    VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    Xinhao Li, Ziang Yan, Desen Meng, Lu Dong, Xiangyu Zeng, Yinan He, Yali Wang, Yu Qiao, Yi Wang, and Limin Wang. Videochat-r1: Enhancing spatio-temporal perception via reinforcement fine-tuning. arXiv preprint arXiv:2504.06958, 2025.

  28. [28]

    Groundinggpt: Language enhanced multi-modal grounding model

    Zhaowei Li, Qi Xu, Dong Zhang, Hang Song, Yiqing Cai, Qi Qi, Ran Zhou, Junting Pan, Zefeng Li, Vu Tu, et al. Groundinggpt: Language enhanced multi-modal grounding model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6657–6678, 2024.

  29. [29]

    Video-llava: Learning united visual representation by alignment before projection

    Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 5971–5984, 2024.

  30. [30]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023.

  31. [31]

    Visual Instruction Tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36:34892–34916, 2023.

  32. [32]

    Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection

    Ye Liu, Siyuan Li, Yang Wu, Chang-Wen Chen, Ying Shan, and Xiaohu Qie. Umt: Unified multi-modal transformers for joint video moment retrieval and highlight detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3042–3051, 2022.

  33. [33]

    R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding

    Ye Liu, Jixuan He, Wanhua Li, Junsik Kim, Donglai Wei, Hanspeter Pfister, and Chang Wen Chen. R2-Tuning: Efficient image-to-video transfer learning for video temporal grounding. In European Conference on Computer Vision, pages 421–438. Springer, 2024.

  34. [34]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024.

  35. [35]

    E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding

    Ye Liu, Zongyang Ma, Zhongang Qi, Yang Wu, Ying Shan, and Chang W Chen. E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding. Advances in Neural Information Processing Systems, 37:32076–32110,

  36. [36]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.

  37. [37]

    MUSEG: Reinforcing Video Temporal Understanding via Timestamp-Aware Multi-Segment Grounding

    Fuwen Luo, Shengfeng Lou, Chi Chen, Ziyue Wang, Chenliang Li, Weizhou Shen, Jiyue Guo, Peng Li, Ming Yan, Ji Zhang, et al. Museg: Reinforcing video temporal understanding via timestamp-aware multi-segment grounding. arXiv preprint arXiv:2505.20715, 2025.

  38. [38]

    Correlation-guided query-dependency calibration in video representation learning for temporal grounding

    WonJun Moon, Sangeek Hyun, Su Been Lee, and Jae-Pil Heo. Correlation-guided query-dependency calibration in video representation learning for temporal grounding. CoRR,

  39. [39]

    Query-dependent video representation for moment retrieval and highlight detection

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo. Query-dependent video representation for moment retrieval and highlight detection. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 23023–23033, 2023.

  40. [40]

    SpaceR: Reinforcing MLLMs in Video Spatial Reasoning

    Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning. arXiv preprint arXiv:2504.01805, 2025.

  41. [41]

    Perception test: A diagnostic benchmark for multimodal video models

    Viorica Patraucean, Lucas Smaira, Ankush Gupta, Adria Recasens, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Mateusz Malinowski, Yi Yang, Carl Doersch, et al. Perception test: A diagnostic benchmark for multimodal video models. Advances in Neural Information Processing Systems, 36:42748–42761, 2023.

  42. [42]

    Chatvtg: Video temporal grounding via chat with video dialogue large language models

    Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024.

  43. [43]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.

  44. [44]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024.

  45. [45]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. arXiv preprint arXiv:2402.03300, 2024.

  46. [46]

    VLM-R1: A Stable and Generalizable R1-style Large Vision-Language Model

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv preprint arXiv:2504.07615, 2025.

  47. [47]

    End-to-end dense video grounding via parallel regression

    Fengyuan Shi, Weilin Huang, and Limin Wang. End-to-end dense video grounding via parallel regression. Computer Vision and Image Understanding, 242:103980, 2024.

  48. [48]

    Moviechat: From dense token to sparse memory for long video understanding

    Enxin Song, Wenhao Chai, Guanhong Wang, Yucheng Zhang, Haoyang Zhou, Feiyang Wu, Haozhe Chi, Xun Guo, Tian Ye, Yanting Zhang, et al. Moviechat: From dense token to sparse memory for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18221–18232, 2024.

  49. [49]

    Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection

    Hao Sun, Mingyao Zhou, Wenjing Chen, and Wei Xie. Tr-detr: Task-reciprocal transformer for joint moment retrieval and highlight detection. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4998–5007, 2024.

  50. [51]

    Hierarchical semantic correspondence networks for video paragraph grounding

    Chaolei Tan, Zihang Lin, Jian-Fang Hu, Wei-Shi Zheng, and Jianhuang Lai. Hierarchical semantic correspondence networks for video paragraph grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18973–18982, 2023.

  51. [52]

    Hierarchical semantic correspondence networks for video paragraph grounding

    Chaolei Tan, Zihang Lin, Jian-Fang Hu, Wei-Shi Zheng, and Jianhuang Lai. Hierarchical semantic correspondence networks for video paragraph grounding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18973–18982, 2023.

  52. [53]

    Tspo: Temporal sampling policy optimization for long-form video language understanding

    Canhui Tang, Zifan Han, Hongbo Sun, Sanping Zhou, Xuchong Zhang, Xin Wei, Ye Yuan, Huayu Zhang, Jinglin Xu, and Hao Sun. Tspo: Temporal sampling policy optimization for long-form video language understanding. arXiv preprint arXiv:2508.04369, 2025.

  53. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.

  54. [55]

    Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning

    Qi Wang, Yanrui Yu, Ye Yuan, Rui Mao, and Tianfei Zhou. Videorft: Incentivizing video reasoning capability in mllms via reinforced fine-tuning. arXiv preprint arXiv:2505.12434,

  55. [56]

    InternVideo: General Video Foundation Models via Generative and Discriminative Learning

    Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, et al. Internvideo: General video foundation models via generative and discriminative learning. arXiv preprint arXiv:2212.03191, 2022.

  56. [57]

    Internvideo2: Scaling foundation models for mul- timodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for mul- timodal video understanding. InEuropean Conference on Computer Vision, pages 396–416. Springer, 2024. 2

  57. [58]

    Internvideo2: Scaling video foundation models for multimodal video understanding

    Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video- text llms for grounding text in videos.arXiv preprint arXiv:2403.10228, 2024. 2, 5

  58. [59]

    Effi- cient temporal extrapolation of multimodal large language models with temporal grounding bridge.arXiv preprint arXiv:2402.16050, 2024

    Yuxuan Wang, Yueqian Wang, Pengfei Wu, Jianxin Liang, Dongyan Zhao, Yang Liu, and Zilong Zheng. Effi- cient temporal extrapolation of multimodal large language models with temporal grounding bridge.arXiv preprint arXiv:2402.16050, 2024. 2

  59. [60]

    Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

    Ye Wang, Ziheng Wang, Boshen Xu, Yang Du, Kejun Lin, Zihan Xiao, Zihao Yue, Jianzhong Ju, Liang Zhang, Dingyi Yang, et al. Time-r1: Post-training large vision lan- guage model for temporal video grounding.arXiv preprint arXiv:2503.13377, 2025. 2, 5, 6, 7

  60. [61]

    Visionary-r1: Mitigating shortcuts in vi- sual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025

    Jiaer Xia, Yuhang Zang, Peng Gao, Yixuan Li, and Kaiyang Zhou. Visionary-r1: Mitigating shortcuts in vi- sual reasoning with reinforcement learning.arXiv preprint arXiv:2505.14677, 2025. 3

  61. [62]

    Can i trust your answer? visually grounded video question answering

    Junbin Xiao, Angela Yao, Yicong Li, and Tat-Seng Chua. Can i trust your answer? visually grounded video question answering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13204– 13214, 2024. 1, 2, 6, 7, 8

  62. [63]

    Bridging the gap: A unified video comprehension framework for mo- ment retrieval and highlight detection

    Yicheng Xiao, Zhuoyan Luo, Yong Liu, Yue Ma, Heng- wei Bian, Yatai Ji, Yujiu Yang, and Xiu Li. Bridging the gap: A unified video comprehension framework for mo- ment retrieval and highlight detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18709–18719, 2024. 2

  63. [64]

    arXiv preprint arXiv:2109.14084 , year=

    Hu Xu, Gargi Ghosh, Po-Yao Huang, Dmytro Okhonko, Armen Aghajanyan, Florian Metze, Luke Zettlemoyer, and Christoph Feichtenhofer. Videoclip: Contrastive pre-training for zero-shot video-text understanding.arXiv preprint arXiv:2109.14084, 2021. 2

  64. [65]

    Videochat-r1

    Ziang Yan, Xinhao Li, Yinan He, Zhengrong Yue, Xiangyu Zeng, Yali Wang, Yu Qiao, Limin Wang, and Yi Wang. Videochat-r1. 5: Visual test-time scaling to reinforce mul- timodal reasoning by iterative perception.arXiv preprint arXiv:2509.21100, 2025. 2, 5, 6

  65. [66]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xi- aochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xi- angpeng Wei, Hao Zhou, Jingjing Li...

  66. [67]

    Timesuite: Improving mllms for long video understanding via grounded tuning.arXiv preprint arXiv:2410.19702, 2024

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhen- grong Yue, Yi Wang, et al. Timesuite: Improving mllms for long video understanding via grounded tuning.arXiv preprint arXiv:2410.19702, 2024. 2, 5

  67. [68]

    Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding

    Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video un- derstanding.arXiv preprint arXiv:2306.02858, 2023. 2

  68. [69]

    Sc-captioner: Improving image captioning with self- correction by reinforcement learning

    Lin Zhang, Xianfang Zeng, Kangcong Li, Gang Yu, and Tao Chen. Sc-captioner: Improving image captioning with self- correction by reinforcement learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 23145–23155, 2025. 3

  69. [70]

    Tinyllava-video-r1: Towards smaller lmms for video reasoning

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reason- ing.arXiv preprint arXiv:2504.09641, 2025. 3

  70. [71]

    Hacs: Human action clips and segments dataset for recognition and temporal localization

    Hang Zhao, Antonio Torralba, Lorenzo Torresani, and Zhicheng Yan. Hacs: Human action clips and segments dataset for recognition and temporal localization. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision, pages 8668–8678, 2019. 6

  71. [72]

    Group Sequence Policy Optimization

    Chujie Zheng, Shixuan Liu, Mingze Li, Xiong-Hui Chen, Bowen Yu, Chang Gao, Kai Dang, Yuqiong Liu, Rui Men, An Yang, Jingren Zhou, and Junyang Lin. Group sequence policy optimization.arXiv preprint arXiv:2507.18071, 2025. 7 11

  72. [73]

    Rethinking the video sampling and reasoning strategies for temporal sentence grounding

    Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, et al. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. InFindings of the Association for Computational Linguistics: EMNLP 2022, pages 590–600, 2022. 5 12