arxiv: 2503.13377 · v3 · submitted 2025-03-17 · 💻 cs.CV · cs.AI· cs.CL

Recognition: 3 theorem links

· Lean Theorem

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Ye Wang , Ziheng Wang , Boshen Xu , Yang Du , Kejun Lin , Zihan Xiao , Zihao Yue , Jianzhong Ju

show 9 more authors

Liang Zhang Dingyi Yang Xiangnan Fang Zewen He Zhenbo Luo Wenxuan Wang Junqi Lin Jian Luan Qin Jin

Authors on Pith no claims yet

Pith reviewed 2026-05-17 02:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL

keywords temporal video groundinglarge vision-language modelsreinforcement learningpost-trainingvideo understandingdata-efficient learningbenchmark evaluation

0 comments

The pith

Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes Time-R1, a framework that uses reinforcement learning with verifiable rewards to post-train large vision-language models for the task of temporal video grounding. The approach addresses limitations in generalization that arise from standard supervised fine-tuning methods. It includes a data-efficient strategy called TimeRFT that exposes the model to increasingly difficult samples from a specially curated dataset. A new evaluation benchmark called TVGBench is introduced to test performance across eleven query types with balanced video and query distributions. The result is improved accuracy on locating video segments and better overall video comprehension capabilities.

Core claim

The central discovery is that applying reinforcement learning with verifiable rewards in a post-training stage allows large vision-language models to develop stronger reasoning for temporal video grounding. This is demonstrated through the Time-R1 framework, which combines with TimeRFT data strategies to reach state-of-the-art results on multiple datasets while using far less training data than previous approaches. The method also enhances the model's general video understanding abilities beyond the specific task.

What carries the argument

Time-R1, a reasoning-guided post-training framework that applies reinforcement learning with verifiable rewards to improve large vision-language models on temporal video grounding tasks.

If this is right

Large vision-language models can generalize better to new video grounding queries after this RL post-training.
Only a small amount of data, around 2.5K examples, suffices for achieving top performance on downstream benchmarks.
The model shows gains in general video understanding capabilities in addition to the grounding task.
The TVGBench provides a balanced way to evaluate different types of language queries for video segments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar RL-based post-training could be applied to other temporal reasoning tasks in video understanding.
The verifiable reward mechanism might help reduce the need for massive supervised datasets in multimodal AI training.
Future work could test whether this approach scales to even longer videos or more open-ended queries.

Load-bearing premise

The reinforcement learning process with verifiable rewards on the curated dataset leads to true generalization improvements instead of overfitting to the reward signals or the way the benchmarks are built.

What would settle it

Performance on a newly collected temporal video grounding test set with query types and video distributions not seen during training or benchmark construction would show no gains or even drops compared to baseline models.

read the original abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Time-R1, a reasoning-guided RL post-training framework for LVLMs on Temporal Video Grounding that uses verifiable rewards, combined with the TimeRFT progressive data strategy on a curated 2.5K-sample RL-friendly dataset. It also introduces TVGBench, a small balanced benchmark covering 11 query types. The central claim is that this yields SOTA performance on multiple downstream TVG datasets while improving general video understanding, all with limited training data.

Significance. If the verifiable reward and generalization claims hold after verification, the work would advance data-efficient post-training for multimodal video models by showing that targeted RL can outperform standard SFT. The small comprehensive benchmark could also become a useful evaluation tool for temporal reasoning in LVLMs.

major comments (3)

[Methods] Methods section (Time-R1 description): the verifiable reward is never formally defined or given a mathematical formulation (e.g., no equation for how segment localization or query-type correctness is scored). This is load-bearing for the generalization claim, because without the exact reward it is impossible to determine whether reported gains reflect improved reasoning or exploitation of patterns in the 2.5K curated dataset.
[Experiments] Experiments section: no ablation is presented that removes the RL stage or substitutes a different reward while keeping the same data and base model. Without this isolation, the SOTA numbers cannot be confidently attributed to the reasoning-guided RL rather than the progressive data strategy or benchmark construction.
[Results] Results tables (downstream dataset evaluations): performance figures are reported without error bars, standard deviations across seeds, or statistical significance tests. Given the small 2.5K training set and the claim of robust generalization across 11 query types, this omission leaves open the possibility that gains are within variance or specific to TVGBench distribution.

minor comments (2)

[Abstract] Abstract: the claim of 'improving its general video understanding capabilities' is stated without naming the specific metrics or held-out tasks used to measure this improvement.
[Figures and Tables] Figure captions and tables: axis labels and legend entries are occasionally too small or use inconsistent abbreviations (e.g., 'IoU@0.5' vs 'R@0.5'), reducing readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments below, indicating the revisions we will make to the manuscript.

read point-by-point responses

Referee: [Methods] Methods section (Time-R1 description): the verifiable reward is never formally defined or given a mathematical formulation (e.g., no equation for how segment localization or query-type correctness is scored). This is load-bearing for the generalization claim, because without the exact reward it is impossible to determine whether reported gains reflect improved reasoning or exploitation of patterns in the 2.5K curated dataset.

Authors: We agree that providing a formal mathematical definition of the verifiable reward is crucial for clarity and to support our claims about improved reasoning. Although the reward mechanism is described in the Methods section, it lacks an explicit equation. In the revised manuscript, we will introduce a mathematical formulation for the reward function, including terms for temporal localization accuracy (based on intersection-over-union with ground-truth segments) and correctness for different query types. This will help demonstrate that the performance gains arise from enhanced reasoning capabilities rather than overfitting to the curated dataset. revision: yes
Referee: [Experiments] Experiments section: no ablation is presented that removes the RL stage or substitutes a different reward while keeping the same data and base model. Without this isolation, the SOTA numbers cannot be confidently attributed to the reasoning-guided RL rather than the progressive data strategy or benchmark construction.

Authors: We appreciate this point regarding the need to isolate the effect of the RL stage. The current manuscript includes comparisons between our full Time-R1 approach and standard SFT baselines using the same 2.5K data, but does not explicitly ablate the RL component or test alternative rewards. We will add these ablation studies in the revised version, including results from SFT-only training and variants with modified reward functions, to more rigorously attribute the improvements to the reasoning-guided RL framework. revision: yes
Referee: [Results] Results tables (downstream dataset evaluations): performance figures are reported without error bars, standard deviations across seeds, or statistical significance tests. Given the small 2.5K training set and the claim of robust generalization across 11 query types, this omission leaves open the possibility that gains are within variance or specific to TVGBench distribution.

Authors: We acknowledge the value of statistical reporting for assessing robustness, particularly given the limited training data size. The original results were obtained from single runs due to computational constraints, but we will conduct additional experiments with multiple seeds in the revision and report mean performance with standard deviations. We will also include statistical significance tests to confirm that the observed improvements are significant and generalizable beyond the TVGBench distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical post-training framework (Time-R1 via RL with verifiable reward, TimeRFT progressive strategy on curated 2.5K RL-friendly data, and TVGBench) whose central claims are SOTA results on downstream datasets plus improved general video understanding. No equations, mathematical derivations, or self-citations appear in the abstract or description that reduce performance gains to fitted parameters, self-definitions, or closed loops. The verifiable reward and data curation are positioned as external contributions, with evaluation on multiple datasets and a new benchmark; claims rest on reported empirical outcomes rather than any construction that equates outputs to inputs by definition. This is the common case of a self-contained empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that verifiable rewards can be defined without circular dependence on the test distribution.

pith-pipeline@v0.9.0 · 5575 in / 1126 out tokens · 27911 ms · 2026-05-17T02:34:23.564245+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward
IndisputableMonolith.Foundation.DAlembert.Inevitability bilinear_family_forced unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

GRPO training. The LVLM F (·) takes the video frames x1, . . . , xt and the language query q as input and generates G candidate responses o1, . . . , oG
IndisputableMonolith.Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EvoGround: Self-Evolving Video Agents for Video Temporal Grounding
cs.CV 2026-05 unverdicted novelty 7.0

A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 7.0

VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...
OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding
cs.CV 2026-04 unverdicted novelty 7.0

OmniVTG creates a new large-scale open-world VTG dataset using iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction CoT paradigm that yields SOTA zero-shot results on fou...
Towards Temporal Compositional Reasoning in Long-Form Sports Videos
cs.CV 2026-04 unverdicted novelty 7.0

SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.
CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning
cs.CV 2026-01 unverdicted novelty 7.0

CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.
Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
cs.CV 2025-05 conditional novelty 7.0

Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.
Co-Evolving Policy Distillation
cs.LG 2026-04 unverdicted novelty 6.0

CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...
Video-ToC: Video Tree-of-Cue Reasoning
cs.CV 2026-04 unverdicted novelty 6.0

Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.
STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering
cs.CV 2026-04 unverdicted novelty 6.0

STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.
GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking
cs.CV 2026-02 unverdicted novelty 6.0

GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.
AdaTooler-V: Adaptive Tool-Use for Images and Videos
cs.CV 2025-12 conditional novelty 6.0

AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...
VISD: Enhancing Video Reasoning via Structured Self-Distillation
cs.CV 2026-05 unverdicted novelty 5.0

VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
cs.SD 2026-04 unverdicted novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning
cs.CV 2025-12 unverdicted novelty 5.0

TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.
OneThinker: All-in-one Reasoning Model for Image and Video
cs.CV 2025-12 unverdicted novelty 5.0

OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.
VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning
cs.CV 2025-04 unverdicted novelty 5.0

Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.
RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation
cs.CV 2026-05 unverdicted novelty 4.0

RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.
Empowering Video Translation using Multimodal Large Language Models
cs.CV 2026-04 unverdicted novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track
cs.SD 2026-04 unverdicted novelty 3.0

A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...
AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method
cs.CV 2026-04 unverdicted novelty 3.0

An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.
From System 1 to System 2: A Survey of Reasoning Large Language Models
cs.AI 2025-02 accept novelty 3.0

The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 20 Pith papers · 10 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Ht- step: Aligning instructional articles with how-to videos

Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht- step: Aligning instructional articles with how-to videos. Advances in Neural Information Processing Systems, 36:50310–50326, 2023. 6

work page 2023
[3]

Localizing moments in video with natural language

Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017. 1, 3, 6

work page 2017
[4]

Qwen2.5-VL Technical Report

Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[5]

Activitynet: A large-scale video benchmark for human activity understanding

Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 1, 3, 6, 7

work page 2015
[6]

Quo vadis, action recognition? a new model and the kinetics dataset

Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308,

work page
[7]

R1-v: Reinforcing super generalization ability in vision-language models with less than $3

Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. 4

work page 2025
[8]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. 14

work page 2023
[9]

Space-time gestures

Trevor Darrell and Alex Pentland. Space-time gestures. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 335–340. IEEE, 1993. 1

work page 1993
[10]

Gemini 2.5: Our most intelligent ai model

Google DeepMind. Gemini 2.5: Our most intelligent ai model. Google DeepMind, 2025. Model ID: gemini-2.5-pro-preview-03-25. 8, 14, 15

work page 2025
[11]

DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

BERT: pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019. 3

work page 2019
[13]

Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 7

work page 2025
[14]

Temporal localization of actions with actoms

Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Temporal localization of actions with actoms. IEEE transactions on pattern analysis and machine intelligence, 35(11):2782–2795, 2013. 1

work page 2013
[15]

Tall: Temporal activity localization via language query

Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017. 1, 3

work page 2017
[16]

Ego4d: Around the world in 3,000 hours of egocentric video

Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 1, 6

work page 2022
[17]

Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3302–3310,

work page
[18]

Trace: Temporal grounding video llm via causal event modeling

Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024. 2, 3, 8, 9, 14, 15

work page arXiv 2024
[19]

Revisionllm: Recursive vision-language model for temporal grounding in hour-long videos

Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, and Gedas Bertasius. Revisionllm: Recursive vision-language model for temporal grounding in hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3 10

work page 2025
[20]

Lora: Low-rank adaptation of large language models

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 8, 14

work page 2022
[21]

Vtimellm: Empower llm to grasp video moments

Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14271–14280, 2024. 8

work page 2024
[22]

Knowing where to focus: Event-aware transformer for video grounding

Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13846–13856, 2023. 1, 2, 3, 7, 8

work page 2023
[23]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

work page
[24]

Retrieving actions in movies

Ivan Laptev and Patrick Pérez. Retrieving actions in movies. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. 1

work page 2007
[25]

imove: Instance-motion-aware video understanding

Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Feng Cheng, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, and Di Zhang. imove: Instance-motion-aware video understanding. arXiv preprint arXiv:2502.11594,

work page arXiv
[26]

Mvbench: A comprehensive multi-modal video understanding benchmark

Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 3, 7

work page 2024
[27]

Videochat-flash: Hierarchical compression for long-context video modeling

Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024. 3, 6, 8, 14, 15

work page arXiv 2024
[28]

Improved visual-spatial reasoning via r1-zero-like training

Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Improved visual-spatial reasoning via r1-zero-like training. arXiv preprint arXiv:2504.00883, 2025. 4

work page arXiv 2025
[29]

Egocentric video-language pretraining

Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575–7586, 2022. 1, 3

work page 2022
[30]

Univtg: Towards unified video-language temporal grounding

Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023. 1, 8

work page 2023
[31]

TempCompass: Do Video LLMs Really Understand Videos?

Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024. 7

work page internal anchor Pith review arXiv 2024
[32]

Visual-RFT: Visual Reinforcement Fine-Tuning

Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Egoschema: A diagnostic benchmark for very long-form video language understanding

Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 7

work page 2023
[34]

Walk these ways: Tuning robot control for generalization with multiplicity of behavior

Gabriel B Margolis and Pulkit Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Conference on Robot Learning, pages 22–31. PMLR, 2023. 3

work page 2023
[35]

MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365, 2025. 4

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 6

work page 2019
[37]

Snag: Scalable and accurate video grounding

Fangzhou Mu, Sicheng Mo, and Yin Li. Snag: Scalable and accurate video grounding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18930–18940, 2024. 3, 8

work page 2024
[38]

Queryd: A video dataset with high-quality text and audio narrations

Andreea-Maria Oncescu, Joao F Henriques, Yang Liu, Andrew Zisserman, and Samuel Albanie. Queryd: A video dataset with high-quality text and audio narrations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2265–2269. IEEE, 2021. 6

work page 2021
[39]

Openai o1, 2024

OpenAI. Openai o1, 2024. 2, 4

work page 2024
[40]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 4 11

work page 2022
[41]

Chatvtg: Video temporal grounding via chat with video dialogue large language models

Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024. 3, 8

work page 2024
[42]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR,

work page
[43]

Grounding action descriptions in videos

Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013. 6

work page 2013
[44]

Timechat: A time-sensitive multimodal large language model for long video understanding

Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 1, 3, 6, 8, 14, 15

work page 2024
[45]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 16

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

Hollywood in homes: Crowdsourcing data collection for activity understanding

Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 1, 3, 6, 7

work page 2016
[47]

Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017. 3

work page internal anchor Pith review Pith/arXiv arXiv 2017
[48]

Reason-rft: Reinforcement fine-tuning for visual reasoning

Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752,

work page arXiv
[49]

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023. 6

work page internal anchor Pith review Pith/arXiv arXiv 2023
[50]

Hawkeye: Training video-text llms for grounding text in videos, 2024

Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos, 2024. 8

work page 2024
[51]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

work page 2020
[52]

Number it: Temporal grounding videos like flipping manga

Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, and Xu Yang. Number it: Temporal grounding videos like flipping manga. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3

work page 2025
[53]

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 2

work page 2023
[54]

Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 6, 14

work page 2023
[55]

Egolife: Towards egocentric life assistant

Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, and Ziwei Liu. Egolife: Towards egocentric life assistant. In Proceed...

work page 2025
[56]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 4, 14, 16

work page internal anchor Pith review Pith/arXiv arXiv 2025
[57]

Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024. 4 12

work page 2024
[58]

A closer look at temporal sentence grounding in videos: Dataset and metric

Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, and Wenwu Zhu. A closer look at temporal sentence grounding in videos: Dataset and metric. In Proceedings of the 2nd international workshop on human-centric multimedia analysis, pages 13–21, 2021. 5

work page 2021
[59]

Hierarchical video-moment retrieval and step-captioning

Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oguz, Yashar Mehdad, and Mohit Bansal. Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23056–23065, 2023. 6

work page 2023
[60]

Timesuite: Improving MLLMs for long video understanding via grounded tuning

Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang. Timesuite: Improving MLLMs for long video understanding via grounded tuning. In The Thirteenth International Conference on Learning Representations, 2025. 1, 3, 6, 8, 14, 15

work page 2025
[61]

Temporal sentence grounding in videos: A survey and future directions

Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and future directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10443– 10465, 2023. 1

work page 2023
[62]

Multi-scale 2d temporal adjacency networks for moment localization with natural language

Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, and Jiebo Luo. Multi-scale 2d temporal adjacency networks for moment localization with natural language. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 7

work page 2021
[63]

Learning 2d temporal adjacent networks for moment localization with natural language

Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 8

work page 2020
[64]

Tinyllava-video-r1: Towards smaller lmms for video reasoning

Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025. 4

work page arXiv 2025
[65]

Videoexpert: Augmented llm for temporal-sensitive video understanding

Henghao Zhao, Ge-Peng Ji, Rui Yan, Huan Xiong, and Zechao Li. Videoexpert: Augmented llm for temporal-sensitive video understanding. arXiv preprint arXiv:2504.07519, 2025. 2

work page arXiv 2025
[66]

goes back to the pink bucket

Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, and Zeyu Xiong. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. In Findings of the Association for Computational Linguistics: EMNLP 2022, 2022. 8, 9 13 A Limitations 14 B Implementation Details 14 C Additio...

work page 2022
[67]

The man is initially seen standing on a skateboard

work page
[68]

He then proceeds to tie his shoelaces

work page
[69]

After tying his shoelaces, he glides on the skateboard. Given this sequence, the correct answer is: (D) gliding on the skateboard and then tying his shoelaces Time-R1 thinking (ours) : The sequence of actions performed by the man in the video is as follows: first, he ties his shoelaces, and then he glides on the skateboard. This can be observed from the v...

work page
[70]

The video shows a pineapple on a pink surface

work page
[71]

A hand appears in the frame and interacts with the pineapple

work page
[72]

Given this analysis, the pineapple is indeed being pushed forward by a person

The hand lifts the pineapple, indicating that it is being moved by the person. Given this analysis, the pineapple is indeed being pushed forward by a person. Time-R1 thinking (ours) : The video shows a hand reaching into the frame and grasping the pineapple. The hand then lifts the pineapple, indicating that the pineapple is not being pushed forward by a ...

work page
[73]

This is the first major action

Folding the Dress: The video starts with C folding a blue dress on the ironing board. This is the first major action

work page
[74]

Placing the Dress on the Ironing Board: After folding the dress, C places it on the ironing board

work page
[75]

Ironing the Dress: C then irons the dress, ensuring it is smooth and free of wrinkles

work page
[76]

Now, let's evaluate the options: (A) C folds the dress, places it on the ironing board, and then hangs it up

Hanging the Dress: Finally, C hangs the dress up, completing the task. Now, let's evaluate the options: (A) C folds the dress, places it on the ironing board, and then hangs it up. (B) C folds the dress, places it on the ironing board, and then packs it. (C) C folds the dress, places it on the ironing board, and then washes it. (D) C folds the dress, plac...

work page
[77]

- Examples: - person opens a book over their head

Human Action (Simple) - Definition: Singular physical movements or basic interactions. - Examples: - person opens a book over their head. - The person gets out some ginger. - who did I talk to in the shopping mall?

work page
[78]

- Examples: - He is talking while several people are using rowing machines

Human Action (Complex) - Definition: Single continuous event with intricate components or concurrent elements. - Examples: - He is talking while several people are using rowing machines. - One man wearing blue shirt wearing a jumping leg extension and another man wearing red pants play on a field. - who did I interact with when I did activity of fixing ca...

work page
[79]

contains multiple actions, each with a clear start a nd end

Human Action (procedural) - Definition: contains multiple sequential events with explicit temporal boundaries. contains multiple actions, each with a clear start a nd end. - Examples: - The person procures a condiment from the pantry, takes a spoon from the drawer which he uses to scoop it into the pan, then returns the condiment to the pantry, places the...

work page
[80]

Posture descriptors, positional prepositions - Examples: - Several other people are in the background working out on the equipment

Human Pose - Definition: Static body positions or group configurations. Posture descriptors, positional prepositions - Examples: - Several other people are in the background working out on the equipment. - A young child is seen standing before a set of monkey bars

work page

Showing first 80 references.