pith. machine review for the scientific record. sign in

arxiv: 2503.13377 · v3 · submitted 2025-03-17 · 💻 cs.CV · cs.AI· cs.CL

Recognition: 3 theorem links

· Lean Theorem

Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding

Authors on Pith no claims yet

Pith reviewed 2026-05-17 02:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.CL
keywords temporal video groundinglarge vision-language modelsreinforcement learningpost-trainingvideo understandingdata-efficient learningbenchmark evaluation
0
0 comments X

The pith

Reinforcement learning post-training enables large vision-language models to achieve state-of-the-art temporal video grounding with only 2.5K training examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes Time-R1, a framework that uses reinforcement learning with verifiable rewards to post-train large vision-language models for the task of temporal video grounding. The approach addresses limitations in generalization that arise from standard supervised fine-tuning methods. It includes a data-efficient strategy called TimeRFT that exposes the model to increasingly difficult samples from a specially curated dataset. A new evaluation benchmark called TVGBench is introduced to test performance across eleven query types with balanced video and query distributions. The result is improved accuracy on locating video segments and better overall video comprehension capabilities.

Core claim

The central discovery is that applying reinforcement learning with verifiable rewards in a post-training stage allows large vision-language models to develop stronger reasoning for temporal video grounding. This is demonstrated through the Time-R1 framework, which combines with TimeRFT data strategies to reach state-of-the-art results on multiple datasets while using far less training data than previous approaches. The method also enhances the model's general video understanding abilities beyond the specific task.

What carries the argument

Time-R1, a reasoning-guided post-training framework that applies reinforcement learning with verifiable rewards to improve large vision-language models on temporal video grounding tasks.

If this is right

  • Large vision-language models can generalize better to new video grounding queries after this RL post-training.
  • Only a small amount of data, around 2.5K examples, suffices for achieving top performance on downstream benchmarks.
  • The model shows gains in general video understanding capabilities in addition to the grounding task.
  • The TVGBench provides a balanced way to evaluate different types of language queries for video segments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar RL-based post-training could be applied to other temporal reasoning tasks in video understanding.
  • The verifiable reward mechanism might help reduce the need for massive supervised datasets in multimodal AI training.
  • Future work could test whether this approach scales to even longer videos or more open-ended queries.

Load-bearing premise

The reinforcement learning process with verifiable rewards on the curated dataset leads to true generalization improvements instead of overfitting to the reward signals or the way the benchmarks are built.

What would settle it

Performance on a newly collected temporal video grounding test set with query types and video distributions not seen during training or benchmark construction would show no gains or even drops compared to baseline models.

read the original abstract

Temporal Video Grounding (TVG), the task of locating specific video segments based on language queries, is a core challenge in long-form video understanding. While recent Large Vision-Language Models (LVLMs) have shown early promise in tackling TVG through supervised fine-tuning (SFT), their abilities to generalize remain limited. To address this, we propose a novel post-training framework that enhances the generalization capabilities of LVLMs via reinforcement learning (RL). Specifically, our contributions span three key directions: (1) Time-R1: we introduce a reasoning-guided post-training framework via RL with verifiable reward to enhance the capabilities of LVLMs on the TVG task. (2) TimeRFT: we explore data-efficient post-training strategies on our curated RL-friendly dataset, which trains the model to progressively comprehend difficult samples, leading to better generalization. (3) TVGBench: we carefully construct a small yet comprehensive benchmark for LVLM evaluation, assessing 11 types of queries and featuring balanced distributions across both videos and queries. Extensive experiments demonstrate that Time-R1 achieves state-of-the-art performance across multiple downstream datasets using only 2.5K training data, while improving its general video understanding capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes Time-R1, a reasoning-guided RL post-training framework for LVLMs on Temporal Video Grounding that uses verifiable rewards, combined with the TimeRFT progressive data strategy on a curated 2.5K-sample RL-friendly dataset. It also introduces TVGBench, a small balanced benchmark covering 11 query types. The central claim is that this yields SOTA performance on multiple downstream TVG datasets while improving general video understanding, all with limited training data.

Significance. If the verifiable reward and generalization claims hold after verification, the work would advance data-efficient post-training for multimodal video models by showing that targeted RL can outperform standard SFT. The small comprehensive benchmark could also become a useful evaluation tool for temporal reasoning in LVLMs.

major comments (3)
  1. [Methods] Methods section (Time-R1 description): the verifiable reward is never formally defined or given a mathematical formulation (e.g., no equation for how segment localization or query-type correctness is scored). This is load-bearing for the generalization claim, because without the exact reward it is impossible to determine whether reported gains reflect improved reasoning or exploitation of patterns in the 2.5K curated dataset.
  2. [Experiments] Experiments section: no ablation is presented that removes the RL stage or substitutes a different reward while keeping the same data and base model. Without this isolation, the SOTA numbers cannot be confidently attributed to the reasoning-guided RL rather than the progressive data strategy or benchmark construction.
  3. [Results] Results tables (downstream dataset evaluations): performance figures are reported without error bars, standard deviations across seeds, or statistical significance tests. Given the small 2.5K training set and the claim of robust generalization across 11 query types, this omission leaves open the possibility that gains are within variance or specific to TVGBench distribution.
minor comments (2)
  1. [Abstract] Abstract: the claim of 'improving its general video understanding capabilities' is stated without naming the specific metrics or held-out tasks used to measure this improvement.
  2. [Figures and Tables] Figure captions and tables: axis labels and legend entries are occasionally too small or use inconsistent abbreviations (e.g., 'IoU@0.5' vs 'R@0.5'), reducing readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback on our work. We address each of the major comments below, indicating the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [Methods] Methods section (Time-R1 description): the verifiable reward is never formally defined or given a mathematical formulation (e.g., no equation for how segment localization or query-type correctness is scored). This is load-bearing for the generalization claim, because without the exact reward it is impossible to determine whether reported gains reflect improved reasoning or exploitation of patterns in the 2.5K curated dataset.

    Authors: We agree that providing a formal mathematical definition of the verifiable reward is crucial for clarity and to support our claims about improved reasoning. Although the reward mechanism is described in the Methods section, it lacks an explicit equation. In the revised manuscript, we will introduce a mathematical formulation for the reward function, including terms for temporal localization accuracy (based on intersection-over-union with ground-truth segments) and correctness for different query types. This will help demonstrate that the performance gains arise from enhanced reasoning capabilities rather than overfitting to the curated dataset. revision: yes

  2. Referee: [Experiments] Experiments section: no ablation is presented that removes the RL stage or substitutes a different reward while keeping the same data and base model. Without this isolation, the SOTA numbers cannot be confidently attributed to the reasoning-guided RL rather than the progressive data strategy or benchmark construction.

    Authors: We appreciate this point regarding the need to isolate the effect of the RL stage. The current manuscript includes comparisons between our full Time-R1 approach and standard SFT baselines using the same 2.5K data, but does not explicitly ablate the RL component or test alternative rewards. We will add these ablation studies in the revised version, including results from SFT-only training and variants with modified reward functions, to more rigorously attribute the improvements to the reasoning-guided RL framework. revision: yes

  3. Referee: [Results] Results tables (downstream dataset evaluations): performance figures are reported without error bars, standard deviations across seeds, or statistical significance tests. Given the small 2.5K training set and the claim of robust generalization across 11 query types, this omission leaves open the possibility that gains are within variance or specific to TVGBench distribution.

    Authors: We acknowledge the value of statistical reporting for assessing robustness, particularly given the limited training data size. The original results were obtained from single runs due to computational constraints, but we will conduct additional experiments with multiple seeds in the revision and report mean performance with standard deviations. We will also include statistical significance tests to confirm that the observed improvements are significant and generalizable beyond the TVGBench distribution. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper presents an empirical post-training framework (Time-R1 via RL with verifiable reward, TimeRFT progressive strategy on curated 2.5K RL-friendly data, and TVGBench) whose central claims are SOTA results on downstream datasets plus improved general video understanding. No equations, mathematical derivations, or self-citations appear in the abstract or description that reduce performance gains to fitted parameters, self-definitions, or closed loops. The verifiable reward and data curation are positioned as external contributions, with evaluation on multiple datasets and a new benchmark; claims rest on reported empirical outcomes rather than any construction that equates outputs to inputs by definition. This is the common case of a self-contained empirical methods paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are stated. The approach implicitly assumes that verifiable rewards can be defined without circular dependence on the test distribution.

pith-pipeline@v0.9.0 · 5575 in / 1126 out tokens · 27911 ms · 2026-05-17T02:34:23.564245+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. EvoGround: Self-Evolving Video Agents for Video Temporal Grounding

    cs.CV 2026-05 unverdicted novelty 7.0

    A proposer-solver agent pair achieves supervised-level video temporal grounding and fine-grained captioning from 2.5K unlabeled videos via self-reinforcing evolution.

  2. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 7.0

    VISD improves VideoLLM reasoning performance and training efficiency by combining structured multi-dimensional self-distillation feedback with RL via direction-magnitude decoupling, curriculum scheduling, and EMA stab...

  3. OmniVTG: A Large-Scale Dataset and Training Paradigm for Open-World Video Temporal Grounding

    cs.CV 2026-04 unverdicted novelty 7.0

    OmniVTG creates a new large-scale open-world VTG dataset using iterative concept-gap filling and timestamped captioning, paired with a three-stage self-correction CoT paradigm that yields SOTA zero-shot results on fou...

  4. Towards Temporal Compositional Reasoning in Long-Form Sports Videos

    cs.CV 2026-04 unverdicted novelty 7.0

    SportsTime benchmark and CoTR method improve multimodal AI's temporal compositional reasoning and evidence grounding in long-form sports videos.

  5. CamReasoner: Reinforcing Camera Movement Understanding via Structured Spatial Reasoning

    cs.CV 2026-01 unverdicted novelty 7.0

    CamReasoner uses structured O-T-A reasoning and RL on 56k samples to lift camera movement classification from 73.8% to 78.4% and VQA from 60.9% to 74.5% on Qwen2.5-VL-7B.

  6. Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

    cs.CV 2025-05 conditional novelty 7.0

    Video-Holmes benchmark shows top MLLMs achieve at most 45% accuracy on tasks needing integration of multiple clues from suspense films, unlike existing perception-focused tests.

  7. Co-Evolving Policy Distillation

    cs.LG 2026-04 unverdicted novelty 6.0

    CoPD integrates multiple expert capabilities by running parallel RLVR training with bidirectional online policy distillation among experts, outperforming mixed RLVR and sequential OPD while surpassing domain-specific ...

  8. Video-ToC: Video Tree-of-Cue Reasoning

    cs.CV 2026-04 unverdicted novelty 6.0

    Video-ToC adds tree-guided cue localization, demand-based RL rewards, and automated datasets to video LLMs, reporting better results than prior methods on six understanding benchmarks plus a hallucination test.

  9. STRIVE: Structured Spatiotemporal Exploration for Reinforcement Learning in Video Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    STRIVE stabilizes RL for video QA by creating spatiotemporal video variants and using importance-aware sampling, yielding consistent gains over baselines on six benchmarks.

  10. GraphThinker: Reinforcing Temporally Grounded Video Reasoning with Event Graph Thinking

    cs.CV 2026-02 unverdicted novelty 6.0

    GraphThinker reduces temporal hallucinations in video reasoning by constructing event-based scene graphs and applying visual attention rewards in reinforcement finetuning.

  11. AdaTooler-V: Adaptive Tool-Use for Images and Videos

    cs.CV 2025-12 conditional novelty 6.0

    AdaTooler-V trains MLLMs to adaptively use vision tools via AT-GRPO reinforcement learning and new datasets, reaching 89.8% on V* and outperforming GPT-4o.

  12. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD adds structured privileged feedback from a judge model and a direction-magnitude decoupling trick to let VideoLLMs learn token-level credit assignment while keeping RL stable, yielding higher accuracy and roughly...

  13. VISD: Enhancing Video Reasoning via Structured Self-Distillation

    cs.CV 2026-05 unverdicted novelty 5.0

    VISD improves VideoLLM reasoning by adding multi-dimensional diagnostic self-distillation and RL decoupling, yielding higher accuracy, better grounding, and nearly 2x faster training convergence.

  14. Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt

    cs.SD 2026-04 unverdicted novelty 5.0

    TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.

  15. TempR1: Improving Temporal Understanding of MLLMs via Temporal-Aware Multi-Task Reinforcement Learning

    cs.CV 2025-12 unverdicted novelty 5.0

    TempR1 applies temporal-aware multi-task RL using GRPO and three types of localization rewards to achieve SOTA temporal understanding in MLLMs with synergistic gains from joint optimization.

  16. OneThinker: All-in-one Reasoning Model for Image and Video

    cs.CV 2025-12 unverdicted novelty 5.0

    OneThinker unifies image and video reasoning in one model across 10 tasks via a 600k corpus, CoT-annotated SFT, and EMA-GRPO reinforcement learning, reporting strong results on 31 benchmarks plus some cross-task transfer.

  17. VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning

    cs.CV 2025-04 unverdicted novelty 5.0

    Reinforcement fine-tuning with temporal rewards produces VideoChat-R1, a video MLLM showing large gains on spatio-temporal perception benchmarks such as +31.8 temporal grounding and +31.2 object tracking.

  18. RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

    cs.CV 2026-05 unverdicted novelty 4.0

    RCoT-Seg uses GRPO-reinforced keyframe selection from a CoT-start corpus followed by SAM2 mask propagation to improve video object segmentation under implicit temporal instructions over prior MLLM sampling methods.

  19. Empowering Video Translation using Multimodal Large Language Models

    cs.CV 2026-04 unverdicted novelty 4.0

    The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

  20. APRVOS: 1st Place Winner of 5th PVUW MeViS-Audio Track

    cs.SD 2026-04 unverdicted novelty 3.0

    A staged pipeline using ASR transcription, visual existence verification, Sa2VA coarse segmentation, and agent-guided SAM3 refinement won first place in the PVUW MeViS-Audio track by decomposing audio-conditioned Ref-...

  21. AgentRVOS for MeViS-Text Track of 5th PVUW Challenge: 3rd Method

    cs.CV 2026-04 unverdicted novelty 3.0

    An agent-augmented Sa2VA pipeline for referring video object segmentation placed third in the MeViS-Text track of the 5th PVUW Challenge by adding verification, search, and refinement stages.

  22. From System 1 to System 2: A Survey of Reasoning Large Language Models

    cs.AI 2025-02 accept novelty 3.0

    The survey organizes the shift of LLMs toward deliberate System 2 reasoning, covering model construction techniques, performance on math and coding benchmarks, and future research directions.

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · cited by 20 Pith papers · 10 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025. 2, 4

  2. [2]

    Ht- step: Aligning instructional articles with how-to videos

    Triantafyllos Afouras, Effrosyni Mavroudi, Tushar Nagarajan, Huiyu Wang, and Lorenzo Torresani. Ht- step: Aligning instructional articles with how-to videos. Advances in Neural Information Processing Systems, 36:50310–50326, 2023. 6

  3. [3]

    Localizing moments in video with natural language

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision, pages 5803–5812, 2017. 1, 3, 6

  4. [4]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report. a...

  5. [5]

    Activitynet: A large-scale video benchmark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 1, 3, 6, 7

  6. [6]

    Quo vadis, action recognition? a new model and the kinetics dataset

    Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6299–6308,

  7. [7]

    R1-v: Reinforcing super generalization ability in vision-language models with less than $3

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3. https://github.com/Deep-Agent/R1-V, 2025. 4

  8. [8]

    Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

    Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023. 14

  9. [9]

    Space-time gestures

    Trevor Darrell and Alex Pentland. Space-time gestures. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, pages 335–340. IEEE, 1993. 1

  10. [10]

    Gemini 2.5: Our most intelligent ai model

    Google DeepMind. Gemini 2.5: Our most intelligent ai model. Google DeepMind, 2025. Model ID: gemini-2.5-pro-preview-03-25. 8, 14, 15

  11. [11]

    DeepSeek LLM: Scaling Open-Source Language Models with Longtermism

    DeepSeek-AI. Deepseek llm: Scaling open-source language models with longtermism. arXiv preprint arXiv:2401.02954, 2024. 7

  12. [12]

    BERT: pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics, pages 4171–4186, 2019. 3

  13. [13]

    Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis

    Chaoyou Fu, Yuhan Dai, Yongdong Luo, Lei Li, Shuhuai Ren, Renrui Zhang, Zihan Wang, Chenyu Zhou, Yunhang Shen, Mengdan Zhang, et al. Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3, 7

  14. [14]

    Temporal localization of actions with actoms

    Adrien Gaidon, Zaid Harchaoui, and Cordelia Schmid. Temporal localization of actions with actoms. IEEE transactions on pattern analysis and machine intelligence, 35(11):2782–2795, 2013. 1

  15. [15]

    Tall: Temporal activity localization via language query

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query. In Proceedings of the IEEE international conference on computer vision, pages 5267–5275, 2017. 1, 3

  16. [16]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4d: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18995–19012, 2022. 1, 6

  17. [17]

    Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding

    Yongxin Guo, Jingyu Liu, Mingda Li, Dingxin Cheng, Xiaoying Tang, Dianbo Sui, Qingbin Liu, Xi Chen, and Kevin Zhao. Vtg-llm: Integrating timestamp knowledge into video llms for enhanced video temporal grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 3302–3310,

  18. [18]

    Trace: Temporal grounding video llm via causal event modeling

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang. Trace: Temporal grounding video llm via causal event modeling. arXiv preprint arXiv:2410.05643, 2024. 2, 3, 8, 9, 14, 15

  19. [19]

    Revisionllm: Recursive vision-language model for temporal grounding in hour-long videos

    Tanveer Hannan, Md Mohaiminul Islam, Jindong Gu, Thomas Seidl, and Gedas Bertasius. Revisionllm: Recursive vision-language model for temporal grounding in hour-long videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3 10

  20. [20]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 8, 14

  21. [21]

    Vtimellm: Empower llm to grasp video moments

    Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages 14271–14280, 2024. 8

  22. [22]

    Knowing where to focus: Event-aware transformer for video grounding

    Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn. Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 13846–13856, 2023. 1, 2, 3, 7, 8

  23. [23]

    Gonzalez, Hao Zhang, and Ion Stoica

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,

  24. [24]

    Retrieving actions in movies

    Ivan Laptev and Patrick Pérez. Retrieving actions in movies. In 2007 IEEE 11th International Conference on Computer Vision, pages 1–8. IEEE, 2007. 1

  25. [25]

    imove: Instance-motion-aware video understanding

    Jiaze Li, Yaya Shi, Zongyang Ma, Haoran Xu, Feng Cheng, Huihui Xiao, Ruiwen Kang, Fan Yang, Tingting Gao, and Di Zhang. imove: Instance-motion-aware video understanding. arXiv preprint arXiv:2502.11594,

  26. [26]

    Mvbench: A comprehensive multi-modal video understanding benchmark

    Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, et al. Mvbench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22195–22206, 2024. 3, 7

  27. [27]

    Videochat-flash: Hierarchical compression for long-context video modeling

    Xinhao Li, Yi Wang, Jiashuo Yu, Xiangyu Zeng, Yuhan Zhu, Haian Huang, Jianfei Gao, Kunchang Li, Yinan He, Chenting Wang, Yu Qiao, Yali Wang, and Limin Wang. Videochat-flash: Hierarchical compression for long-context video modeling. arXiv preprint arXiv:2501.00574, 2024. 3, 6, 8, 14, 15

  28. [28]

    Improved visual-spatial reasoning via r1-zero-like training

    Zhenyi Liao, Qingsong Xie, Yanhao Zhang, Zijian Kong, Haonan Lu, Zhenyu Yang, and Zhijie Deng. Improved visual-spatial reasoning via r1-zero-like training. arXiv preprint arXiv:2504.00883, 2025. 4

  29. [29]

    Egocentric video-language pretraining

    Kevin Qinghong Lin, Jinpeng Wang, Mattia Soldan, Michael Wray, Rui Yan, Eric Z Xu, Difei Gao, Rong-Cheng Tu, Wenzhe Zhao, Weijie Kong, et al. Egocentric video-language pretraining. Advances in Neural Information Processing Systems, 35:7575–7586, 2022. 1, 3

  30. [30]

    Univtg: Towards unified video-language temporal grounding

    Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2794–2804, 2023. 1, 8

  31. [31]

    TempCompass: Do Video LLMs Really Understand Videos?

    Yuanxin Liu, Shicheng Li, Yi Liu, Yuxiang Wang, Shuhuai Ren, Lei Li, Sishuo Chen, Xu Sun, and Lu Hou. Tempcompass: Do video llms really understand videos? arXiv preprint arXiv:2403.00476, 2024. 7

  32. [32]

    Visual-RFT: Visual Reinforcement Fine-Tuning

    Ziyu Liu, Zeyi Sun, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Haodong Duan, Dahua Lin, and Jiaqi Wang. Visual-rft: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025. 4

  33. [33]

    Egoschema: A diagnostic benchmark for very long-form video language understanding

    Karttikeya Mangalam, Raiymbek Akshulakov, and Jitendra Malik. Egoschema: A diagnostic benchmark for very long-form video language understanding. Advances in Neural Information Processing Systems, 36:46212–46244, 2023. 7

  34. [34]

    Walk these ways: Tuning robot control for generalization with multiplicity of behavior

    Gabriel B Margolis and Pulkit Agrawal. Walk these ways: Tuning robot control for generalization with multiplicity of behavior. In Conference on Robot Learning, pages 22–31. PMLR, 2023. 3

  35. [35]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Botian Shi, Wenhai Wang, Junjun He, Kaipeng Zhang, et al. Mm-eureka: Exploring visual aha moment with rule-based large-scale reinforcement learning. arXiv preprint arXiv:2503.07365, 2025. 4

  36. [36]

    Howto100m: Learning a text-video embedding by watching hundred million narrated video clips

    Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE/CVF international conference on computer vision, pages 2630–2640, 2019. 6

  37. [37]

    Snag: Scalable and accurate video grounding

    Fangzhou Mu, Sicheng Mo, and Yin Li. Snag: Scalable and accurate video grounding. 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18930–18940, 2024. 3, 8

  38. [38]

    Queryd: A video dataset with high-quality text and audio narrations

    Andreea-Maria Oncescu, Joao F Henriques, Yang Liu, Andrew Zisserman, and Samuel Albanie. Queryd: A video dataset with high-quality text and audio narrations. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2265–2269. IEEE, 2021. 6

  39. [39]

    Openai o1, 2024

    OpenAI. Openai o1, 2024. 2, 4

  40. [40]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022. 4 11

  41. [41]

    Chatvtg: Video temporal grounding via chat with video dialogue large language models

    Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1847–1856, 2024. 3, 8

  42. [42]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PmLR,

  43. [43]

    Grounding action descriptions in videos

    Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics, 1:25–36, 2013. 6

  44. [44]

    Timechat: A time-sensitive multimodal large language model for long video understanding

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14313–14323, 2024. 1, 3, 6, 8, 14, 15

  45. [45]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024. 16

  46. [46]

    Hollywood in homes: Crowdsourcing data collection for activity understanding

    Gunnar A Sigurdsson, Gül Varol, Xiaolong Wang, Ali Farhadi, Ivan Laptev, and Abhinav Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of the European Conference on Computer Vision (ECCV), 2016. 1, 3, 6, 7

  47. [47]

    Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm

    David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017. 3

  48. [48]

    Reason-rft: Reinforcement fine-tuning for visual reasoning

    Huajie Tan, Yuheng Ji, Xiaoshuai Hao, Minglan Lin, Pengwei Wang, Zhongyuan Wang, and Shanghang Zhang. Reason-rft: Reinforcement fine-tuning for visual reasoning. arXiv preprint arXiv:2503.20752,

  49. [49]

    InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

    Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, et al. Internvid: A large-scale video-text dataset for multimodal understanding and generation. arXiv preprint arXiv:2307.06942, 2023. 6

  50. [50]

    Hawkeye: Training video-text llms for grounding text in videos, 2024

    Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos, 2024. 8

  51. [51]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  52. [52]

    Number it: Temporal grounding videos like flipping manga

    Yongliang Wu, Xinting Hu, Yuyang Sun, Yizhou Zhou, Wenbo Zhu, Fengyun Rao, Bernt Schiele, and Xu Yang. Number it: Temporal grounding videos like flipping manga. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025. 3

  53. [53]

    Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

    Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 2

  54. [54]

    Vid2seq: Large-scale pretraining of a visual language model for dense video captioning

    Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10714–10726, 2023. 6, 14

  55. [55]

    Egolife: Towards egocentric life assistant

    Jingkang Yang, Shuai Liu, Hongming Guo, Yuhao Dong, Xiamengwei Zhang, Sicheng Zhang, Pengyun Wang, Zitang Zhou, Binzhu Xie, Ziyue Wang, Bei Ouyang, Zhengyu Lin, Marco Cominelli, Zhongang Cai, Yuanhan Zhang, Peiyuan Zhang, Fangzhou Hong, Joerg Widmer, Francesco Gringoli, Lei Yang, Bo Li, and Ziwei Liu. Egolife: Towards egocentric life assistant. In Proceed...

  56. [56]

    DAPO: An Open-Source LLM Reinforcement Learning System at Scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Tiantian Fan, Gaohong Liu, Lingjun Liu, Xin Liu, et al. Dapo: An open-source llm reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025. 4, 14, 16

  57. [57]

    Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback

    Tianyu Yu, Yuan Yao, Haoye Zhang, Taiwen He, Yifeng Han, Ganqu Cui, Jinyi Hu, Zhiyuan Liu, Hai-Tao Zheng, Maosong Sun, et al. Rlhf-v: Towards trustworthy mllms via behavior alignment from fine-grained correctional human feedback. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13807–13816, 2024. 4 12

  58. [58]

    A closer look at temporal sentence grounding in videos: Dataset and metric

    Yitian Yuan, Xiaohan Lan, Xin Wang, Long Chen, Zhi Wang, and Wenwu Zhu. A closer look at temporal sentence grounding in videos: Dataset and metric. In Proceedings of the 2nd international workshop on human-centric multimedia analysis, pages 13–21, 2021. 5

  59. [59]

    Hierarchical video-moment retrieval and step-captioning

    Abhay Zala, Jaemin Cho, Satwik Kottur, Xilun Chen, Barlas Oguz, Yashar Mehdad, and Mohit Bansal. Hierarchical video-moment retrieval and step-captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23056–23065, 2023. 6

  60. [60]

    Timesuite: Improving MLLMs for long video understanding via grounded tuning

    Xiangyu Zeng, Kunchang Li, Chenting Wang, Xinhao Li, Tianxiang Jiang, Ziang Yan, Songze Li, Yansong Shi, Zhengrong Yue, Yi Wang, Yali Wang, Yu Qiao, and Limin Wang. Timesuite: Improving MLLMs for long video understanding via grounded tuning. In The Thirteenth International Conference on Learning Representations, 2025. 1, 3, 6, 8, 14, 15

  61. [61]

    Temporal sentence grounding in videos: A survey and future directions

    Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. Temporal sentence grounding in videos: A survey and future directions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(8):10443– 10465, 2023. 1

  62. [62]

    Multi-scale 2d temporal adjacency networks for moment localization with natural language

    Songyang Zhang, Houwen Peng, Jianlong Fu, Yijuan Lu, and Jiebo Luo. Multi-scale 2d temporal adjacency networks for moment localization with natural language. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021. 7

  63. [63]

    Learning 2d temporal adjacent networks for moment localization with natural language

    Songyang Zhang, Houwen Peng, Jianlong Fu, and Jiebo Luo. Learning 2d temporal adjacent networks for moment localization with natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020. 8

  64. [64]

    Tinyllava-video-r1: Towards smaller lmms for video reasoning

    Xingjian Zhang, Siwei Wen, Wenjun Wu, and Lei Huang. Tinyllava-video-r1: Towards smaller lmms for video reasoning. arXiv preprint arXiv:2504.09641, 2025. 4

  65. [65]

    Videoexpert: Augmented llm for temporal-sensitive video understanding

    Henghao Zhao, Ge-Peng Ji, Rui Yan, Huan Xiong, and Zechao Li. Videoexpert: Augmented llm for temporal-sensitive video understanding. arXiv preprint arXiv:2504.07519, 2025. 2

  66. [66]

    goes back to the pink bucket

    Jiahao Zhu, Daizong Liu, Pan Zhou, Xing Di, Yu Cheng, Song Yang, Wenzheng Xu, Zichuan Xu, Yao Wan, Lichao Sun, and Zeyu Xiong. Rethinking the video sampling and reasoning strategies for temporal sentence grounding. In Findings of the Association for Computational Linguistics: EMNLP 2022, 2022. 8, 9 13 A Limitations 14 B Implementation Details 14 C Additio...

  67. [67]

    The man is initially seen standing on a skateboard

  68. [68]

    He then proceeds to tie his shoelaces

  69. [69]

    After tying his shoelaces, he glides on the skateboard. Given this sequence, the correct answer is: (D) gliding on the skateboard and then tying his shoelaces Time-R1 thinking (ours) : The sequence of actions performed by the man in the video is as follows: first, he ties his shoelaces, and then he glides on the skateboard. This can be observed from the v...

  70. [70]

    The video shows a pineapple on a pink surface

  71. [71]

    A hand appears in the frame and interacts with the pineapple

  72. [72]

    Given this analysis, the pineapple is indeed being pushed forward by a person

    The hand lifts the pineapple, indicating that it is being moved by the person. Given this analysis, the pineapple is indeed being pushed forward by a person. Time-R1 thinking (ours) : The video shows a hand reaching into the frame and grasping the pineapple. The hand then lifts the pineapple, indicating that the pineapple is not being pushed forward by a ...

  73. [73]

    This is the first major action

    Folding the Dress: The video starts with C folding a blue dress on the ironing board. This is the first major action

  74. [74]

    Placing the Dress on the Ironing Board: After folding the dress, C places it on the ironing board

  75. [75]

    Ironing the Dress: C then irons the dress, ensuring it is smooth and free of wrinkles

  76. [76]

    Now, let's evaluate the options: (A) C folds the dress, places it on the ironing board, and then hangs it up

    Hanging the Dress: Finally, C hangs the dress up, completing the task. Now, let's evaluate the options: (A) C folds the dress, places it on the ironing board, and then hangs it up. (B) C folds the dress, places it on the ironing board, and then packs it. (C) C folds the dress, places it on the ironing board, and then washes it. (D) C folds the dress, plac...

  77. [77]

    - Examples: - person opens a book over their head

    Human Action (Simple) - Definition: Singular physical movements or basic interactions. - Examples: - person opens a book over their head. - The person gets out some ginger. - who did I talk to in the shopping mall?

  78. [78]

    - Examples: - He is talking while several people are using rowing machines

    Human Action (Complex) - Definition: Single continuous event with intricate components or concurrent elements. - Examples: - He is talking while several people are using rowing machines. - One man wearing blue shirt wearing a jumping leg extension and another man wearing red pants play on a field. - who did I interact with when I did activity of fixing ca...

  79. [79]

    contains multiple actions, each with a clear start a nd end

    Human Action (procedural) - Definition: contains multiple sequential events with explicit temporal boundaries. contains multiple actions, each with a clear start a nd end. - Examples: - The person procures a condiment from the pantry, takes a spoon from the drawer which he uses to scoop it into the pan, then returns the condiment to the pantry, places the...

  80. [80]

    Posture descriptors, positional prepositions - Examples: - Several other people are in the background working out on the equipment

    Human Pose - Definition: Static body positions or group configurations. Posture descriptors, positional prepositions - Examples: - Several other people are in the background working out on the equipment. - A young child is seen standing before a set of monkey bars

Showing first 80 references.