VideoExpert: Augmented LLM for temporal-sensitive video understanding. CoRR, abs/2504.07519
2 Pith papers cite this work. Polarity classification is still indexing.
Citation-role summary: baseline 1
Fields: cs.CV 2
Verdicts: UNVERDICTED 2
Roles: baseline 1
Polarities: baseline 1
Representative citing papers
Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.
citing papers explorer
- A Paradigm Shift: Fully End-to-End Training for Temporal Sentence Grounding in Videos
  Fully end-to-end training with a sentence-conditioned adapter outperforms frozen-backbone baselines for localizing video segments that match sentence queries.
- Time-R1: Post-Training Large Vision Language Model for Temporal Video Grounding
  Time-R1 applies RL with verifiable rewards to post-train LVLMs for temporal video grounding, reaching state-of-the-art results on multiple datasets using only 2.5K samples while also improving general video capabilities.