VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
fields: cs.CV 2
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
representative citing papers
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
citing papers explorer
- EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
  EchoPrune prunes video tokens by query relevance and temporal reconstruction error, letting VideoLLMs handle up to 20x more frames under a fixed budget, with reported gains in accuracy and speed.
- InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
  InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.