MSR-VTT: A Large Video Description Dataset for Bridging Video and Language
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
VeRVE: Versatile Retrieval for Videos via Unified Embeddings
VeRVE uses a shared MLLM backbone with contrastive alignment and LoRA training to surpass other MLLM methods on zero-shot video retrieval while enabling competitive moment retrieval and state-of-the-art composed retrieval without further training.