HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips, 2019
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CV (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
VeRVE: Versatile Retrieval for Videos via Unified Embeddings
VeRVE uses a shared MLLM backbone with contrastive alignment and LoRA training to surpass other MLLM methods on zero-shot video retrieval while enabling competitive moment retrieval and state-of-the-art composed retrieval without further training.