Use what you have: Video retrieval using representations from collaborative experts

2019 · arXiv 1907.13487

4 Pith papers cite this work.

representative citing papers

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

Demystifying CLIP Data

cs.CV · 2023-09-28 · accept · novelty 6.0

MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Text-Video Retrieval With Global-Local Contrastive Consistency Learning

cs.IR · 2026-05-18 · unverdicted · novelty 5.0

GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

cs.CV · 2025-07-07 · unverdicted · novelty 5.0

VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.

citing papers explorer

Showing 4 of 4 citing papers.

Adversarial Video Promotion Against Text-to-Video Retrieval · cs.CV · 2025-08-09 · unverdicted · none · ref 27
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

Demystifying CLIP Data · cs.CV · 2023-09-28 · accept · none · ref 91
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.

Text-Video Retrieval With Global-Local Contrastive Consistency Learning · cs.IR · 2026-05-18 · unverdicted · none · ref 18
GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents · cs.CV · 2025-07-07 · unverdicted · none · ref 16
VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.
