Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
Use what you have: Video retrieval using representations from collaborative experts
4 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.
VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.
citing papers explorer
-
Adversarial Video Promotion Against Text-to-Video Retrieval
Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.
-
Demystifying CLIP Data
MetaCLIP curates balanced 400M-pair subsets from CommonCrawl that outperform CLIP data, reaching 70.8% zero-shot ImageNet accuracy on ViT-B versus CLIP's 68.3%.
-
Text-Video Retrieval With Global-Local Contrastive Consistency Learning
GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.
-
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.