CLIP2TV: An empirical study on transformer-based methods for video-text retrieval

Zijian Gao, Jingyu Liu, Sheng Chen, Dedan Chang, Hao Zhang, Jinwei Yuan · 2021 · arXiv 2111.05610

3 Pith papers cite this work. Polarity classification is still in progress.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

cs.CV · 2022-04-01 · unverdicted · novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

cs.IR · 2026-03-07 · unverdicted · novelty 6.0

Short, simple captions describing single actions achieve higher retrieval recall than complex multi-step or fine-grained scene descriptions across all tested models.

citing papers explorer

Showing 3 of 3 citing papers.

Adversarial Video Promotion Against Text-to-Video Retrieval
cs.CV · 2025-08-09 · unverdicted · none · ref 11

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
cs.CV · 2022-04-01 · unverdicted · none · ref 46

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
cs.IR · 2026-03-07 · unverdicted · none · ref 95
