Disentangled representation learning for text-video retrieval

Qiang Wang, Yanhao Zhang, Yun Zheng, Pan Pan, Xian-Sheng Hua · 2022 · arXiv 2203.07111

5 Pith papers cite this work. Polarity classification is still indexing.


citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Adversarial Video Promotion Against Text-to-Video Retrieval

cs.CV · 2025-08-09 · unverdicted · novelty 7.0

Pioneers ViPro, the first attack to adversarially promote videos in text-to-video retrieval, using Modal Refinement to improve black-box transferability across multiple targets.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

cs.CV · 2022-04-01 · unverdicted · novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

cs.IR · 2026-03-07 · unverdicted · novelty 6.0

Short, simple captions describing single actions achieve higher retrieval recall than complex multi-step or fine-grained scene descriptions across all tested models.

PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance

cs.CV · 2024-11-04 · unverdicted · novelty 6.0

PPLLaVA uses CLIP-based alignment and prompt-guided convolution-style pooling to reduce visual tokens 18x in Video LLMs, achieving SOTA results on captioning, QA, and long-form reasoning benchmarks with higher throughput.

Text-Video Retrieval With Global-Local Contrastive Consistency Learning

cs.IR · 2026-05-18 · unverdicted · novelty 5.0

GLCCL uses a Global-Local Interaction Module and Contrastive Score Consistency loss to align text and video semantics more efficiently than attention-based methods on MSR-VTT, DiDeMo, and VATEX.

citing papers explorer

Showing 5 of 5 citing papers.

Adversarial Video Promotion Against Text-to-Video Retrieval · cs.CV · 2025-08-09 · unverdicted · none · ref 40
Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language · cs.CV · 2022-04-01 · unverdicted · none · ref 94
Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis · cs.IR · 2026-03-07 · unverdicted · none · ref 97
PPLLaVA: Varied Video Sequence Understanding With Prompt Guidance · cs.CV · 2024-11-04 · unverdicted · none · ref 14
Text-Video Retrieval With Global-Local Contrastive Consistency Learning · cs.IR · 2026-05-18 · unverdicted · none · ref 8
