Improving video-text retrieval by multi-stream corpus alignment and dual softmax loss

Xing Cheng, Hezheng Lin, Xiangyu Wu, Fan Yang, Dong Shen · 2021 · arXiv 2109.04290

4 Pith papers cite this work. Polarity classification is still being indexed.

citation-role summary

background: 1 · method: 1

citation-polarity summary

background: 2

representative citing papers

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions

cs.CV · 2025-03-18 · unverdicted · novelty 7.0

Stitch-a-Demo is a retrieval-based method that assembles visually coherent video demonstrations from multistep textual descriptions by training on weakly supervised procedural data with hard negatives.

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language

cs.CV · 2022-04-01 · unverdicted · novelty 7.0

Socratic Models compose zero-shot multimodal reasoning by prompting pretrained language and vision models to exchange information and enable new capabilities without finetuning.

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis

cs.IR · 2026-03-07 · unverdicted · novelty 6.0

Short, simple captions describing single actions achieve higher retrieval recall than complex multi-step or fine-grained scene descriptions across all tested models.

InternVideo: General Video Foundation Models via Generative and Discriminative Learning

cs.CV · 2022-12-06 · unverdicted · novelty 5.0

InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets, including 91.1% top-1 accuracy on Kinetics-400.

citing papers explorer

Showing 4 of 4 citing papers.

Stitch-a-Demo: Video Demonstrations from Multistep Descriptions
cs.CV · 2025-03-18 · unverdicted · none · ref 12

Socratic Models: Composing Zero-Shot Multimodal Reasoning with Language
cs.CV · 2022-04-01 · unverdicted · none · ref 73

Understanding the Performance Plateau in Text-to-Video Retrieval: A Comprehensive Empirical and Linguistic Analysis
cs.IR · 2026-03-07 · unverdicted · none · ref 59

InternVideo: General Video Foundation Models via Generative and Discriminative Learning
cs.CV · 2022-12-06 · unverdicted · none · ref 91
