GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

Linlin Zong; Nanding Wu; Wenxin Liang; Xianchao Zhang; Xinyue Liu; Yanyang Li; Yunzhuo Sun

arXiv: 2603.22121 · v2 · pith:PAXOKVOLnew · submitted 2026-03-23 · cs.CV · cs.AI

keywords: retrieval, temporal, genspan, moment, motion, video, corpus, generation-calibrated

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated VCMR framework that constructs short auxiliary videos from LLM-selected subtitle cues and decomposed sub-events, using these as temporal priors rather than direct retrieval targets. A token selector filters candidate-video features aligned with generated motion, and a bidirectional state-space model efficiently predicts video-moment tuples. Experiments on TVR and ActivityNet-Captions demonstrate that GenSpan improves corpus-level retrieval and moment localization, particularly for complex multi-action queries, while reducing computational cost compared to state-of-the-art multimodal baselines.
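
The abstract does not specify how the token selector scores candidate-video features against the generated motion. As a rough illustration only, and not the authors' implementation, the sketch below assumes each candidate-video token and each generated-prior token is a fixed-size embedding, scores every video token by its best cosine similarity to any prior token, and keeps the top k in temporal order. The function name `select_tokens`, the cosine scoring rule, and the parameter `k` are all hypothetical.

```python
import numpy as np


def select_tokens(video_tokens, prior_tokens, k):
    """Keep the k video tokens most similar to any generated-prior token.

    Hypothetical sketch: the scoring rule (max cosine similarity) is an
    assumption, not something stated in the abstract.
    """
    # L2-normalise rows so that dot products become cosine similarities.
    v_norm = np.maximum(np.linalg.norm(video_tokens, axis=1, keepdims=True), 1e-8)
    p_norm = np.maximum(np.linalg.norm(prior_tokens, axis=1, keepdims=True), 1e-8)
    v = video_tokens / v_norm
    p = prior_tokens / p_norm

    sim = v @ p.T              # shape: (num_video_tokens, num_prior_tokens)
    score = sim.max(axis=1)    # best-matching prior token for each video token

    # Take the k highest-scoring indices, then sort them so the kept tokens
    # stay in their original temporal order.
    keep = np.sort(np.argsort(-score)[:k])
    return keep, video_tokens[keep]


if __name__ == "__main__":
    video = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
    prior = np.array([[1.0, 0.0]])
    idx, kept = select_tokens(video, prior, k=2)
    print(idx, kept.shape)
```

Restoring temporal order after the top-k step matters in this setting because a downstream sequence model, such as the bidirectional state-space model mentioned above, would consume the kept tokens as an ordered sequence.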
