pith. machine review for the scientific record. sign in

arxiv: 2605.02623 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.MM

Recognition: 2 theorem links

· Lean Theorem

Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:43 UTC · model grok-4.3

classification 💻 cs.CV cs.MM
keywords Generalized Moment RetrievalVideo Moment RetrievalBenchmarkSoccer VideosMultimodal Large Language ModelsTemporal LocalizationVideo Understanding
0
0 comments X

The pith

Generalized moment retrieval requires returning every matching video segment or an empty set when none match.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard video moment retrieval assumes each natural language query matches exactly one segment. This paper formulates Generalized Moment Retrieval to require the complete set of matches or an empty prediction for queries with multiple or zero matches. They introduce the Soccer-GMR benchmark built from soccer videos through a scalable semi-automated annotation pipeline with human verification. A unified evaluation protocol covers null-set rejection, localization, and overall performance. Baselines using a plug-and-play adapter for existing models and a tailored reward for multimodal large language models show gains while revealing current limitations.

Core claim

We formulate Generalized Moment Retrieval (GMR) as a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. We introduce Soccer-GMR, a large-scale benchmark on soccer videos constructed via a duration-flexible semi-automated pipeline with human verification, along with a unified evaluation protocol and baselines consisting of a lightweight GMR adapter for discriminative VMR models and a GMR-tailored GRPO reward for fine-tuning multimodal large language models.

What carries the argument

Generalized Moment Retrieval (GMR), the unified task of returning all relevant moments or an empty set, carried by the Soccer-GMR benchmark and the GMR adapter plus GRPO reward.

If this is right

  • Discriminative VMR models gain the ability to handle multiple moments and null queries through the plug-and-play GMR adapter.
  • Multimodal large language models improve on GMR tasks when fine-tuned with the GRPO reward.
  • Evaluation now requires complementary metrics for null-set rejection, positive-query localization, and end-to-end GMR performance.
  • Current methods exhibit consistent gains yet still show limitations on realistic queries with variable moment counts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The semi-automated annotation approach could scale to other video domains to test whether GMR performance generalizes beyond sports footage.
  • Practical video search systems may need explicit mechanisms for empty-set prediction to avoid returning irrelevant results on ambiguous queries.
  • Future models could be designed to output variable-length moment sets directly rather than adapting single-moment architectures.

Load-bearing premise

Soccer videos and the duration-flexible semi-automated pipeline with human verification produce annotations representative of real-world queries with multiple or null moments.

What would settle it

Apply the trained models to a benchmark from a different domain such as news or instructional videos and measure whether null-set rejection accuracy drops substantially.

Figures

Figures reproduced from arXiv: 2605.02623 by Luyuan Jiao, Lu Zhang, Siyu Cao, Yiming Ding, Yixuan Li, Zhiyong Liu, Zitong Wang.

Figure 1
Figure 1. Figure 1: Three retrieval scenarios in Generalized Moment Retrieval (GMR). Given a video and a natural language query, the view at source ↗
Figure 2
Figure 2. Figure 2: Duration-flexible semi-automated pipeline for GMR data construction. Stage I applies LLMs to extract structured view at source ↗
Figure 3
Figure 3. Figure 3: Statistics of Soccer-GMR. Query types include null view at source ↗
Figure 4
Figure 4. Figure 4: Architecture of the GMR Adapter. A parallel ex view at source ↗
Figure 5
Figure 5. Figure 5: Gymnastics-GMR: Query types mix after balancing view at source ↗
Figure 6
Figure 6. Figure 6: Gymnastics-GMR: positive segment durations in view at source ↗
read the original abstract

Video Moment Retrieval (VMR) aims to localize temporal segments in videos that correspond to a natural language query, but typically assumes only a single matching moment for each query. This assumption does not always hold in real-world scenarios, where queries may correspond to multiple or no moments. Thus, we formulate Generalized Moment Retrieval (GMR), a unified setting that requires retrieving the complete set of relevant moments or predicting an empty set. To enable systematic study of GMR, we introduce Soccer-GMR, a large-scale benchmark built on challenging soccer videos that reflect general GMR scenarios, with realistic negative and positive queries. The benchmark is constructed via a duration-flexible semi-automated pipeline with human verification, enabling scalable data generation while maintaining high annotation quality. We further design a unified evaluation protocol with complementary metrics tailored for null-set rejection, positive-query localization, and end-to-end GMR performance. Finally, we establish strong baselines across two modeling paradigms: a lightweight plug-and-play GMR adapter for discriminative VMR models, and a GMR-tailored GRPO reward for fine-tuning multimodal large language models (MLLMs). Extensive experiments show consistent gains across all metrics and expose key limitations of current methods, positioning GMR as a more realistic and challenging benchmark for video-language understanding.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that conventional Video Moment Retrieval (VMR) assumes a single matching moment per query, which fails to capture real-world cases with multiple or zero relevant moments. It formulates Generalized Moment Retrieval (GMR) as a unified task requiring retrieval of the complete set of relevant moments or an empty set, introduces the Soccer-GMR benchmark built from soccer videos via a duration-flexible semi-automated pipeline with human verification, proposes a unified evaluation protocol with metrics for null-set rejection, positive localization, and end-to-end performance, and presents two baselines (a lightweight GMR adapter for discriminative VMR models and a GRPO reward for MLLM fine-tuning). Extensive experiments are reported to show consistent gains and to highlight limitations of prior methods.

Significance. If the benchmark construction and reported gains hold under scrutiny, the work is significant for shifting video-language research toward more realistic query scenarios that include null and multi-moment cases. The introduction of a large-scale benchmark, tailored evaluation protocol, and cross-paradigm baselines (discriminative adapter plus generative reward) provides concrete tools for the community and could expose systematic weaknesses in existing VMR approaches. The semi-automated annotation pipeline is a practical contribution if its quality controls are adequately documented.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (Benchmark Construction): the assertion that soccer videos 'reflect general GMR scenarios' and that the resulting annotations are representative of real-world multi-moment or null queries is load-bearing for the claim that Soccer-GMR enables systematic study of GMR. The manuscript provides no cross-domain statistics, ablation on query-language diversity, or comparison against egocentric/narrative/surveillance video distributions, leaving the generalization risk unaddressed.
  2. [§4.1] §4.1 (GMR Adapter): the description of the plug-and-play adapter does not specify how the model is modified to output variable-length sets or to predict the empty set; without these architectural or loss-function details it is impossible to determine whether the reported gains arise from the adapter itself or from incidental changes in training.
  3. [§5] §5 (Experiments): the claim of 'consistent gains across all metrics' is central, yet the manuscript does not report ablations isolating the contribution of the human-verification step in the annotation pipeline or the effect of query-diversity controls; without these, the robustness of the null-set and multi-moment results cannot be verified.
minor comments (2)
  1. [Abstract] The abstract refers to 'realistic negative and positive queries' without a concise definition or example; adding one sentence would improve clarity.
  2. [§2 or §4] Notation for the empty-set prediction and the complementary metrics (null-set rejection, positive-query localization) should be introduced once in §2 or §4 and used consistently thereafter.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for improving clarity on generalization, model details, and experimental robustness. We address each point below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): the assertion that soccer videos 'reflect general GMR scenarios' and that the resulting annotations are representative of real-world multi-moment or null queries is load-bearing for the claim that Soccer-GMR enables systematic study of GMR. The manuscript provides no cross-domain statistics, ablation on query-language diversity, or comparison against egocentric/narrative/surveillance video distributions, leaving the generalization risk unaddressed.

    Authors: We agree that stronger justification for generalization is needed. Soccer videos were chosen because their event-rich structure naturally produces multi-moment and null-set queries, as described in §3. In the revision, we will expand the abstract and §3 with additional query-diversity statistics, qualitative examples, and a dedicated discussion of how these scenarios align with broader GMR use cases. Comprehensive cross-domain comparisons are not feasible without new data collection and fall outside the current scope; we will explicitly note this limitation and outline directions for future work. revision: partial

  2. Referee: [§4.1] §4.1 (GMR Adapter): the description of the plug-and-play adapter does not specify how the model is modified to output variable-length sets or to predict the empty set; without these architectural or loss-function details it is impossible to determine whether the reported gains arise from the adapter itself or from incidental changes in training.

    Authors: We apologize for the omitted details. The GMR adapter augments the base VMR model with a set-prediction head that employs bipartite matching loss to handle variable-length outputs and includes an explicit null-set prediction branch (via a learned threshold on a dedicated logit). We will insert the full architectural diagram, modified loss formulation, and training procedure into the revised §4.1 so that the source of the gains is transparent. revision: yes

  3. Referee: [§5] §5 (Experiments): the claim of 'consistent gains across all metrics' is central, yet the manuscript does not report ablations isolating the contribution of the human-verification step in the annotation pipeline or the effect of query-diversity controls; without these, the robustness of the null-set and multi-moment results cannot be verified.

    Authors: We concur that targeted ablations would strengthen the claims. We will add an ablation on the human-verification step performed on a held-out subset, reporting its effect on both annotation quality and downstream GMR metrics. We will also incorporate analysis of query-diversity controls (e.g., stratified subsets) into §5 to demonstrate robustness of the null-set and multi-moment results. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new task and benchmark are self-contained contributions

full rationale

The paper formulates a new task (GMR), constructs a benchmark (Soccer-GMR) via a described pipeline, defines an evaluation protocol, and adapts existing models as baselines. No equations, derivations, or predictions are presented that reduce to fitted parameters, self-definitions, or self-citation chains from the same inputs. The central claims rest on the novelty of the unified setting and empirical results on the new data, without load-bearing reductions to prior fitted quantities or ansatzes imported from the authors' own work. This is a standard constructive benchmark paper whose contributions are independent of any circular loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the new task definition and benchmark construction; it builds on existing VMR models and MLLM fine-tuning techniques without introducing additional free parameters or invented physical entities.

axioms (1)
  • domain assumption Existing discriminative VMR models and MLLMs can be adapted to the GMR setting via lightweight modules or reward functions.
    The baselines assume prior VMR and MLLM architectures provide a suitable foundation for the new unified retrieval task.
invented entities (2)
  • Generalized Moment Retrieval (GMR) no independent evidence
    purpose: Unified task requiring retrieval of any number of relevant moments or an empty set
    New formulation introduced to address limitations of single-moment VMR.
  • Soccer-GMR benchmark no independent evidence
    purpose: Large-scale dataset for evaluating GMR on challenging soccer videos
    Constructed specifically for this work via semi-automated pipeline.

pith-pipeline@v0.9.0 · 5545 in / 1369 out tokens · 39122 ms · 2026-05-08T18:43:27.208141+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 9 canonical work pages · 2 internal anchors

  1. [1]

    Adnen Abdessaied, Anna Rohrbach, Marcus Rohrbach, and Andreas Bulling

  2. [2]

    InProceedings of the Computer Vision and Pattern Recognition Conference

    Vˆ 2Dial: Unification of Video and Visual Dialog via Multimodal Experts. InProceedings of the Computer Vision and Pattern Recognition Conference. 8637– 8647

  3. [3]

    Adnen Abdessaied, Lei Shi, and Andreas Bulling. 2024. Multi-modal video dialog state tracking in the wild. InEuropean Conference on Computer Vision. Springer, 348–365. Ding et al. Table 9: Data Statistics of Gymnastics-GMR. ★ Duration-flexible instantiation:300swindows here versus150sin Soccer-GMR. Dataset Domain # Queries # Moments / # Videos Avg. Moment...

  4. [4]

    Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. 2017. Localizing moments in video with natural language. In Proceedings of the IEEE international conference on computer vision. 5803–5812

  5. [5]

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al . 2025. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631(2025)

  6. [6]

    Zhuo Cao, Heming Du, Bingqing Zhang, Xin Yu, Xue Li, and Sen Wang. 2025. When One Moment Isn’t Enough: Multi-Moment Retrieval with Cross-Moment Interactions.arXiv preprint arXiv:2510.17218(2025)

  7. [7]

    Zhuo Cao, Bingqing Zhang, Heming Du, Xin Yu, Xue Li, and Sen Wang. 2025. Flashvtg: Feature layering and adaptive score handling network for video tempo- ral grounding. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). IEEE, 9226–9236

  8. [8]

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexan- der Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. InEuropean conference on computer vision. Springer, 213–229

  9. [9]

    Houlun Chen, Xin Wang, Hong Chen, Zeyang Zhang, Wei Feng, Bin Huang, Jia Jia, and Wenwu Zhu. 2024. Verified: A video corpus moment retrieval benchmark for fine-grained video understanding.Advances in Neural Information Processing Systems37 (2024), 40393–40406

  10. [10]

    Qirui Chen, Shangzhe Di, and Weidi Xie. 2025. Grounded multi-hop videoqa in long-form egocentric videos. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 39. 2159–2167

  11. [11]

    Weixing Chen, Yang Liu, Binglin Chen, Jiandong Su, Yongsen Zheng, and Liang Lin. 2025. Cross-modal causal relation alignment for video question grounding. InProceedings of the Computer Vision and Pattern Recognition Conference. 24087– 24096

  12. [12]

    Xianke Chen, Daizong Liu, Xun Yang, Xirong Li, Jianfeng Dong, Meng Wang, and Xun Wang. 2025. Prvr: Partially relevant video retrieval.IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

  13. [13]

    Adrien Deliege, Anthony Cioppa, Silvio Giancola, Meisam J Seikavandi, Jacob V Dueholm, Kamal Nasrollahi, Bernard Ghanem, Thomas B Moeslund, and Marc Van Droogenbroeck. 2021. Soccernet-v2: A dataset and benchmarks for holis- tic understanding of broadcast soccer videos. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4508–4519

  14. [14]

    Andong Deng, Tongjia Chen, Shoubin Yu, Taojiannan Yang, Lincoln Spencer, Yapeng Tian, Ajmal Saeed Mian, Mohit Bansal, and Chen Chen. 2025. Motion- grounded video reasoning: Understanding and perceiving motion at pixel level. InProceedings of the Computer Vision and Pattern Recognition Conference. 8625– 8636

  15. [15]

    Xiang Fang, Wanlong Fang, Daizong Liu, Xiaoye Qu, Jianfeng Dong, Pan Zhou, Renfu Li, Zichuan Xu, Lixing Chen, Panpan Zheng, et al. 2024. Not all inputs are valid: Towards open-set video moment retrieval using language. InProceedings of the 32nd ACM International Conference on Multimedia. 28–37

  16. [16]

    Tom Fawcett. 2006. An introduction to ROC analysis.Pattern recognition letters 27, 8 (2006), 861–874

  17. [17]

    Christoph Feichtenhofer, Haoqi Fan, Jitendra Malik, and Kaiming He. 2019. Slow- Fast Networks for Video Recognition. InProceedings of the IEEE/CVF International Conference on Computer Vision. 6202–6211

  18. [18]

    Kevin Flanagan, Dima Damen, and Michael Wray. 2025. Moment of Untruth: Dealing with Negative Queries in Video Moment Retrieval. In2025 IEEE/CVF Winter Conference on Applications of Computer Vision (W ACV). IEEE, 5336–5345

  19. [19]

    Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. 2017. TALL: Temporal Activity Localization via Language Query. InICCV

  20. [20]

    Yongxin Guo, Jingyu Liu, Mingda Li, Qingbin Liu, Xi Chen, and Xiaoying Tang

  21. [21]

    Trace: Temporal grounding video llm via causal event modeling.arXiv preprint arXiv:2410.05643(2024)

  22. [22]

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. 2022. Lora: Low-rank adaptation of large language models.Iclr1, 2 (2022), 3

  23. [23]

    Jinhyun Jang, Jungin Park, Jin Kim, Hyeongjun Kwon, and Kwanghoon Sohn

  24. [24]

    In Proceedings of the IEEE/CVF International Conference on Computer Vision

    Knowing where to focus: Event-aware transformer for video grounding. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 13846– 13856

  25. [25]

    Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles

  26. [26]

    InProceedings of the IEEE international conference on computer vision

    Dense-captioning events in videos. InProceedings of the IEEE international conference on computer vision. 706–715

  27. [27]

    Yogesh Kumar, Uday Agarwal, Manish Gupta, and Anand Mishra. 2025. Aligning moments in time using video queries. InProceedings of the IEEE/CVF International Conference on Computer Vision. 20215–20225

  28. [28]

    Xiaohan Lan, Yitian Yuan, Xin Wang, Zhi Wang, and Wenwu Zhu. 2023. A survey on temporal sentence grounding in videos.ACM Transactions on Multimedia Computing, Communications and Applications19, 2 (2023), 1–33

  29. [29]

    Jungsoo Lee, Janghoon Cho, Hyojin Park, Munawar Hayat, Kyuwoong Hwang, Fatih Porikli, and Sungha Choi. 2025. Generalized contrastive learning for universal multimodal retrieval.arXiv preprint arXiv:2509.25638(2025)

  30. [30]

    Jie Lei, Tamara L Berg, and Mohit Bansal. 2021. Detecting moments and highlights in videos via natural language queries.Advances in Neural Information Processing Systems34 (2021), 11846–11858

  31. [31]

    Jie Lei, Licheng Yu, Tamara L Berg, and Mohit Bansal. 2020. Tvr: A large-scale dataset for video-subtitle moment retrieval. InEuropean Conference on Computer Vision. Springer, 447–463

  32. [32]

    Juncheng Li, Junlin Xie, Long Qian, Linchao Zhu, Siliang Tang, Fei Wu, Yi Yang, Yueting Zhuang, and Xin Eric Wang. 2022. Compositional temporal grounding with structured variational cross-graph correspondence learning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 3032– 3041

  33. [33]

    Renjie Liang, Chongzhi Zhang, Li Li, Jing Wang, Xizhou Zhu, and Aixin Sun

  34. [34]

    InProceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region

    Tvr-ranking: A dataset for ranked video moment retrieval with imprecise queries. InProceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region. 231–239

  35. [35]

    Huabin Liu, Filip Ilievski, and Cees GM Snoek. 2025. Commonsense video ques- tion answering through video-grounded entailment tree reasoning. InProceedings of the Computer Vision and Pattern Recognition Conference. 3262–3271

  36. [36]

    Meng Liu, Liqiang Nie, Yunxiao Wang, Meng Wang, and Yong Rui. 2023. A survey on video moment localization.Comput. Surveys55, 9 (2023), 1–37

  37. [37]

    Hongxu Ma, Guanshuo Wang, Fufu Yu, Qiong Jia, and Shouhong Ding. 2025. Ms-detr: Towards effective video moment retrieval and highlight detection by joint motion-semantic learning. InProceedings of the 33rd ACM International Conference on Multimedia. 4514–4523

  38. [38]

    WonJun Moon, Sangeek Hyun, SuBeen Lee, and Jae-Pil Heo. 2023. Correlation- guided query-dependency calibration for video temporal grounding.arXiv preprint arXiv:2311.08835(2023)

  39. [39]

    WonJun Moon, Sangeek Hyun, SangUk Park, Dongchan Park, and Jae-Pil Heo

  40. [40]

    InProceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Query-dependent video representation for moment retrieval and highlight detection. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition. 23023–23033

  41. [41]

    Shraman Pramanick, Effrosyni Mavroudi, Yale Song, Rama Chellappa, Lorenzo Torresani, and Triantafyllos Afouras. 2025. Enrich and Detect: Video Temporal Grounding with Multimodal LLMs. InProceedings of the IEEE/CVF International Conference on Computer Vision. 24297–24308

  42. [42]

    You Qin, Qilong Wu, Yicong Li, Wei Ji, Li Li, Pengcheng Cai, Lina Wei, and Roger Zimmermann. 2025. Generalized video moment retrieval. InThe Thirteenth International Conference on Learning Representations

  43. [43]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning Transferable Visual Models from Natural Language Supervision. InInternational Conference on Machine Learning. 8748–8763

  44. [44]

    Jiayuan Rao, Haoning Wu, Hao Jiang, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2025. Towards universal soccer video understanding. InProceedings of the Computer Vision and Pattern Recognition Conference. 8384–8394

  45. [45]

    Michaela Regneri, Marcus Rohrbach, Dominikus Wetzel, Stefan Thater, Bernt Schiele, and Manfred Pinkal. 2013. Grounding action descriptions in videos. Transactions of the Association for Computational Linguistics1 (2013), 25–36

  46. [46]

    Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. 2024. Timechat: A time-sensitive multimodal large language model for long video understand- ing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 14313–14323

  47. [47]

    Dian Shao, Yue Zhao, Bo Dai, and Dahua Lin. 2020. FineGym: A Hierarchical Video Dataset for Fine-grained Action Understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  48. [48]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. 2024. Deepseekmath: Pushing Retrieving Any Relevant Moments: Benchmark and Models for Generalized Moment Retrieval the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300(2024)

  49. [49]

    StatsBomb. 2018. StatsBomb Open Data. https://github.com/statsbomb/open- data. Accessed: 2025

  50. [50]

    Jiankang Wang, Zhihan Zhang, Zhihang Liu, Yang Li, Jiannan Ge, Hongtao Xie, and Yongdong Zhang. 2026. Spacevllm: Endowing multimodal large language model with spatio-temporal video grounding capability. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 40. 9912–9920

  51. [51]

    Jongbhin Woo, Hyeonggon Ryu, Youngjoon Jang, Jae Won Cho, and Joon Son Chung. 2024. Let me finish my sentence: Video temporal grounding with holistic text understanding. InProceedings of the 32nd ACM International Conference on Multimedia. 8199–8208

  52. [52]

    Jianlong Wu, Wei Liu, Ye Liu, Meng Liu, Liqiang Nie, Zhouchen Lin, and Chang Wen Chen. 2025. A survey on video temporal grounding with multi- modal large language model.IEEE Transactions on Pattern Analysis and Machine Intelligence(2025)

  53. [53]

    Eric Xing, Pranavi Kolouju, Robert Pless, Abby Stylianou, and Nathan Jacobs

  54. [54]

    InProceedings of the Computer Vision and Pattern Recognition Conference

    Context-cir: Learning from concepts in text for composed image retrieval. InProceedings of the Computer Vision and Pattern Recognition Conference. 19638– 19648

  55. [55]

    Nakyeong Yang, Minsung Kim, Seunghyun Yoon, Joongbo Shin, and Kyomin Jung. 2024. A New Framework for Evaluating Faithfulness of Video Moment Re- trieval against Multiple Distractors. InProceedings of the 33rd ACM International Conference on Information and Knowledge Management. 2869–2878

  56. [56]

    Boqiang Zhang, Kehan Li, Zesen Cheng, Zhiqiang Hu, Yuqian Yuan, Guanzheng Chen, Sicong Leng, Yuming Jiang, Hang Zhang, Xin Li, et al. 2025. Videollama 3: Frontier multimodal foundation models for image and video understanding. arXiv preprint arXiv:2501.13106(2025)

  57. [57]

    Hao Zhang, Aixin Sun, Wei Jing, and Joey Tianyi Zhou. 2023. Temporal sentence grounding in videos: A survey and future directions.IEEE Transactions on Pattern Analysis and Machine Intelligence45, 8 (2023), 10443–10465

  58. [58]

    Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. 2025. Bridging modal- ities: Improving universal multimodal retrieval by multimodal large language models. InProceedings of the Computer Vision and Pattern Recognition Conference. 9274–9285

  59. [59]

    Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, and Chunyuan Li. 2024. Llava-video: Video instruction tuning with synthetic data.arXiv preprint arXiv:2410.02713(2024)

  60. [60]

    Pengcheng Zhao, Zhixian He, Fuwei Zhang, Shujin Lin, and Fan Zhou. 2025. Ld-detr: Loop decoder detection transformer for video moment retrieval and highlight detection.arXiv preprint arXiv:2501.10787(2025)