arxiv: 2601.12193 · v3 · submitted 2026-01-17 · 💻 cs.CV

VeRVE: Versatile Retrieval for Videos via Unified Embeddings

Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords video retrieval · multimodal LLM · contrastive alignment · moment localization · composed queries · LoRA fine-tuning · unified embeddings · reranking

The pith

A shared MLLM backbone with contrastive embeddings unifies corpus-level video retrieval, moment localization, and composed multimodal queries in one model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that contrastive alignment of visual and textual embeddings from a single multimodal LLM backbone, trained efficiently with LoRA on 700K pairs, supports embedding-based search across multiple retrieval tasks. This architecture outperforms other MLLM methods on zero-shot video retrieval and adapts without further training to handle fine-grained moment localization and flexible composed queries. With added reranking training on top of the embeddings, performance reaches levels comparable to specialized state-of-the-art systems. A reader would care because the approach replaces multiple task-specific models with one general system that still delivers strong results on diverse video search needs.

Core claim

VeRVE establishes that contrastive alignment in a shared MLLM backbone creates a unified embedding space for efficient candidate search. The model, trained via LoRA on 700K visual-text pairs, exceeds prior MLLM retrieval methods on zero-shot tasks, transfers directly to moment retrieval and composed queries, and after reranking training matches specialized models while surpassing other MLLM systems.
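
For reference, contrastive alignment of this kind is commonly written as a symmetric InfoNCE loss over a batch of matched pairs. The form below is a generic sketch and not the paper's stated objective: the batch size $B$, the temperature $\tau$, and the unit-normalized visual and text embeddings $v_i$ and $t_i$ are notation assumed here, and the paper's negative-sampling choices are not reproduced.

```latex
\mathcal{L} = -\frac{1}{2B}\sum_{i=1}^{B}\left[
  \log\frac{\exp\left(v_i^\top t_i/\tau\right)}{\sum_{j=1}^{B}\exp\left(v_i^\top t_j/\tau\right)}
  + \log\frac{\exp\left(t_i^\top v_i/\tau\right)}{\sum_{j=1}^{B}\exp\left(t_i^\top v_j/\tau\right)}
\right]
```

The first term pulls each visual embedding toward its own caption and away from the other captions in the batch; the second does the same in the text-to-visual direction.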

What carries the argument

The contrastively aligned visual-textual embedding space from the shared MLLM backbone, which powers fast embedding search followed by optional reranking.
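
The two-stage flow described above, a fast embedding search followed by optional reranking, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the toy 2-D embeddings, and the `pair_score` callable standing in for a trained reranker are all hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def embedding_search(query_emb, corpus_embs, k):
    """Stage 1: rank all corpus items by cosine similarity and keep the top k."""
    q = l2_normalize(query_emb)
    scored = []
    for idx, emb in enumerate(corpus_embs):
        c = l2_normalize(emb)
        scored.append((sum(a * b for a, b in zip(q, c)), idx))
    # Highest similarity first; ties broken by corpus index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]

def rerank(candidates, pair_score):
    """Stage 2 (optional): reorder the shortlist with a costlier pairwise scorer.

    `pair_score` is a hypothetical callable standing in for a trained reranker.
    """
    return sorted(candidates, key=lambda idx: -pair_score(idx))

# Toy data: three 2-D "video" embeddings and one query embedding.
corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0]]
query = [0.9, 0.1]
shortlist = embedding_search(query, corpus, k=2)

# Placeholder reranker scores (assumed values) that prefer item 1.
toy_scores = {0: 0.2, 1: 0.9, 2: 0.1}
final = rerank(shortlist, lambda idx: toy_scores[idx])
print(shortlist, final)
```

The point of the split is cost: stage 1 needs only one dot product per corpus item against precomputed embeddings, so the more expensive pairwise scorer runs on just the k shortlisted candidates.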

If this is right

  • Zero-shot video retrieval exceeds other MLLM-based systems on established benchmarks.
  • The same embeddings transfer to moment-level localization without extra architecture.
  • Composed multimodal queries achieve state-of-the-art zero-shot results.
  • Reranking training closes the gap to specialized retrieval models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • A single model could replace separate systems for different video search scenarios in practice.
  • Larger MLLM backbones might extend the approach to even finer retrieval granularity.
  • The contrastive alignment step may transfer to other multimodal domains beyond video.

Load-bearing premise

The shared MLLM backbone with contrastive alignment generalizes to fine-grained moment localization and composed multimodal queries without task-specific architectural changes or heavy retraining.

What would settle it

The claim would be undermined if, on standard benchmarks such as MSR-VTT or ActivityNet, the embedding search failed to place relevant videos or moments in the top-k results in zero-shot tests, or if reranking training did not reach parity with specialized models.

Figures

Figures reproduced from arXiv: 2601.12193 by Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Shaunak Halbe, Toufiq Parag, Vimal Bhat.

Figure 1
VIRTUE supports corpus-level retrieval with reranking, zero-shot composed video retrieval, and zero-shot moment localization within a single architecture. The table highlights that VIRTUE uniquely offers unified embeddings and versatile capabilities. ⋄ indicates models that, while architecturally capable of processing multimodal inputs, have not demonstrated composed video retrieval capability.
Figure 2
VIRTUE-Embed uses the final hidden state of the EOS token as an embedding anchor, and aligns visual content and text descriptions through contrastive learning.
Figure 3
VIRTUE-Ranker re-scores each query–video pair by feeding them jointly through the MLLM and projecting the EOS hidden state to a pointwise matching score.
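
The Figure 2 caption describes taking the final hidden state at the EOS token as the embedding. A minimal sketch of that pooling step is shown below. It assumes right-padded inputs in which the last attended position is the EOS token, and it adds L2 normalization as a common convention for cosine-similarity search. The function name and toy values are hypothetical and are not taken from the paper.

```python
import math

def eos_embedding(hidden_states, attention_mask):
    """Return the L2-normalized hidden state at the last attended position.

    Assumes right-padded input where the final unmasked token is EOS.
    """
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    vec = hidden_states[last]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Toy sequence of 4 positions with 2-D hidden states; position 3 is padding.
states = [[0.1, 0.2], [0.5, 0.5], [3.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]
emb = eos_embedding(states, mask)
print(emb)
```

Because the padded position is masked out, the embedding comes from position 2, not from the last row of the array.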
Original abstract

Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval, fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state of the art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state of the art specialized models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces VeRVE, an MLLM-based framework for versatile video retrieval using a shared backbone with contrastive alignment of visual and textual embeddings via LoRA on 700K pairs. It claims to unify corpus-level retrieval, fine-grained moment localization, and composed multimodal queries in one architecture, surpassing other MLLM methods on zero-shot video retrieval, achieving competitive zero-shot moment retrieval without further training, SOTA results on zero-shot composed retrieval, and with additional reranking training, performance comparable to specialized models.

Significance. If the central claims hold under rigorous verification, this would represent a meaningful advance by showing that a single LoRA-adapted MLLM embedding space can support multiple video retrieval tasks—including temporal localization and multimodal composition—without task-specific architectures, potentially simplifying the landscape while matching specialized performance on key benchmarks.

major comments (2)
  1. [Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.
  2. [Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.
minor comments (2)
  1. [Method] Clarify the precise embedding extraction procedure for moment-level queries (e.g., how start/end boundaries are represented in the shared space) and include the exact contrastive loss formulation with temperature and negative sampling details.
  2. [Experiments] Add explicit dataset statistics, baseline implementations, and full metric tables (R@1, R@5, mAP) for all tasks to allow direct comparison with prior MLLM and specialized work.
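
For readers unfamiliar with the retrieval metrics requested above, Recall@K can be computed as in the following sketch. It assumes exactly one relevant item per query, which is a simplification; the function name and the toy rankings are hypothetical.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose single relevant item appears in the top k."""
    hits = sum(
        1 for ranked, target in zip(ranked_lists, ground_truth)
        if target in ranked[:k]
    )
    return hits / len(ground_truth)

# Toy example: four queries, each with a ranked list of candidate ids.
rankings = [[3, 1, 2], [0, 2, 1], [2, 0, 1], [1, 3, 0]]
targets = [3, 2, 1, 0]

r1 = recall_at_k(rankings, targets, 1)
r2 = recall_at_k(rankings, targets, 2)
r3 = recall_at_k(rankings, targets, 3)
print(r1, r2, r3)
```

R@1 counts only exact top hits, so it is the strictest of the three. Larger K values show whether the relevant item is at least reaching the shortlist that a reranker would see.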

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.

Point-by-point responses
  1. Referee: [Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.

    Authors: We appreciate the referee's emphasis on this distinction. In VeRVE, zero-shot moment retrieval operates directly on the frame-level embeddings from the shared MLLM backbone by evaluating query similarity over sliding temporal windows of fixed stride; no proposal generators, temporal convolutions, or post-processing beyond standard top-k selection on segment scores are employed. The global contrastive objective on clip pairs does produce embeddings that support this segmentation without collapse in practice, as evidenced by our competitive results. To address the request for explicit verification, we will add boundary-precision ablations (including mean IoU at multiple thresholds and comparisons against models with explicit temporal modules) in the revised moment-retrieval experiments section. revision: yes

  2. Referee: [Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.

    Authors: We agree that the reranking stage requires fuller exposition to substantiate the strongest claims. The reranker re-uses the identical unified MLLM embeddings for candidate scoring and augments them with a lightweight cross-modal fusion head; it does not introduce separate modality-specific encoders. In the revised manuscript we will expand the reranking subsection to describe the architecture in detail, report the exact volume of additional training pairs used, and include an ablation that isolates the embedding-stage recall from the final reranked performance. This will clarify the incremental contribution of each component. revision: yes
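
The sliding-window procedure described in the first response can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: mean pooling of frame embeddings within each window is assumed here (the response does not say how a window is summarized), and the function names, window length, stride, and toy data are hypothetical. A temporal IoU helper is included because the response commits to reporting IoU-based boundary metrics.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def score_windows(frame_embs, query_emb, window, stride):
    """Score each fixed-length window by cosine similarity to the query.

    Mean pooling of frame embeddings is an assumption of this sketch.
    Returns (score, start, end) tuples, best first, with end exclusive.
    """
    results = []
    for start in range(0, len(frame_embs) - window + 1, stride):
        chunk = frame_embs[start:start + window]
        pooled = [sum(col) / window for col in zip(*chunk)]
        results.append((cosine(pooled, query_emb), start, start + window))
    results.sort(key=lambda r: (-r[0], r[1]))
    return results

def temporal_iou(pred, gold):
    """Intersection over union of two [start, end) intervals."""
    inter = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - inter
    return inter / union if union > 0 else 0.0

# Toy video: 6 frames; frames 2 and 3 point in the query direction.
frames = [[1, 0], [1, 0], [0, 1], [0, 1], [1, 0], [1, 0]]
query = [0, 1]
ranked = score_windows(frames, query, window=2, stride=1)
best = ranked[0]
print(best, temporal_iou((best[1], best[2]), (2, 4)))
```

With a fixed window length, predicted boundaries can only fall on the stride grid. That granularity limit is part of what the referee's request for boundary-precision metrics would expose.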

Circularity Check

0 steps flagged

No circularity detected in derivation chain

full rationale

The paper presents a standard contrastive training pipeline on 700K visual-text pairs using LoRA-adapted MLLM embeddings, followed by empirical evaluation on external zero-shot retrieval benchmarks for corpus, moment, and composed queries. No derivation step reduces by construction to its own fitted inputs or self-citations; performance claims rest on reported results against independent test sets rather than definitional equivalence or renamed fits. The unified-embedding claim is an empirical outcome of the described alignment objective, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on standard multimodal learning assumptions about embedding alignment and the effectiveness of parameter-efficient fine-tuning, with no new entities postulated and minimal free parameters beyond standard training choices.

axioms (1)
  • domain assumption Contrastive alignment of visual and textual embeddings from a shared MLLM backbone enables effective retrieval across multiple task types
    Invoked to support efficient embedding-based candidate search for corpus, moment, and composed queries.

pith-pipeline@v0.9.0 · 5546 in / 1378 out tokens · 46005 ms · 2026-05-16T12:50:32.582513+00:00 · methodology

