VeRVE: Versatile Retrieval for Videos via Unified Embeddings

Bhagyashree Puranik; Jayakrishnan Unnikrishnan; Kushan Thakkar; Shaunak Halbe; Toufiq Parag; Vimal Bhat

arXiv: 2601.12193 · v3 · submitted 2026-01-17 · 💻 cs.CV

Shaunak Halbe, Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Vimal Bhat, Toufiq Parag

Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3

classification 💻 cs.CV

keywords video retrieval · multimodal LLM · contrastive alignment · moment localization · composed queries · LoRA fine-tuning · unified embeddings · reranking

The pith

A shared MLLM backbone with contrastive embeddings unifies corpus-level video retrieval, moment localization, and composed multimodal queries in one model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that contrastive alignment of visual and textual embeddings from a single multimodal LLM backbone, trained efficiently with LoRA on 700K pairs, supports embedding-based search across multiple retrieval tasks. This architecture outperforms other MLLM methods on zero-shot video retrieval and adapts without further training to handle fine-grained moment localization and flexible composed queries. With added reranking training on top of the embeddings, performance reaches levels comparable to specialized state-of-the-art systems. A reader would care because the approach replaces multiple task-specific models with one general system that still delivers strong results on diverse video search needs.

Core claim

VeRVE establishes that contrastive alignment in a shared MLLM backbone creates a unified embedding space for efficient candidate search. The model, trained via LoRA on 700K visual-text pairs, exceeds prior MLLM retrieval methods on zero-shot tasks, transfers directly to moment retrieval and composed queries, and after reranking training matches specialized models while surpassing other MLLM systems.

What carries the argument

The contrastively aligned visual-textual embedding space from the shared MLLM backbone, which powers fast embedding search followed by optional reranking.
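The two-stage flow named here, fast embedding search followed by optional reranking, can be sketched as below. This is a minimal illustration under stated assumptions rather than the paper's implementation: cosine similarity is assumed as the similarity measure, the 2-D vectors are toy values, and `embedding_search`, `retrieve`, and `toy_rerank_scores` are hypothetical names standing in for the embedding model and the learned reranker.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def embedding_search(query: Vector, corpus: Dict[str, Vector], k: int) -> List[str]:
    """Stage 1: rank all corpus items by similarity to the query, keep the top k."""
    ranked = sorted(corpus, key=lambda vid: cosine(query, corpus[vid]), reverse=True)
    return ranked[:k]


def retrieve(
    query: Vector,
    corpus: Dict[str, Vector],
    k: int,
    rerank_score: Optional[Callable[[str], float]] = None,
) -> List[str]:
    """Stage 2 (optional): reorder the stage-1 candidates with a costlier scorer."""
    candidates = embedding_search(query, corpus, k)
    if rerank_score is None:
        return candidates
    return sorted(candidates, key=rerank_score, reverse=True)


# Toy 2-D embeddings (hypothetical values, for illustration only).
corpus = {"vid_a": [1.0, 0.0], "vid_b": [0.9, 0.1], "vid_c": [0.0, 1.0]}
query = [1.0, 0.0]

# Stand-in for a learned reranker: a fixed lookup of query-video scores.
toy_rerank_scores = {"vid_a": 0.2, "vid_b": 0.9}

first_stage = retrieve(query, corpus, k=2)
reranked = retrieve(query, corpus, k=2, rerank_score=toy_rerank_scores.__getitem__)
```

The reranker only ever sees the `k` candidates that survive stage 1, so it can reorder them but cannot recover a relevant video that the embedding search missed.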

If this is right

Zero-shot video retrieval exceeds other MLLM-based systems on established benchmarks.
The same embeddings transfer to moment-level localization without extra architecture.
Composed multimodal queries achieve state-of-the-art zero-shot results.
Reranking training closes the gap to specialized retrieval models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A single model could replace separate systems for different video search scenarios in practice.
Larger MLLM backbones might extend the approach to even finer retrieval granularity.
The contrastive alignment step may transfer to other multimodal domains beyond video.

Load-bearing premise

The shared MLLM backbone with contrastive alignment generalizes to fine-grained moment localization and composed multimodal queries without task-specific architectural changes or heavy retraining.

What would settle it

The claim would be undermined if, on standard benchmarks such as MSR-VTT or ActivityNet, zero-shot embedding search failed to place relevant videos or moments in the top-k results, or if reranking training did not reach parity with specialized models.

Figures

Figures reproduced from arXiv: 2601.12193 by Bhagyashree Puranik, Jayakrishnan Unnikrishnan, Kushan Thakkar, Shaunak Halbe, Toufiq Parag, Vimal Bhat.

**Figure 1.** VIRTUE supports corpus-level retrieval with reranking, zero-shot composed video retrieval, and zero-shot moment localization within a single architecture. The table highlights that VIRTUE uniquely offers unified embeddings and versatile capabilities. ⋄ indicates models that, while architecturally capable of processing multimodal inputs, have not demonstrated composed video retrieval capability.

**Figure 2.** VIRTUE-Embed uses the final hidden state of the EOS token as an embedding anchor, and aligns visual content and text descriptions through contrastive learning.

**Figure 3.** VIRTUE-Ranker re-scores each query–video pair by feeding them jointly through the MLLM and projecting the EOS hidden state to a pointwise matching score.

Original abstract

Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval, fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state of the art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state of the art specialized models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

VeRVE puts corpus, moment, and composed video retrieval into one LoRA-tuned MLLM embedding space, but the moment results look like they may lean on reranking more than the shared embeddings alone can deliver.

The paper's core move is training a single MLLM backbone with LoRA on 700K visual-text pairs using contrastive alignment, then using the resulting embeddings for corpus-level search, zero-shot moment localization, and composed multimodal queries. It reports beating other MLLM retrieval methods on zero-shot tasks and reaching performance comparable to specialized models once a reranker is trained on top. The practical payoff is clear: one model handles three different retrieval modes without separate encoders or heavy retraining for each.
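For readers unfamiliar with LoRA, the sketch below shows the standard low-rank update: the frozen weight W is left untouched, and a trainable product BA, scaled by alpha/r, is added to its output. The matrices, shapes, and `alpha` value are toy assumptions chosen for readability; nothing here reflects the rank, scaling, or target modules used in the paper.

```python
from typing import List

Matrix = List[List[float]]


def matvec(m: Matrix, v: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(w: Matrix, a: Matrix, b: Matrix, x: List[float], alpha: float) -> List[float]:
    """Compute W x + (alpha / r) * B (A x), where r is the rank (rows of A).

    W stays frozen during fine-tuning; only A and B would be trained.
    """
    rank = len(a)
    base = matvec(w, x)
    update = matvec(b, matvec(a, x))
    scale = alpha / rank
    return [h + scale * u for h, u in zip(base, update)]


# Toy setup: a 2x2 frozen weight and a rank-1 adapter (hypothetical values).
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 1.0]]               # shape (r=1, in=2)
b_zero = [[0.0], [0.0]]        # B starts at zero, so the adapter is a no-op
b_trained = [[0.5], [-0.5]]    # pretend values after some training
x = [1.0, 2.0]

out_at_init = lora_forward(w, a, b_zero, x, alpha=2.0)
out_trained = lora_forward(w, a, b_trained, x, alpha=2.0)
```

Because B is initialized to zero, the adapted layer reproduces the frozen layer exactly at the start of training; only the small A and B matrices need gradients, which is what makes the approach parameter-efficient.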

Referee Report

2 major / 2 minor

Summary. The paper introduces VeRVE, an MLLM-based framework for versatile video retrieval. A shared backbone is LoRA-tuned on 700K pairs to contrastively align visual and textual embeddings. The paper claims that this single architecture unifies corpus-level retrieval, fine-grained moment localization, and composed multimodal queries; that it surpasses other MLLM methods on zero-shot video retrieval; that it achieves competitive zero-shot moment retrieval without further training and SOTA results on zero-shot composed retrieval; and that, with additional reranking training, it performs comparably to specialized models.

Significance. If the central claims hold under rigorous verification, this would represent a meaningful advance by showing that a single LoRA-adapted MLLM embedding space can support multiple video retrieval tasks—including temporal localization and multimodal composition—without task-specific architectures, potentially simplifying the landscape while matching specialized performance on key benchmarks.

major comments (2)

[Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.
[Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.

minor comments (2)

[Method] Clarify the precise embedding extraction procedure for moment-level queries (e.g., how start/end boundaries are represented in the shared space) and include the exact contrastive loss formulation with temperature and negative sampling details.
[Experiments] Add explicit dataset statistics, baseline implementations, and full metric tables (R@1, R@5, mAP) for all tasks to allow direct comparison with prior MLLM and specialized work.
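As a reference point for the metric tables requested above, Recall@K can be computed as in the sketch below. It assumes exactly one relevant video per query, which is a simplifying assumption rather than a statement about the paper's evaluation protocol; the query and video ids are made up.

```python
from typing import Dict, List


def recall_at_k(rankings: Dict[str, List[str]], ground_truth: Dict[str, str], k: int) -> float:
    """Fraction of queries whose single relevant item appears in the top k."""
    hits = sum(1 for query, ranked in rankings.items() if ground_truth[query] in ranked[:k])
    return hits / len(rankings)


# Hypothetical ranked lists returned by a retrieval system, best first.
rankings = {
    "q1": ["v1", "v2", "v3"],
    "q2": ["v3", "v2", "v1"],
    "q3": ["v2", "v3", "v1"],
}
# Hypothetical ground truth: one relevant video per query.
ground_truth = {"q1": "v1", "q2": "v2", "q3": "v1"}

r_at_1 = recall_at_k(rankings, ground_truth, k=1)
r_at_2 = recall_at_k(rankings, ground_truth, k=2)
r_at_3 = recall_at_k(rankings, ground_truth, k=3)
```

Reporting several cutoffs side by side matters for a two-stage system: Recall@K of the embedding stage at the candidate-list size bounds what any reranker can achieve at R@1.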

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.

Referee: [Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.

Authors: We appreciate the referee's emphasis on this distinction. In VeRVE, zero-shot moment retrieval operates directly on the frame-level embeddings from the shared MLLM backbone by evaluating query similarity over sliding temporal windows of fixed stride; no proposal generators, temporal convolutions, or post-processing beyond standard top-k selection on segment scores are employed. The global contrastive objective on clip pairs does produce embeddings that support this segmentation without collapse in practice, as evidenced by our competitive results. To address the request for explicit verification, we will add boundary-precision ablations (including mean IoU at multiple thresholds and comparisons against models with explicit temporal modules) in the revised moment-retrieval experiments section.

revision: yes

Referee: [Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.

Authors: We agree that the reranking stage requires fuller exposition to substantiate the strongest claims. The reranker re-uses the identical unified MLLM embeddings for candidate scoring and augments them with a lightweight cross-modal fusion head; it does not introduce separate modality-specific encoders. In the revised manuscript we will expand the reranking subsection to describe the architecture in detail, report the exact volume of additional training pairs used, and include an ablation that isolates the embedding-stage recall from the final reranked performance. This will clarify the incremental contribution of each component.

revision: yes
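The sliding-window procedure described in the first simulated response, together with the IoU measure it promises to report, can be sketched as follows. This is an illustration under assumptions, not the paper's code: per-frame query similarities are taken as given, each window is scored by the mean of its frame similarities, and the window length, stride, and toy scores are hypothetical.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), end exclusive


def window_scores(frame_sims: List[float], window: int, stride: int) -> List[Tuple[int, int, float]]:
    """Score every fixed-length window by the mean query-frame similarity."""
    scored = []
    for start in range(0, len(frame_sims) - window + 1, stride):
        segment = frame_sims[start:start + window]
        scored.append((start, start + window, sum(segment) / window))
    return scored


def top_k_moments(frame_sims: List[float], window: int, stride: int, k: int) -> List[Tuple[int, int, float]]:
    """Plain top-k selection on window scores, with no proposal network."""
    ranked = sorted(window_scores(frame_sims, window, stride), key=lambda s: s[2], reverse=True)
    return ranked[:k]


def temporal_iou(pred: Segment, gold: Segment) -> float:
    """Intersection over union of two temporal segments."""
    intersection = max(0, min(pred[1], gold[1]) - max(pred[0], gold[0]))
    union = (pred[1] - pred[0]) + (gold[1] - gold[0]) - intersection
    return intersection / union if union > 0 else 0.0


# Toy similarities between one text query and six frames (hypothetical).
frame_sims = [0.1, 0.2, 0.8, 0.9, 0.7, 0.1]

best = top_k_moments(frame_sims, window=3, stride=1, k=1)[0]
iou = temporal_iou((best[0], best[1]), (2, 4))  # gold moment covers frames 2-3
```

With a single fixed window length, a predicted boundary can only be as precise as the window and stride allow, which is one concrete reason the referee's request for IoU at multiple thresholds is informative.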

Circularity Check

0 steps flagged

No circularity detected in derivation chain

The paper presents a standard contrastive training pipeline on 700K visual-text pairs using LoRA-adapted MLLM embeddings, followed by empirical evaluation on external zero-shot retrieval benchmarks for corpus, moment, and composed queries. No derivation step reduces by construction to its own fitted inputs or self-citations; performance claims rest on reported results against independent test sets rather than definitional equivalence or renamed fits. The unified-embedding claim is an empirical outcome of the described alignment objective, not a self-referential loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on standard multimodal learning assumptions about embedding alignment and the effectiveness of parameter-efficient fine-tuning, with no new entities postulated and minimal free parameters beyond standard training choices.

axioms (1)

domain assumption: Contrastive alignment of visual and textual embeddings from a shared MLLM backbone enables effective retrieval across multiple task types.
Invoked to support efficient embedding-based candidate search for corpus, moment, and composed queries.

pith-pipeline@v0.9.0 · 5546 in / 1378 out tokens · 46005 ms · 2026-05-16T12:50:32.582513+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear

We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone... $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q_i, c_i)/\tau)}{\sum_{j} \exp(\mathrm{sim}(q_i, c_j)/\tau)}$
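The InfoNCE objective in the quoted passage can be evaluated for a single query as in the sketch below. It takes precomputed similarities between query q_i and every candidate c_j and returns the negative log of the softmax probability assigned to the positive. The similarity values and temperature are toy assumptions; the paper's actual temperature and negative sampling are not specified in this excerpt.

```python
import math
from typing import List


def info_nce(similarities: List[float], positive_index: int, temperature: float) -> float:
    """InfoNCE loss for one query.

    similarities[j] is sim(q_i, c_j); positive_index marks the matching c_i.
    Uses the log-sum-exp trick for numerical stability.
    """
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)
    log_denominator = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_denominator - logits[positive_index]


# Toy case 1: the positive (index 0) is far more similar than the negatives.
loss_easy = info_nce([0.9, 0.1, 0.0], positive_index=0, temperature=0.1)

# Toy case 2: the positive is indistinguishable from one negative.
loss_tied = info_nce([0.5, 0.5], positive_index=0, temperature=0.1)
```

When the positive and a negative are tied, the loss equals log 2 regardless of temperature; a smaller temperature sharpens the softmax and so penalizes near-miss negatives more heavily.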

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Appendix A. Additional Implementation Details

All VIRTUE models use Qwen2...

... and 2e−5 for the video-text stage (stage 2). We employ the AdamW optimizer with a cosine learning rate schedule and mixed-precision training (BF16). At inference time, we include dual-softmax based re-ordering before feeding the candidates to the re-ranker only for the VIRTUE-Ranker based results in Tabs. 1 to 3.

A.1. Evaluation Datasets

We provide detailed ...

[1] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Bink... Flamingo: a visual language model for few-shot learning,

[2] Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, Humen Zhong, Yuanzhi Zhu, Mingkun Yang, Zhaohai Li, Jianqiang Wan, Pengfei Wang, Wei Ding, Zheren Fu, Yiheng Xu, Jiabo Ye, Xi Zhang, Tianbao Xie, Zesen Cheng, Hang Zhang, Zhibo Yang, Haiyang Xu, and Junyang Lin. Qwen2.5-vl technical report, 2025. ...

[3] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. A clip-hitchhiker’s guide to long video retrieval, 2022. 1, 3

[4] Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. Frozen in time: A joint video and image encoder for end-to-end retrieval, 2022. 1, 3

[5] Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, and Christoph Feichtenhofer. Perception encoder: The best visual embeddings are not at the output of the network, 2025. 6

[6] Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39(3/4):324–345, 1952. 5

[7] David Chen and William Dolan. Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 190–200, Portland, Oregon, USA, 2011. Association for Computational Linguistics. 2, 6, 11

[8] Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24185–24198, 2024. 3

[9] Giordano Cicchetti, Eleonora Grassucci, Luigi Sigillo, and Danilo Comminiello. Gramian multimodal representation learning and alignment, 2025. 2, 3

[10] Gabriel de Souza P. Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. Nv-retriever: Improving text embedding models with effective hard-negative mining, 2025. 6

[11] Rui Meng et al. Vlm2vec-v2: Advancing multimodal embedding for videos, images, and visual documents, 2025. 6, 13

[12] Zhuoning Guo et al. Towards universal video retrieval: Generalizing video embedding via synthesized multimodal pyramid curriculum, 2025. 6, 7, 13

[13] Jiyang Gao, Chen Sun, Zhenheng Yang, and Ram Nevatia. Tall: Temporal activity localization via language query,

[14] Lisa Anne Hendricks, Oliver Wang, Eli Shechtman, Josef Sivic, Trevor Darrell, and Bryan Russell. Localizing moments in video with natural language, 2017. 2, 6, 11

[15] Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. 4, 11

[16] Bin Huang, Xin Wang, Hong Chen, Zihan Song, and Wenwu Zhu. Vtimellm: Empower llm to grasp video moments,

[17] Thomas Hummel, Shyamgopal Karthik, Mariana-Iuliana Georgescu, and Zeynep Akata. Egocvr: An egocentric benchmark for fine-grained composed video retrieval, 2024. 1

[18] Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models, 2024. 1, 3

[19] Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. Vlm2vec: Training vision-language models for massive multimodal embedding tasks, 2025. 1, 2, 3, 6, 7, 13

[20] Dahye Kim, Jungin Park, Jiyoung Lee, Seongheon Park, and Kwanghoon Sohn. Language-free training for zero-shot video grounding, 2022. 8

[21] Ranjay Krishna, Kenji Hata, Frederic Ren, Li Fei-Fei, and Juan Carlos Niebles. Dense-captioning events in videos,

[22] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In Proceedings of the 39th International Conference on Machine Learning, pages 12888–12900. PMLR, 2022. 3

[23] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023. 1, 3

[24] Kunchang Li, Yali Wang, Yinan He, Yizhuo Li, Yi Wang, Yi Liu, Zun Wang, Jilan Xu, Guo Chen, Ping Luo, Limin Wang, and Yu Qiao. Mvbench: A comprehensive multimodal video understanding benchmark, 2024. 8

[25] Kevin Qinghong Lin, Pengchuan Zhang, Joya Chen, Shraman Pramanick, Difei Gao, Alex Jinpeng Wang, Rui Yan, and Mike Zheng Shou. Univtg: Towards unified video-language temporal grounding, 2023. 8

[26] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning, 2023. 1, 3

[27] Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant,

[28] Weiheng Lu, Jian Li, An Yu, Ming-Ching Chang, Shengpeng Ji, and Min Xia. Llava-mr: Large language-and-vision assistant for video moment retrieval, 2024. 2

[29] Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. Clip4clip: An empirical study of clip for end to end video clip retrieval, 2021. 1, 3

[30] Boris Meinardus, Anil Batra, Anna Rohrbach, and Marcus Rohrbach. The surprising effectiveness of multimodal large language models for video moment retrieval. ArXiv, abs/2406.18113, 2024. 2

[31] Antoine Miech, Dimitri Zhukov, Jean-Baptiste Alayrac, Makarand Tapaswi, Ivan Laptev, and Josef Sivic. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips, 2019. 3

[32] Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, and Andrew Zisserman. End-to-end learning of visual representations from uncurated instructional videos, 2020. 3

[33] Niklas Muennighoff, Hongjin Su, Liang Wang, Nan Yang, Furu Wei, Tao Yu, Amanpreet Singh, and Douwe Kiela. Generative representational instruction tuning, 2025. 8, 12

[34] Long Qian, Juncheng Li, Yu Wu, Yaobo Ye, Hao Fei, Tat-Seng Chua, Yueting Zhuang, and Siliang Tang. Momentor: Advancing video large language model with fine-grained temporal reasoning, 2024. 2, 3, 8

[35] Mengxue Qu, Xiaodong Chen, Wu Liu, Alicia Li, and Yao Zhao. Chatvtg: Video temporal grounding via chat with video dialogue large language models, 2024. 2, 3, 8

[36] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision, 2021. 3

[37] Shuhuai Ren, Linli Yao, Shicheng Li, Xu Sun, and Lu Hou. Timechat: A time-sensitive multimodal large language model for long video understanding, 2024. 3, 8

[38] Mamshad Nayeem Rizve, Fan Fei, Jayakrishnan Unnikrishnan, Son Tran, Benjamin Z. Yao, Belinda Zeng, Mubarak Shah, and Trishul Chilimbi. Vidla: Video-language alignment at scale, 2024. 3

[39] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, Melbourne, Australia, 2018. Association for Computational Li...

[40] Xiaoqian Shen, Yunyang Xiong, Changsheng Zhao, Lemeng Wu, Jun Chen, Chenchen Zhu, Zechun Liu, Fanyi Xiao, Balakrishnan Varadarajan, Florian Bordes, Zhuang Liu, Hu Xu, Hyunwoo J. Kim, Bilge Soran, Raghuraman Krishnamoorthi, Mohamed Elhoseiny, and Vikas Chandra. Longvu: Spatiotemporal adaptive compression for long video-language understanding, 2024. 3

[41] Omkar Thawakar, Muzammal Naseer, Rao Muhammad Anwer, Salman Khan, Michael Felsberg, Mubarak Shah, and Fahad Shahbaz Khan. Composed video retrieval via enriched context and discriminative embeddings, 2024. 2, 7, 13

[42] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019. 4

[43] Lucas Ventura, Antoine Yang, Cordelia Schmid, and Gül Varol. Covr-2: Automatic data construction for composed video retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):11409–11421, 2024. 1, 2, 3, 7, 11, 13

[44] Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, Jiazheng Xu, Bin Xu, Juanzi Li, Yuxiao Dong, Ming Ding, and Jie Tang. Cogvlm: Visual expert for pretrained language models, 2024. 3

[45] Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Chenting Wang, Guo Chen, Baoqi Pei, Ziang Yan, Rongkun Zheng, Jilan Xu, Zun Wang, Yansong Shi, Tianxiang Jiang, Songze Li, Hongjie Zhang, Yifei Huang, Yu Qiao, Yali Wang, and Limin Wang. Internvideo2: Scaling foundation models for multimodal video understanding, 2024. 1, 2, 3, 6, 7, 13

[46] Yueqian Wang, Xiaojun Meng, Jianxin Liang, Yuxuan Wang, Qun Liu, and Dongyan Zhao. Hawkeye: Training video-text llms for grounding text in videos, 2024. 3, 7, 8

[47] Jun Xu, Tao Mei, Ting Yao, and Yong Rui. Msr-vtt: A large video description dataset for bridging video and language. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5288–5296, 2016. 2, 6, 11

[48] Lin Xu, Yilin Zhao, Daquan Zhou, Zhijie Lin, See Kiong Ng, and Jiashi Feng. Pllava: Parameter-free llava extension from images to videos for video dense captioning, 2024. 3
