VeRVE: Versatile Retrieval for Videos via Unified Embeddings
Pith reviewed 2026-05-16 12:50 UTC · model grok-4.3
The pith
A shared MLLM backbone with contrastive embeddings unifies corpus-level video retrieval, moment localization, and composed multimodal queries in one model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VeRVE establishes that contrastive alignment in a shared MLLM backbone creates a unified embedding space for efficient candidate search. The model, trained via LoRA on 700K visual-text pairs, exceeds prior MLLM retrieval methods on zero-shot tasks, transfers directly to moment retrieval and composed queries, and after reranking training matches specialized models while surpassing other MLLM systems.
What carries the argument
The contrastively aligned visual-textual embedding space from the shared MLLM backbone, which powers fast embedding search followed by optional reranking.
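The search-then-rerank pattern described here can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the toy vectors, the function names, and the `rerank_score` callable are all hypothetical, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def embedding_search(query_vec, corpus_vecs, k):
    """Stage 1: rank corpus items by cosine similarity and keep the top k indices."""
    ranked = sorted(
        range(len(corpus_vecs)),
        key=lambda i: cosine(query_vec, corpus_vecs[i]),
        reverse=True,
    )
    return ranked[:k]

def retrieve(query_vec, corpus_vecs, k, rerank_score=None):
    """Stage 2 (optional): reorder the k candidates with a costlier scorer.

    rerank_score is a hypothetical callable (candidate index -> float) that
    stands in for whatever reranker is trained on top of the search stage.
    """
    candidates = embedding_search(query_vec, corpus_vecs, k)
    if rerank_score is None:
        return candidates
    return sorted(candidates, key=rerank_score, reverse=True)

# Toy 2-D "embeddings"; real embeddings would come from the MLLM backbone.
corpus = [[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
top2 = retrieve(query, corpus, k=2)
reranked = retrieve(query, corpus, k=2, rerank_score=lambda i: float(i))
```

The point of the split is cost: the first stage touches every corpus item with a cheap similarity, while the reranker only ever sees the k survivors.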
If this is right
- Zero-shot video retrieval exceeds other MLLM-based systems on established benchmarks.
- The same embeddings transfer to moment-level localization without extra architecture.
- Composed multimodal queries achieve state-of-the-art zero-shot results.
- Reranking training closes the gap to specialized retrieval models.
Where Pith is reading between the lines
- A single model could replace separate systems for different video search scenarios in practice.
- Larger MLLM backbones might extend the approach to even finer retrieval granularity.
- The contrastive alignment step may transfer to other multimodal domains beyond video.
Load-bearing premise
The shared MLLM backbone with contrastive alignment generalizes to fine-grained moment localization and composed multimodal queries without task-specific architectural changes or heavy retraining.
What would settle it
On standard benchmarks such as MSR-VTT or ActivityNet, the embedding search fails to place relevant videos or moments in top-k results in zero-shot tests, or reranking training does not reach parity with specialized models.
Original abstract
Modern video retrieval systems are expected to handle diverse tasks ranging from corpus-level retrieval, fine-grained moment localization to flexible multimodal querying. Specialized architectures achieve strong retrieval performance by training modality-specific encoders on massive datasets, but they lack the ability to process composed multimodal queries. In contrast, multimodal LLM (MLLM)-based methods support rich multimodal search but their retrieval performance remains well below that of specialized systems. We present VeRVE, an MLLM-based versatile video retrieval framework that integrates corpus and moment-level retrieval capabilities while accommodating composed multimodal queries within a single architecture. We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone to facilitate efficient embedding-based candidate search. Our embedding model, trained efficiently using low-rank adaptation (LoRA) on 700K paired visual-text data samples, surpasses other MLLM-based methods on zero-shot video retrieval tasks. Additionally, we demonstrate that the same model can be adapted without further training to achieve competitive results on zero-shot moment retrieval, and state of the art results for zero-shot composed video retrieval. With additional training for reranking candidates identified in the embedding-based search, our model substantially outperforms existing MLLM-based retrieval systems and achieves retrieval performance comparable to state of the art specialized models.
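For readers unfamiliar with low-rank adaptation, the sketch below shows the standard LoRA weight update, W + (alpha / r) · B · A, in which only the small matrices A and B are trained. The matrix sizes, rank, and alpha value are illustrative assumptions and are not taken from the paper.

```python
def matmul(x, y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]

def lora_effective_weight(w, a, b, alpha):
    """Return W + (alpha / r) * B @ A, where r is the number of rows of A.

    w: frozen weight, shape (d_out, d_in)
    a: trainable down-projection, shape (r, d_in)
    b: trainable up-projection, shape (d_out, r)
    """
    r = len(a)
    scale = alpha / r
    delta = matmul(b, a)
    return [
        [w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]

# Toy example with d_out=2, d_in=3, rank r=1 (all values are made up).
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b = [[0.5], [0.0]]
w_eff = lora_effective_weight(w, a, b, alpha=2.0)

# With B set to zero the adapted weight equals the frozen weight.
w_unchanged = lora_effective_weight(w, a, [[0.0], [0.0]], alpha=2.0)
```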
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VeRVE, an MLLM-based framework for versatile video retrieval. A shared backbone, adapted with LoRA on 700K visual-text pairs, is trained to contrastively align visual and textual embeddings. The paper claims that this single architecture unifies corpus-level retrieval, fine-grained moment localization, and composed multimodal queries: it surpasses other MLLM methods on zero-shot video retrieval, achieves competitive zero-shot moment retrieval without further training, reaches SOTA results on zero-shot composed retrieval, and, with additional reranking training, performs comparably to specialized models.
Significance. If the central claims hold under rigorous verification, this would represent a meaningful advance by showing that a single LoRA-adapted MLLM embedding space can support multiple video retrieval tasks—including temporal localization and multimodal composition—without task-specific architectures, potentially simplifying the landscape while matching specialized performance on key benchmarks.
major comments (2)
- [Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.
- [Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.
minor comments (2)
- [Method] Clarify the precise embedding extraction procedure for moment-level queries (e.g., how start/end boundaries are represented in the shared space) and include the exact contrastive loss formulation with temperature and negative sampling details.
- [Experiments] Add explicit dataset statistics, baseline implementations, and full metric tables (R@1, R@5, mAP) for all tasks to allow direct comparison with prior MLLM and specialized work.
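As a reminder of what the requested R@K figures measure, here is a minimal sketch of Recall@K for the single-ground-truth case. The ranked lists and ground-truth indices are invented for illustration and do not correspond to any result in the paper.

```python
def recall_at_k(ranked_lists, ground_truth, k):
    """Fraction of queries whose ground-truth item appears in the top k results."""
    hits = sum(1 for ranked, gt in zip(ranked_lists, ground_truth) if gt in ranked[:k])
    return hits / len(ground_truth)

# One ranked candidate list per query, best first (made-up indices).
ranked = [[3, 1, 2], [0, 2, 1], [2, 0, 1]]
truth = [3, 1, 0]

r_at_1 = recall_at_k(ranked, truth, 1)
r_at_2 = recall_at_k(ranked, truth, 2)
r_at_3 = recall_at_k(ranked, truth, 3)
```

Reporting the same function at several values of k, for both the embedding stage alone and the reranked output, is one way to isolate what each stage contributes.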
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments. We address each major point below, providing clarifications and committing to revisions that strengthen the manuscript without misrepresenting our contributions.
Point-by-point responses
- Referee: [Abstract and moment-retrieval experiments] The claim of competitive zero-shot moment retrieval 'without further training' (abstract) rests on the assumption that global contrastive alignment on clip-level pairs encodes precise temporal boundaries. Standard contrastive losses optimize for global similarity and can collapse local distinctions; the manuscript must provide ablations or boundary-precision metrics (e.g., in the moment-retrieval experiments) showing that no implicit post-processing or proposal mechanism is required, as specialized models explicitly add temporal convolutions or generators to address this.
  Authors: We appreciate the referee's emphasis on this distinction. In VeRVE, zero-shot moment retrieval operates directly on the frame-level embeddings from the shared MLLM backbone by evaluating query similarity over sliding temporal windows of fixed stride; no proposal generators, temporal convolutions, or post-processing beyond standard top-k selection on segment scores are employed. The global contrastive objective on clip pairs does produce embeddings that support this segmentation without collapse in practice, as evidenced by our competitive results. To address the request for explicit verification, we will add boundary-precision ablations (including mean IoU at multiple thresholds and comparisons against models with explicit temporal modules) in the revised moment-retrieval experiments section.
  Revision: yes
- Referee: [Reranking and final results sections] The comparability to SOTA specialized models is achieved only after 'additional training for reranking candidates identified in the embedding-based search' (abstract). This reranking stage is load-bearing for the strongest claim; the manuscript must detail the reranker architecture, whether it re-uses the same unified embeddings or introduces new components, the training data volume, and a direct comparison isolating the contribution of the initial embedding stage versus the reranker.
  Authors: We agree that the reranking stage requires fuller exposition to substantiate the strongest claims. The reranker re-uses the identical unified MLLM embeddings for candidate scoring and augments them with a lightweight cross-modal fusion head; it does not introduce separate modality-specific encoders. In the revised manuscript we will expand the reranking subsection to describe the architecture in detail, report the exact volume of additional training pairs used, and include an ablation that isolates the embedding-stage recall from the final reranked performance. This will clarify the incremental contribution of each component.
  Revision: yes
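The sliding-window procedure described in the first response, together with the temporal IoU used for boundary-precision metrics, can be sketched as follows. This is only an illustration under stated assumptions: mean pooling of frame embeddings within a window, cosine similarity as the score, and the toy frame vectors are all hypothetical choices, since the rebuttal does not specify them.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of vectors (assumed pooling)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def score_windows(query_vec, frame_vecs, window, stride):
    """Score every fixed-length window; return (score, start, end), best first."""
    results = []
    for start in range(0, len(frame_vecs) - window + 1, stride):
        end = start + window
        pooled = mean_vector(frame_vecs[start:end])
        results.append((cosine(query_vec, pooled), start, end))
    return sorted(results, reverse=True)

def temporal_iou(pred, gt):
    """Intersection over union of two (start, end) intervals."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

# Six toy frame embeddings; frames 2 and 3 match the query direction.
frames = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
query = [1.0, 0.0]

best_score, best_start, best_end = score_windows(query, frames, window=2, stride=1)[0]
iou = temporal_iou((best_start, best_end), (2, 5))
```

Because the window length and stride are fixed, the predicted boundaries can only land on that grid, which is why IoU at several thresholds is a useful check on the referee's concern.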
Circularity Check
No circularity detected in derivation chain
The paper presents a standard contrastive training pipeline on 700K visual-text pairs using LoRA-adapted MLLM embeddings, followed by empirical evaluation on external zero-shot retrieval benchmarks for corpus, moment, and composed queries. No derivation step reduces by construction to its own fitted inputs or self-citations; performance claims rest on reported results against independent test sets rather than definitional equivalence or renamed fits. The unified-embedding claim is an empirical outcome of the described alignment objective, not a self-referential loop.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: contrastive alignment of visual and textual embeddings from a shared MLLM backbone enables effective retrieval across multiple task types.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Paper passage: "We use contrastive alignment of visual and textual embeddings generated using a shared MLLM backbone... $\mathcal{L}_{\mathrm{InfoNCE}} = -\log \frac{\exp(\mathrm{sim}(q_i, c_i)/\tau)}{\sum_{j} \exp(\mathrm{sim}(q_i, c_j)/\tau)}$"
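The InfoNCE expression quoted above can be evaluated directly from a matrix of query-candidate similarities. The sketch below is a minimal illustration, assuming in-batch negatives with the positive pair on the diagonal; the similarity values and the temperature are made up and are not the paper's settings.

```python
import math

def info_nce(sim_matrix, tau):
    """Mean InfoNCE loss over a batch.

    sim_matrix[i][j] holds sim(q_i, c_j); the positive for query i is c_i
    (an assumed in-batch-negatives layout). tau is the temperature.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / tau for s in row]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Uninformative similarities give log(batch size); aligned pairs give a loss near zero.
uniform_loss = info_nce([[0.0, 0.0], [0.0, 0.0]], tau=0.1)
aligned_loss = info_nce([[1.0, 0.0], [0.0, 1.0]], tau=0.1)
```

A lower temperature sharpens the softmax, so the same similarity gap between positive and negatives produces a smaller loss.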
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Appendix A. Additional Implementation Details
All VIRTUE models use Qwen2...
... and 2e−5 for the video-text stage (stage 2). We employ the AdamW optimizer with a cosine learning rate schedule and mixed-precision training (BF16). At inference time, we include dual-softmax based re-ordering before feeding the candidates to the re-ranker only for the VIRTUE-Ranker based results in Tabs. 1 to 3.
A.1. Evaluation Datasets
We provide detailed ...