pith. machine review for the scientific record.

arxiv: 2507.04590 · v1 · pith:BP2MKXXLnew · submitted 2025-07-07 · 💻 cs.CV · cs.CL

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

Pith reviewed 2026-05-18 14:04 UTC · model grok-4.3

classification 💻 cs.CV cs.CL
keywords multimodal embeddings · video retrieval · visual document retrieval · unified embedding learning · MMEB-V2 benchmark · temporal grounding · retrieval-augmented generation

The pith

VLM2Vec-V2 trains a single embedding model that handles text, images, videos, and visual documents while improving results on both new and existing benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to extend multimodal embedding models beyond natural images to cover videos and visual documents. Current models like the original VLM2Vec focus narrowly on images, which limits their usefulness for real applications such as AI agents, multimodal search, and retrieval-augmented generation. To address this, the authors create the MMEB-V2 benchmark with five new task types spanning visual documents and videos, then train VLM2Vec-V2 as a unified model on mixed data. Experiments show the model performs well on the new retrieval and grounding tasks while also beating prior baselines on the original image benchmarks.

Core claim

VLM2Vec-V2 is a general-purpose embedding model that supports text, image, video, and visual document inputs and achieves strong performance on newly introduced video and document retrieval tasks while also improving over prior baselines on the original image benchmarks.

What carries the argument

The unified framework that trains one model on a mixed data regime covering text, image, video, and visual document inputs to produce embeddings usable across all four modalities.

If this is right

  • Multimodal search and recommendation systems can use one embedding space instead of separate models for images, videos, and documents.
  • Retrieval-augmented generation pipelines gain access to video and visual document sources through the same embedding mechanism.
  • AI agents can retrieve and reason over mixed visual inputs without switching between modality-specific embedders.
  • Future representation learning research can build on the observed generalizability patterns across the new benchmark tasks.
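
The first consequence above, one embedding space shared by images, videos, and documents, can be illustrated with a minimal sketch. It assumes every item has already been embedded by a single model into vectors of the same dimension; the toy 3-dimensional vectors, the item IDs, and the `retrieve` helper are hypothetical, and cosine similarity is an assumed scoring choice rather than something the abstract specifies.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve(query_vec, index, k=2):
    """Return the k most similar candidates, best first, regardless of modality."""
    scored = [(cosine(query_vec, vec), item_id, modality)
              for item_id, modality, vec in index]
    scored.sort(key=lambda t: t[0], reverse=True)
    return scored[:k]

# Hypothetical toy vectors standing in for the output of one shared embedder.
index = [
    ("img_001", "image", [0.9, 0.1, 0.0]),
    ("vid_007", "video", [0.7, 0.6, 0.1]),
    ("doc_042", "document", [0.0, 0.2, 0.9]),
]
query = [0.8, 0.5, 0.0]  # hypothetical text-query embedding

top = retrieve(query, index)
for score, item_id, modality in top:
    print(f"{item_id} ({modality}): {score:.3f}")
```

Because all candidates live in one index, the ranking step needs no modality-specific branch; whether a single model's scores are actually comparable across modalities is exactly what the benchmark results would have to show.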

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If unified training continues to lift image performance as a side effect, practitioners may prefer single models over ensembles even when only images are needed.
  • The benchmark expansion suggests similar extensions could be made for audio or 3D data to test whether the same training recipe scales further.
  • Effective strategies identified for unified embedding learning might reduce the cost of maintaining separate modality-specific systems in production.

Load-bearing premise

The training procedure and data mixture used for VLM2Vec-V2 will generalize across video and visual-document tasks without requiring modality-specific architectural changes or extra tuning.

What would settle it

A held-out set of video retrieval or visual document tasks where VLM2Vec-V2 scores lower than existing specialized models for those modalities would show the unified approach does not generalize as claimed.

Original abstract

Multimodal embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering over different modalities. However, existing multimodal embeddings like VLM2Vec, E5-V, GME are predominantly focused on natural images, with limited support for other visual forms such as videos and visual documents. This restricts their applicability in real-world scenarios, including AI agents, multi-modal search and recommendation, and retrieval-augmented generation (RAG). To close this gap, we propose VLM2Vec-V2, a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types: visual document retrieval, video retrieval, temporal grounding, video classification and video question answering - spanning text, image, video, and visual document inputs. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs. Extensive experiments show that VLM2Vec-V2 achieves strong performance not only on the newly introduced video and document retrieval tasks, but also improves over prior baselines on the original image benchmarks. Through extensive evaluation, our study offers insights into the generalizability of various multimodal embedding models and highlights effective strategies for unified embedding learning, laying the groundwork for more scalable and adaptable representation learning in both research and real-world settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces VLM2Vec-V2, a unified framework for learning embeddings across text, images, videos, and visual documents. It presents MMEB-V2, an extended benchmark including new tasks for visual document retrieval, video retrieval, temporal grounding, video classification, and video question answering. The authors report that VLM2Vec-V2 achieves strong performance on the new video and document tasks while also improving upon prior baselines on the original image benchmarks from MMEB.

Significance. If the reported improvements on image benchmarks can be attributed to the unified training procedure and data mixture rather than simply increased training data or a stronger backbone, this work would represent a meaningful advance in multimodal embedding models. It addresses a practical gap in supporting diverse visual modalities for applications such as retrieval-augmented generation and AI agents, and the new benchmark could facilitate future research in this area.

major comments (2)
  1. [Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.
  2. [Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.
minor comments (2)
  1. Add standard deviations or results from multiple random seeds to all reported metrics in the main results tables to allow assessment of statistical reliability.
  2. [Benchmark] Clarify the exact definition and construction of the five new task types in MMEB-V2 (e.g., how temporal grounding is formulated as a retrieval task) in the benchmark description section.
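
The first minor comment asks for variability across random seeds. A minimal sketch of that reporting follows, assuming one scalar score per seed for a given task; the scores are placeholders, not results from the paper, and `summarize` is a hypothetical helper.

```python
import statistics

def summarize(scores_by_seed):
    """Return (mean, sample standard deviation) of per-seed scores."""
    scores = list(scores_by_seed.values())
    mean = statistics.mean(scores)
    # Sample standard deviation needs at least two runs.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std

# Placeholder scores keyed by random seed (illustrative only).
runs = {0: 60.0, 1: 62.0, 2: 64.0}
mean, std = summarize(runs)
print(f"{mean:.1f} ± {std:.1f} over {len(runs)} seeds")
```

Reporting the number of seeds next to the mean and standard deviation lets readers judge whether a gap between two models exceeds run-to-run noise.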

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and commit to revisions that directly respond to the concerns raised regarding attribution of gains and experimental controls.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The abstract asserts that VLM2Vec-V2 improves over prior baselines on the original image benchmarks, yet provides no quantitative results, no details on the volume or composition of image-text pairs relative to the original VLM2Vec training corpus, and no mention of backbone strength or total compute. This leaves open the possibility that observed gains on legacy MMEB tasks are driven by expanded data rather than the unified framework, directly undermining the central attribution claim.

    Authors: We agree that the abstract would benefit from greater specificity. In the revised version we will insert the key quantitative improvements on the original MMEB image tasks (average gains relative to prior baselines), a concise statement on the scale and composition of the image-text portion of the training mixture, and the backbone and approximate compute used. While the full training details and per-task numbers already appear in the Experiments section, we accept that moving a summary of these facts into the abstract will make the attribution claim more transparent and will reduce the possibility that readers attribute gains solely to data volume. revision: yes

  2. Referee: [Experiments] Experiments section: To substantiate the claim that the unified training procedure enables generalization across modalities without modality-specific changes, the manuscript must include controlled ablations (e.g., VLM2Vec-V2 trained on image-only data with matched volume versus the full video+document mixture). Absent such controls, the image-benchmark gains cannot be confidently credited to the proposed method rather than data scaling.

    Authors: We acknowledge that the current experiments do not contain an explicit image-only ablation with matched data volume, which would more cleanly isolate the effect of the joint training procedure. We will add this controlled comparison in the revised manuscript: a model trained on an image-only subset whose volume matches the image portion of the full mixture, evaluated on the same legacy MMEB image tasks. This ablation will be reported alongside the existing results to demonstrate whether the observed image gains arise from the unified multi-modal mixture or from data scaling alone. We believe the addition will directly address the referee’s concern while preserving the paper’s central narrative. revision: yes
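
One possible reading of the promised ablation is sketched below: the control model sees exactly the image portion of the full mixture, so any difference on the legacy image tasks can be traced to the added video and document data. The `modality` field, the example records, and `split_for_ablation` are hypothetical; the actual data format and sampling procedure are not described in the material reviewed here.

```python
from collections import Counter

def split_for_ablation(mixture):
    """Build the two training sets for a matched-volume image ablation."""
    image_only = [ex for ex in mixture if ex["modality"] == "image"]
    return {"image_only": image_only, "full_mixture": list(mixture)}

# Hypothetical training mixture; each record carries a modality tag.
mixture = [
    {"id": "a", "modality": "image"},
    {"id": "b", "modality": "video"},
    {"id": "c", "modality": "image"},
    {"id": "d", "modality": "document"},
]

splits = split_for_ablation(mixture)
image_count = Counter(ex["modality"] for ex in mixture)["image"]

# The control has the same image volume as the full mixture's image portion.
print(len(splits["image_only"]), image_count, len(splits["full_mixture"]))
```

Holding the image examples fixed across both arms isolates the effect of the extra modalities; it does not control for total training steps, which would need to be matched or reported separately.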

Circularity Check

0 steps flagged

No circularity: purely empirical training and benchmarking pipeline

Full rationale

The paper introduces the MMEB-V2 benchmark and trains VLM2Vec-V2 on a data mixture, then reports measured performance on held-out retrieval, classification, and QA tasks. No mathematical derivation, first-principles prediction, or fitted parameter is presented whose output is forced by construction to equal its own inputs. All claims are externally falsifiable via standard benchmark comparisons and do not rely on self-citation chains or ansatzes that smuggle in the target result. This is a standard empirical ML paper whose central results stand or fall on the reported numbers rather than on internal definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. Training of a large multimodal model implicitly involves many hyperparameters and data choices that are not enumerated here.



Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When Text Hijacks Vision: Benchmarking and Mitigating Text Overlay-Induced Hallucination in Vision Language Models

    cs.CV 2026-04 unverdicted novelty 8.0

    VLMs hallucinate by prioritizing contradictory on-screen text over visual content, addressed via the VisualTextTrap benchmark with 6,057 human-validated samples and the VTHM-MoE dual-encoder framework using dimension-...

  2. Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

    cs.CL 2026-05 unverdicted novelty 7.0

    Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...

  3. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a structured taxonomy of artistic production variables, dual-channel conditioning, a style-motion curriculum, and deformation-aware optimization to prioritize art over physics.

  4. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos by structuring artistic production rules into a controllable taxonomy and training the model to prioritize those rules over physical realism, achieving top scores from professional ani...

  5. AniMatrix: An Anime Video Generation Model that Thinks in Art, Not Physics

    cs.CV 2026-05 unverdicted novelty 7.0

    AniMatrix generates anime videos using a production knowledge taxonomy, dual-channel conditioning, style-motion curriculum, and deformation-aware preference optimization, outperforming baselines in animator evaluation...

  6. MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models

    cs.IR 2026-04 unverdicted novelty 7.0

    MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.

  7. CodeMMR: Bridging Natural Language, Code, and Image for Unified Retrieval

    cs.SE 2026-04 unverdicted novelty 7.0

    CodeMMR creates a unified embedding space for text, code, and images, outperforming baselines by 10 nDCG@10 points and boosting RAG code generation quality.

  8. CLAY: Conditional Visual Similarity Modulation in Vision-Language Embedding Space

    cs.CV 2026-04 unverdicted novelty 7.0

    CLAY reframes pretrained VLM embedding spaces as text-conditional similarity spaces for adaptive, multi-conditioned image retrieval without additional training.

  9. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  10. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  11. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  12. LMEB: Long-horizon Memory Embedding Benchmark

    cs.CL 2026-03 unverdicted novelty 7.0

    LMEB benchmark shows that embedding models' performance on traditional retrieval does not transfer to long-horizon memory tasks, larger models do not always perform better, and LMEB measures capabilities orthogonal to MTEB.

  13. Adapting MLLMs for Nuanced Video Retrieval

    cs.CV 2025-12 unverdicted novelty 7.0

    Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.

  14. DenseStep2M: A Scalable, Training-Free Pipeline for Dense Instructional Video Annotation

    cs.CV 2026-04 unverdicted novelty 6.0

    A scalable training-free pipeline using video segmentation, filtering, and off-the-shelf multimodal models creates DenseStep2M, a dataset of 100K videos and 2M detailed instructional steps that improves dense captioni...

  15. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  16. SIMMER: Cross-Modal Food Image--Recipe Retrieval via MLLM-Based Embedding

    cs.CV 2026-04 unverdicted novelty 6.0

    SIMMER uses a single multimodal LLM (VLM2Vec) with custom prompts and partial-recipe augmentation to embed food images and recipes, achieving new state-of-the-art retrieval accuracy on Recipe1M.

  17. CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding

    cs.CL 2026-01 unverdicted novelty 6.0

    CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.

  18. FreeRet: MLLMs as Training-Free Retrievers

    cs.CV 2025-09 unverdicted novelty 6.0

    FreeRet enables pretrained MLLMs to act as training-free retrievers via semantically grounded embeddings and reasoning-based reranking, outperforming models trained on millions of pairs on MMEB benchmarks.

  19. MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

    cs.IR 2025-09 unverdicted novelty 6.0

    MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.

  20. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
