arxiv: 2410.05160 · v3 · pith:K7HXT4ZNnew · submitted 2024-10-07 · 💻 cs.CV · cs.AI · cs.CL

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, Wenhu Chen

Pith reviewed 2026-05-17 21:14 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL

keywords multimodal embeddings, vision-language models, contrastive learning, MMEB benchmark, universal embeddings, image-text retrieval, visual grounding

The pith

A contrastive training method turns vision-language models into versatile multimodal embedding models that outperform prior multimodal embedders by 10 to 20 absolute points on average on the evaluation split of a new 36-dataset benchmark.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that contrastive training on a large collection of multimodal datasets can convert existing vision-language models into universal embedding generators. These models accept arbitrary combinations of images and text along with task instructions and output fixed-length vectors suitable for classification, retrieval, visual question answering, and grounding. A sympathetic reader would care because multimodal embedding progress has lagged behind text-only models, and a single training recipe applied to strong VLMs like LLaVA and Phi-3.5-V yields consistent gains on both seen and unseen tasks. The result suggests that the heavy lifting of building general-purpose multimodal embedders can be offloaded to already-trained VLMs rather than designing new architectures from scratch.

Core claim

VLM2Vec is a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model. Unlike CLIP or BLIP, which encode text or images independently without task instructions, VLM2Vec processes any image-text combination guided by instructions to produce a fixed-dimensional vector. When models built on Phi-3.5-V and LLaVA-1.6 are trained on the 20 training datasets of MMEB, they deliver an absolute average improvement of 10 to 20 percent over prior multimodal embedding models on the 16 held-out evaluation datasets, both in-distribution and out-of-distribution.

What carries the argument

VLM2Vec, the contrastive training procedure that adapts a vision-language model to output task-instructed embeddings from mixed image and text inputs.
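As a rough illustration of what a contrastive procedure of this kind optimizes, the sketch below computes an InfoNCE-style loss over embeddings that have already been produced. The function names, the toy vectors, the use of in-batch negatives, and the temperature of 0.05 are assumptions made for illustration. This page does not give the paper's exact loss configuration, and the VLM forward pass that would produce the embeddings is omitted.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch; targets[i] is the positive for queries[i].

    Every other target in the batch is treated as a negative (an assumption
    of this sketch, not a detail confirmed by the source).
    """
    q = [l2_normalize(v) for v in queries]
    t = [l2_normalize(v) for v in targets]
    total = 0.0
    for i, qi in enumerate(q):
        # Temperature-scaled cosine similarity against every target in the batch.
        logits = [sum(a * b for a, b in zip(qi, tj)) / temperature for tj in t]
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(q)

# Toy batch: correctly paired embeddings versus deliberately swapped pairs.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

The loss is low when each query is closest to its own target and high when the pairs are mismatched, which is the signal that pulls instruction-conditioned queries toward their matching candidates.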

If this is right

Existing vision-language models can be repurposed into strong embedding models without new architecture design.
A single training run on the MMEB training split yields gains across classification, retrieval, visual question answering, and grounding.
Multimodal embedding evaluation can now use a standardized benchmark that mixes in-distribution and out-of-distribution tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recipe could be applied to even larger VLMs to test whether scaling laws observed in language models extend to multimodal embeddings.
Task instructions might allow a single model to switch between embedding objectives at inference time without retraining.

Load-bearing premise

That contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets without substantial overfitting or data leakage between splits.

What would settle it

Training VLM2Vec on the 20 datasets and then measuring zero or negative improvement on a fresh multimodal task never seen in MMEB would falsify the claim of broad generalization.

Original abstract

Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multimodal Embedding Benchmark), which covers 4 meta-tasks (i.e. classification, visual question answering, multimodal retrieval, and visual grounding) and 36 datasets, including 20 training and 16 evaluation datasets covering both in-distribution and out-of-distribution tasks, and (2) VLM2Vec (Vision-Language Model -> Vector), a contrastive training framework that converts any state-of-the-art vision-language model into an embedding model via training on MMEB. Unlike previous models such as CLIP and BLIP, which encodes text or images independently without any task instruction, VLM2Vec can process any combination of images and text to generate a fixed-dimensional vector based on task instructions. We build a series of VLM2Vec models on SoTA VLMs like Phi-3.5-V, LLaVA-1.6 and evaluate them on MMEB's evaluation split. Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

VLM2Vec gives a workable recipe for turning VLMs into instruction-aware multimodal embedders with a new large benchmark, but the OOD gains rest on unverified split cleanliness.

The main point is that the authors built MMEB, a benchmark with 36 datasets across classification, VQA, retrieval, and grounding, then used it to contrastively train VLMs like LLaVA and Phi-3.5-V with task instructions so they output fixed vectors for any image-text mix. This produces the reported 10-20% absolute lifts over prior multimodal embedders on both in-distribution and out-of-distribution evaluation sets. That combination of scale and instruction conditioning is the concrete addition relative to standard CLIP-style work.

The practical upside is clear: it shows existing VLMs can be repurposed for embedding tasks without new architectures, and the numbers suggest the approach is effective when the training data is this broad. The paper handles the literature on universal text embedders reasonably and positions the multimodal case as the natural next step.

The soft spot is the one the stress test flags. The abstract treats the 16 evaluation datasets as held-out and includes OOD tasks, yet gives no quantitative evidence that images, captions, or near-duplicates do not cross the train-eval boundary. If sources overlap or near-duplicates slipped through, the measured generalization could be partly artifactual. Baseline details and exact evaluation protocols are also thin in the summary, which leaves room for implementation differences to explain part of the gap.

This is the kind of work that belongs in the multimodal retrieval and embedding community. People building retrieval systems or clustering tools across vision and language will get immediate use from the benchmark and the training recipe. It is coherent on its own terms and engages the prior literature without obvious internal contradictions, so it merits a serious referee to check the split integrity and experimental controls rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MMEB, a benchmark with 36 multimodal datasets (20 for training, 16 for evaluation) spanning classification, visual question answering, multimodal retrieval, and visual grounding, including both in-distribution and out-of-distribution tasks. It proposes VLM2Vec, a contrastive training method to convert vision-language models into embedding models that incorporate task instructions to generate embeddings from mixed image-text inputs. The key finding is that VLM2Vec achieves 10% to 20% absolute improvements over prior multimodal embedding models on the MMEB evaluation split.

Significance. If the generalization results hold, this work is significant for showing that state-of-the-art VLMs can be adapted via contrastive training into strong universal multimodal embedders that handle task instructions, going beyond independent encoding in models like CLIP. The large-scale MMEB benchmark itself is a valuable resource that could standardize evaluation in the field, analogous to MTEB for text embeddings.

major comments (2)

[MMEB Benchmark Construction] MMEB construction and split description: No quantitative checks (image hashing, caption similarity, or source provenance analysis) are reported to rule out sample overlap or near-duplicates between the 20 training datasets and 16 evaluation datasets. This directly affects the load-bearing claim of generalization to out-of-distribution tasks and the interpretation of the 10-20% gains as arising from the VLM2Vec objective rather than leakage.
[Experiments and Results] Experimental protocol and baselines: Insufficient detail is given on exact baseline re-implementations (e.g., whether CLIP/BLIP variants were re-trained on the same MMEB training split with identical prompts or used off-the-shelf), evaluation protocols, and contamination controls. This weakens the quantitative support for the central performance claims.
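The image-hashing check requested in the first major comment could look roughly like the sketch below. It uses a simple average hash and a brute-force Hamming-distance comparison. The function names, the distance threshold, and the toy pixel grids are all hypothetical, and decoding and downsampling real images (for example to 8x8 grayscale) is assumed to happen elsewhere.

```python
def average_hash(pixels):
    """Return a bit tuple: 1 where a pixel is above the image mean, else 0.

    `pixels` is a small grayscale grid (list of rows), assumed to be already
    downsampled by whatever image library is in use.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    return tuple(1 if p > mean else 0 for p in flat)

def hamming(a, b):
    """Count differing bits between two equal-length hashes."""
    return sum(x != y for x, y in zip(a, b))

def overlap_rate(train_images, eval_images, max_distance=2):
    """Fraction of eval images whose hash is within max_distance of any train hash."""
    train_hashes = [average_hash(img) for img in train_images]
    flagged = 0
    for img in eval_images:
        h = average_hash(img)
        if any(hamming(h, th) <= max_distance for th in train_hashes):
            flagged += 1
    return flagged / len(eval_images)

# Toy 2x2 "images": the first eval image is a brightened copy of the train image.
train = [[[10, 200], [220, 15]]]
evals = [[[20, 210], [230, 25]], [[200, 10], [15, 220]]]
print(overlap_rate(train, evals, max_distance=0))
```

The brute-force comparison is quadratic in the number of images, so a check at benchmark scale would likely need hash bucketing or an index. Caption similarity and source provenance would need separate checks.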

minor comments (2)

[Abstract] The phrase 'VLMs are secretly strong embedding models' in the abstract is informal; a more precise statement such as 'VLMs can be effectively adapted as task-aware embedding models' would improve formality.
[Results] Tables reporting average improvements should explicitly separate in-distribution and out-of-distribution results and include standard deviations or statistical tests to support the 'absolute average improvement of 10% to 20%' claim.
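One possible way to produce the reporting requested in the second minor comment is sketched below: per-split means and standard deviations, plus a paired bootstrap interval on the per-dataset gain over a baseline. All scores are placeholders invented for the example and are not results from the paper. The function names and the choice of a percentile bootstrap are likewise assumptions.

```python
import random
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of per-dataset scores."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_bootstrap_ci(model, baseline, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-dataset gain (model - baseline)."""
    rng = random.Random(seed)
    diffs = [m - b for m, b in zip(model, baseline)]
    means = []
    for _ in range(n_resamples):
        # Resample datasets with replacement, keeping model/baseline pairs together.
        sample = [rng.choice(diffs) for _ in diffs]
        means.append(statistics.mean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Placeholder scores, NOT taken from the paper: (model, baseline) per split.
splits = {
    "in-distribution": ([70.0, 65.0, 80.0, 75.0], [55.0, 50.0, 66.0, 58.0]),
    "out-of-distribution": ([60.0, 52.0, 58.0, 64.0], [48.0, 45.0, 44.0, 50.0]),
}
for name, (model, baseline) in splits.items():
    mean, std = summarize(model)
    lower, upper = paired_bootstrap_ci(model, baseline)
    print(f"{name}: mean={mean:.1f} std={std:.1f} gain 95% CI=[{lower:.1f}, {upper:.1f}]")
```

Reporting the two splits separately shows whether an average gain is driven mainly by in-distribution datasets.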

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments on benchmark validation and experimental transparency are well-taken and will improve the manuscript. We address each major comment below and commit to revisions that strengthen the presentation of our results without altering the core claims.

Referee: [MMEB Benchmark Construction] MMEB construction and split description: No quantitative checks (image hashing, caption similarity, or source provenance analysis) are reported to rule out sample overlap or near-duplicates between the 20 training datasets and 16 evaluation datasets. This directly affects the load-bearing claim of generalization to out-of-distribution tasks and the interpretation of the 10-20% gains as arising from the VLM2Vec objective rather than leakage.

Authors: We acknowledge that the original manuscript did not report explicit quantitative overlap analyses. The 36 datasets were drawn from established public benchmarks and retained their original train/evaluation splits to preserve task diversity and out-of-distribution coverage. In the revised version we will add a dedicated appendix section that quantifies potential overlaps using perceptual image hashing and sentence-embedding cosine similarity between the training and evaluation partitions. Preliminary internal checks show overlap rates below 1 percent; these results will be reported to support the interpretation that the observed gains stem from the contrastive training objective rather than data leakage. revision: yes
Referee: [Experiments and Results] Experimental protocol and baselines: Insufficient detail is given on exact baseline re-implementations (e.g., whether CLIP/BLIP variants were re-trained on the same MMEB training split with identical prompts or used off-the-shelf), evaluation protocols, and contamination controls. This weakens the quantitative support for the central performance claims.

Authors: We agree that additional protocol details are required for reproducibility. All reported baselines (CLIP, BLIP, and related models) were evaluated using their publicly released checkpoints without any fine-tuning on the MMEB training split, preserving a fair comparison to prior work that does not incorporate task instructions. In the revision we will expand the experimental section and add an appendix that specifies exact prompt templates, similarity computation, batch sizes, and hardware settings. We will also include an explicit discussion of contamination controls, confirming that evaluation tasks were chosen to avoid source overlap with training data and describing the steps taken to mitigate leakage risks. revision: yes
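To make the "similarity computation" and evaluation protocol mentioned in the second response concrete, the sketch below ranks a candidate pool per query by cosine similarity and scores top-1 accuracy. This assumes the protocol is a per-query ranking scored by Precision@1, which this page does not state. The function names and toy embeddings are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def precision_at_1(query_embeddings, candidate_pools, gold_indices):
    """Fraction of queries whose top-ranked candidate is the gold candidate."""
    hits = 0
    for query, candidates, gold in zip(query_embeddings, candidate_pools, gold_indices):
        scores = [cosine(query, c) for c in candidates]
        predicted = max(range(len(scores)), key=scores.__getitem__)
        hits += int(predicted == gold)
    return hits / len(gold_indices)

# Two toy queries, each with its own candidate pool.
queries = [[1.0, 0.0], [0.0, 1.0]]
pools = [
    [[0.9, 0.1], [0.1, 0.9]],  # gold is index 0 and ranks first
    [[0.8, 0.2], [0.3, 0.7]],  # gold is index 0, but index 1 ranks first
]
print(precision_at_1(queries, pools, [0, 0]))
```

Fixing the similarity function and the candidate pools in this way matters for baseline comparisons, because differences in either can shift scores independently of the embedding model.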

Circularity Check

0 steps flagged

No significant circularity: empirical results on held-out MMEB evaluation splits

The paper introduces the MMEB benchmark with an explicit partition into 20 training datasets and 16 distinct evaluation datasets (covering in-distribution and out-of-distribution tasks), trains VLM2Vec via contrastive learning on the training split, and reports performance metrics on the held-out evaluation split. This constitutes an independent empirical test rather than any derivation that reduces to its own inputs by construction. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the core claims; the 10-20% gains are measured against external held-out data and therefore remain falsifiable outside the training procedure.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard contrastive learning assumptions and the construction of a new benchmark; no new physical entities or ad-hoc constants are introduced.

free parameters (1)

contrastive temperature
Standard hyperparameter in contrastive objectives that is typically tuned on validation data.
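A small sketch of why this hyperparameter matters: dividing similarity scores by a temperature before the softmax controls how sharply probability mass concentrates on the highest-scoring candidate. The similarity values and the three temperatures below are toy choices for illustration, not settings taken from the paper.

```python
import math

def softmax_with_temperature(similarities, temperature):
    """Convert similarity scores into probabilities; lower temperature is sharper."""
    scaled = [s / temperature for s in similarities]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

similarities = [0.8, 0.6, 0.1]  # positive first, then two negatives (toy values)
for tau in (1.0, 0.1, 0.02):
    probs = softmax_with_temperature(similarities, tau)
    print(tau, [round(p, 3) for p in probs])
```

At a high temperature the three candidates receive similar probabilities, so small similarity gaps are weakly penalized. At a low temperature nearly all mass goes to the top candidate, which strengthens the training signal from hard negatives but can make optimization less stable.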

axioms (1)

domain assumption: Contrastive loss on task-instructed multimodal inputs produces useful fixed-dimensional embeddings
Invoked when the authors convert VLMs into embedders via contrastive training on MMEB.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Evidence utility is defined as information gain on the model's output distribution, with ranking by gain on a latent helpfulness variable shown equivalent to answer-space utility under mild assumptions, enabling a tra...
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 7.0

Jina-embeddings-v5-omni creates multimodal embeddings for text, image, audio, and video by freezing the text and media encoders and training only 0.35% of the weights via a VLM-style connector.
MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models
cs.IR 2026-04 unverdicted novelty 7.0

MMEB-V3 benchmark shows omni-modality embedding models fail to enforce instruction-specified modality constraints and exhibit asymmetric, query-biased retrieval.
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
cs.IR 2026-04 unverdicted novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.
PLUME: Latent Reasoning Based Universal Multimodal Embedding
cs.CV 2026-04 unverdicted novelty 7.0

PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
Adapting MLLMs for Nuanced Video Retrieval
cs.CV 2025-12 unverdicted novelty 7.0

Text-only contrastive fine-tuning of an MLLM with hard negatives produces embeddings that handle temporal, negation, and multimodal nuances in video retrieval and achieves SOTA performance.
jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers
cs.CL 2026-05 unverdicted novelty 6.0

GELATO extends frozen text embedding models with locked image and audio encoders, training minimal connectors to produce a single semantic embedding space for text, image, audio, and video while keeping original text ...
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
cs.CV 2026-04 unverdicted novelty 6.0

Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
cs.IR 2026-04 unverdicted novelty 6.0

HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding
cs.CL 2026-01 unverdicted novelty 6.0

CausalEmbed uses auto-regressive generation with iterative margin loss to produce multi-vector embeddings that reduce visual token counts 30-155x while retaining competitive performance on VDR benchmarks.
EmbeddingGemma: Powerful and Lightweight Text Representations
cs.CL 2025-09 unverdicted novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.
MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction
cs.IR 2025-09 unverdicted novelty 6.0

MetaEmbed trains fixed learnable Meta Tokens to produce granularity-organized multi-vector embeddings that support test-time scaling in multimodal retrieval.
Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval
cs.CV 2026-04 unverdicted novelty 5.0

SSA-ME uses saliency-aware modeling to reduce visual neglect and semantic drift, achieving SOTA results on the MMEB benchmark for multimodal retrieval.
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
cs.IR 2026-04 unverdicted novelty 5.0

BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...
Attention Grounded Enhancement for Visual Document Retrieval
cs.IR 2025-11 unverdicted novelty 5.0

AGREE boosts visual document retrieval by adding local relevance signals from MLLM attention maps to global document labels during retriever training.
VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
cs.CV 2025-07 unverdicted novelty 5.0

VLM2Vec-V2 is a multimodal embedding model trained on an extended MMEB-V2 benchmark that adds video and visual document tasks and reports gains on both new and prior image benchmarks.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 19 Pith papers · 9 internal anchors

[1]

Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical re- port: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

SemEval-2012 task 6: A pilot on semantic textual similarity

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret (eds.), *SEM 2012: The First Joint Conference on Lexical and Com- putational Semantics – Volume 1: Proceedings of the main conference and the sha...

work page 2012
[3]

URL https://aclanthology.org/S12-1051

Association for Computational Linguis- tics. URL https://aclanthology.org/S12-1051. Akari Asai, Timo Schick, Patrick Lewis, Xilun Chen, Gautier Izacard, Sebastian Riedel, Han- naneh Hajishirzi, and Wen-tau Yih. Task-aware retrieval with instructions. arXiv preprint arXiv:2211.09260,

work page arXiv
[4]

Llm2vec: Large language models are secretly powerful text encoders

Parishad BehnamGhader, Vaibhav Adlakha, Marius Mosbach, Dzmitry Bahdanau, Nicolas Chapa- dos, and Siva Reddy. Llm2vec: Large language models are secretly powerful text encoders.arXiv preprint arXiv:2404.05961,

work page arXiv
[5]

SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation

Daniel Cer, Mona Diab, Eneko Agirre, I ˜nigo Lopez-Gazpio, and Lucia Specia. SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation. In Steven Bethard, Marine Carpuat, Marianna Apidianaki, Saif M. Mohammad, Daniel Cer, and David Ju- rgens (eds.), Proceedings of the 11th International Workshop on Semantic Evaluati...

work page 2017
[6]

doi: 10.18653/v1/S17-2001

Association for Computational Linguistics. doi: 10.18653/v1/S17-2001. URL https://aclanthology.org/S17-2001. Yingshan Chang, Mridu Narang, Hisami Suzuki, Guihong Cao, Jianfeng Gao, and Yonatan Bisk. Webqa: Multihop and multimodal qa. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16495–16504,

work page doi:10.18653/v1/s17-2001 2001
[7]

Supervised learning of universal sentence representations from natural language inference data

Alexis Conneau, Douwe Kiela, Holger Schwenk, Lo ¨ıc Barrault, and Antoine Bordes. Supervised learning of universal sentence representations from natural language inference data. In Proceed- ings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 670–680,

work page 2017
[8]

Imagenet: A large-scale hi- erarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hi- erarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Ieee,

work page 2009
[9]

org/CorpusID:207252270

URL https://api.semanticscholar. org/CorpusID:207252270. 12 Manuscript Stephanie Fu, Netanel Tamir, Shobhita Sundaram, Lucy Chai, Richard Zhang, Tali Dekel, and Phillip Isola. Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344,

work page arXiv
[10]

Scaling deep contrastive learning batch size under memory limited setup

Luyu Gao, Yunyi Zhang, Jiawei Han, and Jamie Callan. Scaling deep contrastive learning batch size under memory limited setup. arXiv preprint arXiv:2101.06983, 2021a. Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processin...

work page arXiv 2021
[11]

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig

URL https://arxiv.org/abs/2007.0128. Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning , pp. 4904–4916. PMLR,

work page arXiv 2007
[12]

E5-V: Universal Embeddings with Multimodal Large Language Models

Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Dense passage retrieval for open-domain question answering

Vladimir Karpukhin, Barlas Oguz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen-tau Yih. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 6769–6781,

work page 2020
[14]

Referitgame: Referring to objects in photographs of natural scenes

Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 787–798,

work page 2014
[15]

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019a. Tom Kwiatkowski, Jennimaria Palomaki, Oliv...

work page internal anchor Pith review Pith/arXiv arXiv
[16]

LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. Llava-next-interleave: Tackling multi-image, video, and 3d in large multimodal models. arXiv preprint arXiv:2407.07895,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

Towards General Text Embeddings with Multi-stage Contrastive Learning

Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pp. 19730–19742. PMLR, 2023a. Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with ...

work page internal anchor Pith review Pith/arXiv arXiv 2014
[18]

Visual news: Benchmark and challenges in news image captioning

Fuxiao Liu, Yinghan Wang, Tianlu Wang, and Vicente Ordonez. Visual news: Benchmark and challenges in news image captioning. arXiv preprint arXiv:2010.03743,

work page arXiv 2010
[19]

What makes good in-context examples for gpt-3? DeeLIO 2022, pp

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. What makes good in-context examples for gpt-3? DeeLIO 2022, pp. 100,

work page 2022
[20]

Edis: Entity-driven image search over multimodal web content

Siqi Liu, Weixi Feng, Tsu-jui Fu, Wenhu Chen, and William Yang Wang. Edis: Entity-driven image search over multimodal web content. arXiv preprint arXiv:2305.13631,

work page arXiv
[21]

Unifying multimodal retrieval via document screenshot embedding

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. arXiv preprint arXiv:2406.11251, 2024a. Xueguang Ma, Liang Wang, Nan Yang, Furu Wei, and Jimmy Lin. Fine-tuning llama for multi-stage text retrieval. In Proceedings of the 47th International ACM SIGIR Conference on Research...

work page arXiv
[22]

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Ahmed Masry, Do Xuan Long, Jia Qing Tan, Shafiq Joty, and Enamul Hoque. Chartqa: A bench- mark for question answering about charts with visual and logical reasoning. arXiv preprint arXiv:2203.10244,

work page internal anchor Pith review Pith/arXiv arXiv
[23]

Efficient Estimation of Word Representations in Vector Space

URL https://huggingface. co/Salesforce/SFR-Embedding-2_R. Tomas Mikolov. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Mteb: Massive text em- bedding benchmark

Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. Mteb: Massive text em- bedding benchmark. In Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037,

work page 2014
[25]

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al

URL https://www.microsoft.com/en-us/research/publication/ ms-marco-human-generated-machine-reading-comprehension-dataset/ . Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernandez Abrego, Ji Ma, Vincent Zhao, Yi Luan, Keith Hall, Ming-Wei Chang, et al. Large dual encoders are generalizable retrievers. In Proceedings of the 2022 Conference on Empirical ...

work page 2022
[26]

Glove: Global vectors for word representation

15 Manuscript Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543,

work page 2014
[27]

Sentence-BERT: Sentence embeddings using Siamese BERT- networks

Nils Reimers and Iryna Gurevych. Sentence-BERT: Sentence embeddings using Siamese BERT- networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Lan- guage Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3982–3992, Hong Kong, China, November

work page 2019
[28]

Sentence-bert: Sentence embeddings using siamese bert-networks

Association for Com- putational Linguistics. doi: 10.18653/v1/D19-1410. URL https://aclanthology.org/ D19-1410. Ohad Rubin, Jonathan Herzig, and Jonathan Berant. Learning to retrieve prompts for in-context learning. In Proceedings of the 2022 Conference of the North American Chapter of the Associa- tion for Computational Linguistics: Human Language Techno...

work page doi:10.18653/v1/d19-1410 2022
[29]

Rep- etition improves language model embeddings.arXiv preprint arXiv:2402.15449,

Jacob Mitchell Springer, Suhas Kotha, Daniel Fried, Graham Neubig, and Aditi Raghunathan. Rep- etition improves language model embeddings. arXiv preprint arXiv:2402.15449,

work page arXiv
[30]

One embedder, any task: Instruction-finetuned text embeddings

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings. In Findings of the Association for Computational Linguistics: ACL 2023 , pp. 1102–1121,

work page 2023
[31]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin ...

work page internal anchor Pith review Pith/arXiv arXiv
[33]

N24news: A new dataset for multimodal news classification

Zhen Wang, Xu Shan, Xiangxie Zhang, and Jie Yang. N24news: A new dataset for multimodal news classification. arXiv preprint arXiv:2108.13327,

work page arXiv
[34]

Simvlm: Simple visual language model pretraining with weak supervision

Zirui Wang, Jiahui Yu, Adams Wei Yu, Zihang Dai, Yulia Tsvetkov, and Yuan Cao. Simvlm: Simple visual language model pretraining with weak supervision. In International Conference on Learning Representations, 2022b. Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking ...

work page arXiv
[35]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pp. 3485–3492. IEEE,

work page 2010
[36]

Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808,

Lee Xiong, Chenyan Xiong, Ye Li, Kwok-Fung Tang, Jialin Liu, Paul Bennett, Junaid Ahmed, and Arnold Overwijk. Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv preprint arXiv:2007.00808,

work page arXiv 2007
[37]

Magiclens: Self-supervised image retrieval with open-ended instructions

Kai Zhang, Yi Luan, Hexiang Hu, Kenton Lee, Siyuan Qiao, Wenhu Chen, Yu Su, and Ming-Wei Chang. Magiclens: Self-supervised image retrieval with open-ended instructions. arXiv preprint arXiv:2403.19651,

work page arXiv
[38]

The original dataset consists of triplets: a reference image and two perturbed versions, along with human judgments indicating which version is most similar to the reference

The dataset contains human similarity judgments on image pairs that are alike in various ways. The original dataset consists of triplets: a reference image and two perturbed versions, along with human judgments indicating which version is most similar to the reference. Following M-BEIR (Wei et al., 2023), we refactor this dataset into a retrieval task to ...

work page 2023
[39]

This dataset contains entity-rich queries, requiring the model to understand both entities and events from the text queries

The dataset is a cross-modal image search in the news domain. This dataset contains entity-rich queries, requiring the model to understand both entities and events from the text queries. The candidate consists of the news image and its accompanying headline. Wiki-SS-NQ (Ma et al., 2024a) The dataset is another retrieval-based VQA dataset. Unlike the origi...

work page 2023
[40]

telling” and “pointing

The dataset establishes a semantic link between textual descriptions and image regions through object-level grounding. It has two types of questions: “telling” and “pointing”. It leverages the six W questions (what, where, when, who, why, and how) to systematically examine a model’s capability for visual understanding through telling questions. Addi- ...

work page 2000
[41]

Represent the given news image with the following caption for domain classification. Ms. Goodman styled Amber Valletta with wings for a 1993 shoot by Peter Lindbergh for Harper’s Bazaar. Style - VOC2007 (Everingham et al.,

work page 1993
[42]

bus - SUN397 (Xiao et al., 2010) Identify the scene shown in the image

Identify the object shown in the image. bus - SUN397 (Xiao et al., 2010) Identify the scene shown in the image. firing range indoor - ObjectNet (Barbu et al.,

work page 2010
[43]

Find a Wikipedia image-passage pair that answers this question. Do both the Hays County Courthouse in San Marcos, Texas and the Ike Wood House at 227 Mitchell Street in San Marcos, Texas have six columns on their front entrance? - Represent the given Wikipedia image with related text information. Hays County Courthouse (2018), San Marcos, TX The Hays ...

work page 2018
[44]

Tom Holland makes his debut in the Spidey suit in Captain America Civil War

Find a news image that matches the provided caption. Tom Holland makes his debut in the Spidey suit in Captain America Civil War. - Represent the given image with related text information. Comic Riffs Jon Favreau is set to reprise his Iron Man role for Spider Man: Homecoming. Wiki-SS-NQ (Ma et al., 2024a) Find the document screenshot that can answer the g...

work page 2020
[45]

kid on right in back, blondish hair Select the portion of the image that follows the language expressions

Select the portion of the image that follows the language expressions. kid on right in back, blondish hair Select the portion of the image that follows the language expressions. top right kid Table 11: Zero-shot text-image retrieval performance on Flickr30K. As a general multimodal representation model, VLM2Vec can still achieve competitive T2I (Text-...

work page 2023