{"paper":{"title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Rui Meng, Semih Yavuz, Wenhu Chen, Xinyi Yang, Yingbo Zhou, Ziyan Jiang","submitted_at":"2024-10-07T16:14:05Z","abstract_excerpt":"Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1481c76f92c6b42bf9d5d37389448605a6a15a0598c2bd27b37fff6a9b998fd4"},"source":{"id":"2410.05160","kind":"arxiv","version":3},"verdict":{"id":"cb75eb9e-32fc-412b-a139-f81b4ac81d84","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T21:14:56.020911Z","strongest_claim":"Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.","one_line_summary":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.","pith_extraction_headline":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks."},"references":{"count":45,"sample":[{"doi":"","year":null,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":2012,"title":"SemEval-2012 task 6: A pilot on semantic textual similarity","work_id":"b3da1a53-1971-4931-9961-0c8af87a30a4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"arXiv preprint arXiv:2211.09260 , year=","work_id":"8ff1935b-870d-4685-99e0-95249679188d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Llm2vec: Large language models are secretly powerful text encoders","work_id":"156e1320-54cd-416f-af15-d9da54374957","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation","work_id":"a8ff10da-ea02-4989-80b4-bbd28ac1e663","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":45,"snapshot_sha256":"485c1fe862d436c5a0563abbab722ef29ae07269e3c036a03b4f88d505dd298c","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}