{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:K7HXT4ZN3IS3EC6AAMEVNSKHRJ","short_pith_number":"pith:K7HXT4ZN","schema_version":"1.0","canonical_sha256":"57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f","source":{"kind":"arxiv","id":"2410.05160","version":3},"attestation_state":"computed","paper":{"title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Rui Meng, Semih Yavuz, Wenhu Chen, Xinyi Yang, Yingbo Zhou, Ziyan Jiang","submitted_at":"2024-10-07T16:14:05Z","abstract_excerpt":"Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multi"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":false},"canonical_record":{"source":{"id":"2410.05160","kind":"arxiv","version":3},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CV","submitted_at":"2024-10-07T16:14:05Z","cross_cats_sorted":["cs.AI","cs.CL"],"title_canon_sha256":"41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4","abstract_canon_sha256":"fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:13.047494Z","signature_b64":"2uUtR3tTTeURj5M69KiG7LFZeTvcvSrk9nVIY7MTVdPE8kyGD9jy+w9f09vI/d/a6SO2H5omZfoa853xUhCyCg==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f","last_reissued_at":"2026-05-17T23:38:13.046884Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:13.046884Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks.","cross_cats":["cs.AI","cs.CL"],"primary_cat":"cs.CV","authors_text":"Rui Meng, Semih Yavuz, Wenhu Chen, Xinyi Yang, Yingbo Zhou, Ziyan Jiang","submitted_at":"2024-10-07T16:14:05Z","abstract_excerpt":"Embedding models have been crucial in enabling various downstream tasks such as semantic similarity, information retrieval, and clustering. Recently, there has been a surge of interest in developing universal text embedding models that can generalize across tasks (e.g., MTEB). However, progress in learning universal multimodal embedding models has been relatively slow despite its importance and practicality. In this work, we aim to explore the potential for building universal embeddings capable of handling a wide range of downstream tasks. Our contributions are twofold: (1) MMEB (Massive Multi"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"1481c76f92c6b42bf9d5d37389448605a6a15a0598c2bd27b37fff6a9b998fd4"},"source":{"id":"2410.05160","kind":"arxiv","version":3},"verdict":{"id":"cb75eb9e-32fc-412b-a139-f81b4ac81d84","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T21:14:56.020911Z","strongest_claim":"Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.","one_line_summary":"VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.","pith_extraction_headline":"A contrastive training method turns vision-language models into versatile multimodal embedding models that improve 10 to 20 percent on a new benchmark of 36 tasks."},"references":{"count":45,"sample":[{"doi":"","year":null,"title":"Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone","work_id":"feef9556-a016-493c-abd2-0c97a23a7ebf","ref_index":1,"cited_arxiv_id":"2404.14219","is_internal_anchor":true},{"doi":"","year":2012,"title":"SemEval-2012 task 6: A pilot on semantic textual similarity","work_id":"b3da1a53-1971-4931-9961-0c8af87a30a4","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"arXiv preprint arXiv:2211.09260 , year=","work_id":"8ff1935b-870d-4685-99e0-95249679188d","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Llm2vec: Large language models are secretly powerful text encoders","work_id":"156e1320-54cd-416f-af15-d9da54374957","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2017,"title":"SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation","work_id":"a8ff10da-ea02-4989-80b4-bbd28ac1e663","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":45,"snapshot_sha256":"485c1fe862d436c5a0563abbab722ef29ae07269e3c036a03b4f88d505dd298c","internal_anchors":9},"formal_canon":{"evidence_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2410.05160","created_at":"2026-05-17T23:38:13.047001+00:00"},{"alias_kind":"arxiv_version","alias_value":"2410.05160v3","created_at":"2026-05-17T23:38:13.047001+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2410.05160","created_at":"2026-05-17T23:38:13.047001+00:00"},{"alias_kind":"pith_short_12","alias_value":"K7HXT4ZN3IS3","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"K7HXT4ZN3IS3EC6A","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"K7HXT4ZN","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":24,"internal_anchor_count":24,"sample":[{"citing_arxiv_id":"2505.12601","citing_title":"Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers","ref_index":32,"is_internal_anchor":true},{"citing_arxiv_id":"2605.21832","citing_title":"FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.16638","citing_title":"TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2509.18095","citing_title":"MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction","ref_index":25,"is_internal_anchor":true},{"citing_arxiv_id":"2507.04590","citing_title":"VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2511.13415","citing_title":"Attention Grounded Enhancement for Visual Document Retrieval","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13511","citing_title":"Adapting MLLMs for Nuanced Video Retrieval","ref_index":37,"is_internal_anchor":true},{"citing_arxiv_id":"2601.21262","citing_title":"CausalEmbed: Auto-Regressive Multi-Vector Generation in Latent Space for Visual Document Embedding","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20354","citing_title":"EmbeddingGemma: Powerful and Lightweight Text Representations","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":87,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13277","citing_title":"Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation","ref_index":1,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02073","citing_title":"PLUME: Latent Reasoning Based Universal Multimodal Embedding","ref_index":20,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08384","citing_title":"jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08384","citing_title":"jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25273","citing_title":"Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.23321","citing_title":"MMEB-V3: Measuring the Performance Gaps of Omni-Modality Embedding Models","ref_index":10,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22280","citing_title":"Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11095","citing_title":"Bottleneck Tokens for Unified Multimodal Retrieval","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2604.10167","citing_title":"Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07220","citing_title":"HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07201","citing_title":"BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07079","citing_title":"MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2604.17054","citing_title":"mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval","ref_index":21,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":0,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ","json":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ.json","graph_json":"https://pith.science/api/pith-number/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/graph.json","events_json":"https://pith.science/api/pith-number/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/events.json","paper":"https://pith.science/paper/K7HXT4ZN"},"agent_actions":{"view_html":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ","download_json":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ.json","view_paper":"https://pith.science/paper/K7HXT4ZN","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2410.05160&json=true","fetch_graph":"https://pith.science/api/pith-number/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/graph.json","fetch_events":"https://pith.science/api/pith-number/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/action/timestamp_anchor","attest_storage":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/action/storage_attestation","attest_author":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/action/author_attestation","sign_citation":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/action/citation_signature","submit_replication":"https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ/action/replication_record"}},"created_at":"2026-05-17T23:38:13.047001+00:00","updated_at":"2026-05-17T23:38:13.047001+00:00"}