{"record_type":"pith_number_record","schema_url":"https://pith.science/schemas/pith-number/v1.json","pith_number":"pith:2024:H2YTRIIBCHUG55H22SZIBC4O6L","short_pith_number":"pith:H2YTRIIB","schema_version":"1.0","canonical_sha256":"3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7","source":{"kind":"arxiv","id":"2407.12580","version":1},"attestation_state":"computed","paper":{"title":"E5-V: Universal Embeddings with Multimodal Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models.","cross_cats":["cs.CV","cs.IR"],"primary_cat":"cs.CL","authors_text":"Deqing Wang, Feng Sun, Fuzhen Zhuang, Haizhen Huang, MingHui Song, Qi Zhang, Ting Jiang, Weiwei Deng, Zihan Zhang","submitted_at":"2024-07-17T14:04:12Z","abstract_excerpt":"Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong perform"},"verification_status":{"content_addressed":true,"pith_receipt":true,"author_attested":false,"weak_author_claims":0,"strong_author_claims":0,"externally_anchored":false,"storage_verified":false,"citation_signatures":0,"replication_records":0,"graph_snapshot":true,"references_resolved":true,"formal_links_present":true},"canonical_record":{"source":{"id":"2407.12580","kind":"arxiv","version":1},"metadata":{"license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","primary_cat":"cs.CL","submitted_at":"2024-07-17T14:04:12Z","cross_cats_sorted":["cs.CV","cs.IR"],"title_canon_sha256":"d428716da15993d082556602deea93b9d1867317eb8bfbd1a0ba8c0afcf0e1f9","abstract_canon_sha256":"dcb6625152f4f4eb341ffc634f6dcc46dad5877714b30f83aef58aa09ffdff31"},"schema_version":"1.0"},"receipt":{"kind":"pith_receipt","key_id":"pith-v1-2026-05","algorithm":"ed25519","signed_at":"2026-05-17T23:38:46.330824Z","signature_b64":"e3W7Izc5oHAwiW+H8A+aQqecOXwxGuzkfXZLiekcb8/0EVtvqm3F+IMasRyu7uac7/qKpAEiSZvEznrqSR64AA==","signed_message":"canonical_sha256_bytes","builder_version":"pith-number-builder-2026-05-17-v1","receipt_version":"0.3","canonical_sha256":"3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7","last_reissued_at":"2026-05-17T23:38:46.330376Z","signature_status":"signed_v1","first_computed_at":"2026-05-17T23:38:46.330376Z","public_key_fingerprint":"8d4b5ee74e4693bcd1df2446408b0d54"},"graph_snapshot":{"paper":{"title":"E5-V: Universal Embeddings with Multimodal Large Language Models","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models.","cross_cats":["cs.CV","cs.IR"],"primary_cat":"cs.CL","authors_text":"Deqing Wang, Feng Sun, Fuzhen Zhuang, Haizhen Huang, MingHui Song, Qi Zhang, Ting Jiang, Weiwei Deng, Zihan Zhang","submitted_at":"2024-07-17T14:04:12Z","abstract_excerpt":"Multimodal large language models (MLLMs) have shown promising advancements in general visual and language understanding. However, the representation of multimodal information using MLLMs remains largely unexplored. In this work, we introduce a new framework, E5-V, designed to adapt MLLMs for achieving universal multimodal embeddings. Our findings highlight the significant potential of MLLMs in representing multimodal inputs compared to previous approaches. By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong perform"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the internal representations learned by MLLMs during pretraining are already rich enough to support universal multimodal embeddings via prompting alone, and that text-only contrastive training will generalize to unseen modalities without any multimodal data.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"3af9e9a4f01340c5e8f2ff35b6e073e35c9daba4f84a849cfe5a83c9a9746732"},"source":{"id":"2407.12580","kind":"arxiv","version":1},"verdict":{"id":"e3db4649-2ebd-4c70-9ea1-70bb0f75dc12","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-16T22:48:42.759101Z","strongest_claim":"By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%.","one_line_summary":"E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the internal representations learned by MLLMs during pretraining are already rich enough to support universal multimodal embeddings via prompting alone, and that text-only contrastive training will generalize to unseen modalities without any multimodal data.","pith_extraction_headline":"Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models."},"references":{"count":16,"sample":[{"doi":"","year":null,"title":"isearle: Improving textual inversion for zero-shot composed image retrieval","work_id":"f67f5d54-a40d-4d81-a74b-c9ba4e6c6007","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model","work_id":"0fe2cfd8-d442-4ceb-b1a9-a465704f39b2","ref_index":2,"cited_arxiv_id":"2304.15010","is_internal_anchor":true},{"doi":"","year":null,"title":"SimCSE: Simple Contrastive Learning of Sentence Embeddings","work_id":"e9fab1e4-f443-4963-9f2a-83f772482c00","ref_index":3,"cited_arxiv_id":"2104.08821","is_internal_anchor":true},{"doi":"","year":null,"title":"Scaling sentence embeddings with large language models","work_id":"b83f2bc5-7ffa-4fa0-a007-62ae2c04fbf9","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"PromptBERT: Improving BERT Sentence Embeddings with Prompts","work_id":"0fb23e19-9f28-4928-8635-9c024f774833","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":16,"snapshot_sha256":"7b85b8c42631fe0d24d0cafeb7aa16c21d9e5952d6440e9024736cce348aa998","internal_anchors":4},"formal_canon":{"evidence_count":2,"snapshot_sha256":"2702817b2d3958eb7e3de090a5384080f2b5dc2888b14624acdf30f580f0274c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"},"aliases":[{"alias_kind":"arxiv","alias_value":"2407.12580","created_at":"2026-05-17T23:38:46.330440+00:00"},{"alias_kind":"arxiv_version","alias_value":"2407.12580v1","created_at":"2026-05-17T23:38:46.330440+00:00"},{"alias_kind":"doi","alias_value":"10.48550/arxiv.2407.12580","created_at":"2026-05-17T23:38:46.330440+00:00"},{"alias_kind":"pith_short_12","alias_value":"H2YTRIIBCHUG","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_16","alias_value":"H2YTRIIBCHUG55H2","created_at":"2026-05-18T12:33:37.589309+00:00"},{"alias_kind":"pith_short_8","alias_value":"H2YTRIIB","created_at":"2026-05-18T12:33:37.589309+00:00"}],"events":[],"event_summary":{},"paper_claims":[],"inbound_citations":{"count":25,"internal_anchor_count":25,"sample":[{"citing_arxiv_id":"2605.16638","citing_title":"TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":86,"is_internal_anchor":true},{"citing_arxiv_id":"2509.00798","citing_title":"Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering","ref_index":17,"is_internal_anchor":true},{"citing_arxiv_id":"2509.24621","citing_title":"FreeRet: MLLMs as Training-Free Retrievers","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2410.05160","citing_title":"VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2512.13511","citing_title":"Adapting MLLMs for Nuanced Video Retrieval","ref_index":36,"is_internal_anchor":true},{"citing_arxiv_id":"2509.20354","citing_title":"EmbeddingGemma: Powerful and Lightweight Text Representations","ref_index":9,"is_internal_anchor":true},{"citing_arxiv_id":"2412.16855","citing_title":"GME: Improving Universal Multimodal Retrieval by Multimodal LLMs","ref_index":22,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14448","citing_title":"Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture","ref_index":13,"is_internal_anchor":true},{"citing_arxiv_id":"2605.14311","citing_title":"Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment","ref_index":86,"is_internal_anchor":true},{"citing_arxiv_id":"2605.13277","citing_title":"Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation","ref_index":4,"is_internal_anchor":true},{"citing_arxiv_id":"2604.02073","citing_title":"PLUME: Latent Reasoning Based Universal Multimodal Embedding","ref_index":19,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08384","citing_title":"jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2601.04720","citing_title":"Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking","ref_index":12,"is_internal_anchor":true},{"citing_arxiv_id":"2605.08384","citing_title":"jina-embeddings-v5-omni: Geometry-preserving Embeddings via Locked Aligned Towers","ref_index":15,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05831","citing_title":"Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13710","citing_title":"SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.25273","citing_title":"Combating Visual Neglect and Semantic Drift in Large Multimodal Models for Enhanced Cross-Modal Retrieval","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.22280","citing_title":"Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings","ref_index":16,"is_internal_anchor":true},{"citing_arxiv_id":"2605.05831","citing_title":"Unifying Scientific Communication: Fine-Grained Correspondence Across Scientific Media","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2605.00063","citing_title":"A Survey of Reasoning-Intensive Retrieval: Progress and Challenges","ref_index":26,"is_internal_anchor":true},{"citing_arxiv_id":"2604.11095","citing_title":"Bottleneck Tokens for Unified Multimodal Retrieval","ref_index":8,"is_internal_anchor":true},{"citing_arxiv_id":"2604.07419","citing_title":"ReAlign: Optimizing the Visual Document Retriever with Reasoning-Guided Fine-Grained Alignment","ref_index":24,"is_internal_anchor":true},{"citing_arxiv_id":"2604.13710","citing_title":"SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs","ref_index":14,"is_internal_anchor":true},{"citing_arxiv_id":"2604.20135","citing_title":"AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce","ref_index":7,"is_internal_anchor":true}]},"formal_canon":{"evidence_count":2,"sample":[],"anchors":[]},"links":{"html":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L","json":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L.json","graph_json":"https://pith.science/api/pith-number/H2YTRIIBCHUG55H22SZIBC4O6L/graph.json","events_json":"https://pith.science/api/pith-number/H2YTRIIBCHUG55H22SZIBC4O6L/events.json","paper":"https://pith.science/paper/H2YTRIIB"},"agent_actions":{"view_html":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L","download_json":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L.json","view_paper":"https://pith.science/paper/H2YTRIIB","resolve_alias":"https://pith.science/api/pith-number/resolve?arxiv=2407.12580&json=true","fetch_graph":"https://pith.science/api/pith-number/H2YTRIIBCHUG55H22SZIBC4O6L/graph.json","fetch_events":"https://pith.science/api/pith-number/H2YTRIIBCHUG55H22SZIBC4O6L/events.json","actions":{"anchor_timestamp":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L/action/timestamp_anchor","attest_storage":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L/action/storage_attestation","attest_author":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L/action/author_attestation","sign_citation":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L/action/citation_signature","submit_replication":"https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L/action/replication_record"}},"created_at":"2026-05-17T23:38:46.330440+00:00","updated_at":"2026-05-17T23:38:46.330440+00:00"}