Pith Number

pith:H2YTRIIB

pith:2024:H2YTRIIBCHUG55H22SZIBC4O6L

not attested · not anchored · not stored · refs resolved

E5-V: Universal Embeddings with Multimodal Large Language Models

Deqing Wang, Feng Sun, Fuzhen Zhuang, Haizhen Huang, MingHui Song, Qi Zhang, Ting Jiang, Weiwei Deng, Zihan Zhang

Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models.

arxiv:2407.12580 v1 · 2024-07-17 · cs.CL · cs.CV · cs.IR

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{H2YTRIIBCHUG55H22SZIBC4O6L}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%.

C2 · weakest assumption

That the internal representations learned by MLLMs during pretraining are already rich enough to support universal multimodal embeddings via prompting alone, and that text-only contrastive training will generalize to unseen modalities without any multimodal data.

C3 · one-line summary

E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.

References

16 extracted · 16 resolved · 4 Pith anchors

[1] iSEARLE: Improving Textual Inversion for Zero-Shot Composed Image Retrieval

[2] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model · arXiv:2304.15010

[3] SimCSE: Simple Contrastive Learning of Sentence Embeddings · arXiv:2104.08821

[4] Scaling sentence embeddings with large language models

[5] PromptBERT: Improving BERT Sentence Embeddings with Prompts

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

FreeRet: MLLMs as Training-Free Retrievers

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Receipt and verification

First computed	2026-05-17T23:38:46.330376Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7

Aliases

arxiv: 2407.12580 · arxiv_version: 2407.12580v1 · doi: 10.48550/arxiv.2407.12580 · pith_short_12: H2YTRIIBCHUG · pith_short_16: H2YTRIIBCHUG55H2 · pith_short_8: H2YTRIIB
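The identifiers on this page appear to be related in a simple way: the 26-character Pith Number matches the start of the Base32 encoding of the canonical hash, and the `pith_short_*` aliases are prefixes of it. The sketch below illustrates that relationship. It is an inference from the values shown here, not a published specification, and the helper names `pith_from_hash` and `short_aliases` are hypothetical.

```python
import base64

# Values copied from this record.
CANONICAL_HASH = "3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7"
PITH_NUMBER = "H2YTRIIBCHUG55H22SZIBC4O6L"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the raw digest bytes and keep the first `length` characters.

    Assumption: this rule is inferred from the values on this page only.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


def short_aliases(pith_number: str) -> dict:
    """Return the 8-, 12-, and 16-character prefixes listed under Aliases."""
    return {f"pith_short_{n}": pith_number[:n] for n in (8, 12, 16)}


derived = pith_from_hash(CANONICAL_HASH)
aliases = short_aliases(derived)
print(derived == PITH_NUMBER)
print(aliases)
```

If the inference holds for other records, a short alias can be checked against a full Pith Number with a plain prefix comparison, without contacting the resolver.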

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "dcb6625152f4f4eb341ffc634f6dcc46dad5877714b30f83aef58aa09ffdff31",
    "cross_cats_sorted": [
      "cs.CV",
      "cs.IR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-07-17T14:04:12Z",
    "title_canon_sha256": "d428716da15993d082556602deea93b9d1867317eb8bfbd1a0ba8c0afcf0e1f9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2407.12580",
    "kind": "arxiv",
    "version": 1
  }
}
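The same check can be run offline against the record printed above. The following is a minimal sketch that mirrors the canonicalization used in the one-liner (sorted keys, compact separators, UTF-8, SHA-256). The function name `canonical_sha256` is hypothetical, and the sketch assumes the JSON shown on this page is identical to what the resolver returns under `canonical_record`.

```python
import hashlib
import json

# Canonical record as printed on this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "dcb6625152f4f4eb341ffc634f6dcc46dad5877714b30f83aef58aa09ffdff31",
        "cross_cats_sorted": ["cs.CV", "cs.IR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-07-17T14:04:12Z",
        "title_canon_sha256": "d428716da15993d082556602deea93b9d1867317eb8bfbd1a0ba8c0afcf0e1f9",
    },
    "schema_version": "1.0",
    "source": {"id": "2407.12580", "kind": "arxiv", "version": 1},
}

# Hash published under "Canonical hash".
EXPECTED = "3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7"


def canonical_sha256(obj) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the UTF-8 bytes."""
    payload = json.dumps(
        obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because the keys are sorted before hashing, the order in which a mirror stores the fields does not affect the result; only the values and the serialization rules do.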