Pith Number

pith:H2YTRIIB

pith:2024:H2YTRIIBCHUG55H22SZIBC4O6L
not attested · not anchored · not stored · refs resolved

E5-V: Universal Embeddings with Multimodal Large Language Models

Deqing Wang, Feng Sun, Fuzhen Zhuang, Haizhen Huang, MingHui Song, Qi Zhang, Ting Jiang, Weiwei Deng, Zihan Zhang

Prompted MLLMs trained only on text pairs deliver universal multimodal embeddings that rival or exceed specialized models.

arxiv:2407.12580 v1 · 2024-07-17 · cs.CL · cs.CV · cs.IR

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{H2YTRIIBCHUG55H22SZIBC4O6L}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live · download bundle · merged state
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

By leveraging MLLMs with prompts, E5-V effectively bridges the modality gap between different types of inputs, demonstrating strong performance in multimodal embeddings even without fine-tuning. We propose a single modality training approach for E5-V, where the model is trained exclusively on text pairs. This method demonstrates significant improvements over traditional multimodal training on image-text pairs, while reducing training costs by approximately 95%.

C2 · weakest assumption

That the internal representations learned by MLLMs during pretraining are already rich enough to support universal multimodal embeddings via prompting alone, and that text-only contrastive training will generalize to unseen modalities without any multimodal data.

C3 · one-line summary

E5-V produces strong universal multimodal embeddings from MLLMs trained solely on text pairs, often surpassing prior methods across retrieval and related tasks without multimodal fine-tuning.
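
The claims above refer to contrastive training on text pairs without stating the objective. As a rough illustration only, the sketch below computes an InfoNCE-style loss with in-batch negatives over toy embedding vectors. The loss form, the temperature of 0.05, and the function names are assumptions for illustration and are not taken from this record; real embeddings would come from the prompted MLLM rather than hand-written lists.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch of (anchor, positive) text-pair embeddings.

    Assumption: positives[i] is the match for anchors[i]; every other
    positive in the batch serves as a negative for that anchor.
    """
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(x - top) for x in logits))
        # Negative log-probability of picking the true positive.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    # Toy 2-d "sentence embeddings" (hypothetical values).
    anchors = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]
    swapped = [[0.0, 1.0], [1.0, 0.0]]
    print(info_nce_loss(anchors, aligned))
    print(info_nce_loss(anchors, swapped))
```

With the toy vectors, the loss is near zero when each anchor is paired with its own positive and large when the pairs are swapped, which is the behaviour a contrastive objective is meant to reward.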

References

16 extracted · 16 resolved · 4 Pith anchors

[1] iSEARLE: Improving textual inversion for zero-shot composed image retrieval
[2] LLaMA-Adapter V2: Parameter-Efficient Visual Instruction Model · arXiv:2304.15010
[3] SimCSE: Simple Contrastive Learning of Sentence Embeddings · arXiv:2104.08821
[4] Scaling sentence embeddings with large language models
[5] PromptBERT: Improving BERT Sentence Embeddings with Prompts

Formal links

2 machine-checked theorem links

Cited by

25 papers in Pith

Receipt and verification
First computed 2026-05-17T23:38:46.330376Z
Builder pith-number-builder-2026-05-17-v1
Signature Pith Ed25519 (pith-v1-2026-05) · public key
Schema pith-number/v1.0

Canonical hash

3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7

Aliases

arxiv: 2407.12580 · arxiv_version: 2407.12580v1 · doi: 10.48550/arxiv.2407.12580 · pith_short_12: H2YTRIIBCHUG · pith_short_16: H2YTRIIBCHUG55H2 · pith_short_8: H2YTRIIB
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/H2YTRIIBCHUG55H22SZIBC4O6L \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "dcb6625152f4f4eb341ffc634f6dcc46dad5877714b30f83aef58aa09ffdff31",
    "cross_cats_sorted": [
      "cs.CV",
      "cs.IR"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-07-17T14:04:12Z",
    "title_canon_sha256": "d428716da15993d082556602deea93b9d1867317eb8bfbd1a0ba8c0afcf0e1f9"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2407.12580",
    "kind": "arxiv",
    "version": 1
  }
}
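
The shell command above can also be reproduced offline. The sketch below applies the same serialization as that one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) to the canonical record as printed on this page and compares the digest with the published canonical hash. It assumes the JSON shown here is identical to the `canonical_record` the API returns; the names `CANONICAL_RECORD`, `EXPECTED_HASH`, and `canonical_sha256` are illustrative.

```python
import hashlib
import json

# Canonical record copied from this page (assumed identical to the API's
# `canonical_record` field).
CANONICAL_RECORD = {
    "metadata": {
        "abstract_canon_sha256": "dcb6625152f4f4eb341ffc634f6dcc46dad5877714b30f83aef58aa09ffdff31",
        "cross_cats_sorted": ["cs.CV", "cs.IR"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CL",
        "submitted_at": "2024-07-17T14:04:12Z",
        "title_canon_sha256": "d428716da15993d082556602deea93b9d1867317eb8bfbd1a0ba8c0afcf0e1f9",
    },
    "schema_version": "1.0",
    "source": {"id": "2407.12580", "kind": "arxiv", "version": 1},
}

# Canonical hash as published on this page.
EXPECTED_HASH = "3eb138a10111e86ef4fad4b2808b8ef2e0769cd0c85be9142afd142b1608f8f7"


def canonical_sha256(record):
    """SHA-256 hex digest of the record in sorted-key, compact JSON form."""
    payload = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the digest
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters unescaped
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


if __name__ == "__main__":
    digest = canonical_sha256(CANONICAL_RECORD)
    print(digest)
    print("match" if digest == EXPECTED_HASH else "mismatch")
```

Because keys are sorted before hashing, the digest does not depend on the order in which a mirror stores or emits the fields.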