pith:K7HXT4ZN
VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks
A contrastive training method turns vision-language models into versatile multimodal embedding models that outperform existing multimodal embedding models by 10 to 20 percentage points on average across a new benchmark of 36 datasets.
arxiv:2410.05160 v3 · 2024-10-07 · cs.CV · cs.AI · cs.CL
Add to your LaTeX paper
\usepackage{pith}
\pithnumber{K7HXT4ZN3IS3EC6AAMEVNSKHRJ}
Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.
Claims
Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.
These results rest on the assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.
VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.
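As a rough illustration of what contrastive training on paired embeddings involves, the sketch below computes a generic InfoNCE-style loss with in-batch negatives. This record does not state the exact loss VLM2Vec uses, so the loss form, the cosine similarity, the function names `cosine` and `info_nce_loss`, and the temperature value are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, targets, temperature=0.05):
    """Mean InfoNCE loss over a batch (illustrative, not the paper's code).

    queries[i] and targets[i] form a positive pair; every other target in
    the batch acts as a negative for query i. The temperature is an
    arbitrary illustrative value.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, target) / temperature for target in targets]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Negative log-probability assigned to the matching target.
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: each query matches the target at the same index.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce_loss(batch, batch)
mismatched_loss = info_nce_loss(batch, batch[::-1])
```

The loss is small when each query is closest to its own target and large when the pairs are mismatched, which is the signal that pulls matching query and target embeddings together during training.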
Receipt and verification
| First computed | 2026-05-17T23:38:13.046884Z |
|---|---|
| Builder | pith-number-builder-2026-05-17-v1 |
| Signature | Pith Ed25519 (pith-v1-2026-05) · public key |
| Schema | pith-number/v1.0 |
Canonical hash
57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ \
| jq -c '.canonical_record' \
| python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f
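The hashing step of the command above can also be written as a small standalone function. This is a minimal sketch that mirrors the one-liner's serialization settings (sorted keys, compact separators, non-ASCII characters kept as-is, UTF-8 encoding); the function name `canonical_sha256` is hypothetical and not part of any Pith tooling.

```python
import hashlib
import json


def canonical_sha256(record):
    """Return the SHA-256 hex digest of a record's canonical JSON form.

    Canonical form here means: keys sorted, no whitespace between tokens,
    non-ASCII characters left unescaped, encoded as UTF-8. These settings
    are copied from the verification one-liner.
    """
    canonical = json.dumps(
        record,
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    ).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


# Key order in the input does not affect the digest.
digest_a = canonical_sha256({"b": 1, "a": 2})
digest_b = canonical_sha256({"a": 2, "b": 1})
```

To check a record, parse the `canonical_record` object returned by the endpoint with `json.loads`, pass it to the function, and compare the result with the canonical hash listed above.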
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-07T16:14:05Z",
    "title_canon_sha256": "41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.05160",
    "kind": "arxiv",
    "version": 3
  }
}