Pith Number

pith:K7HXT4ZN

pith:2024:K7HXT4ZN3IS3EC6AAMEVNSKHRJ
not attested · not anchored · not stored · refs resolved

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Rui Meng, Semih Yavuz, Wenhu Chen, Xinyi Yang, Yingbo Zhou, Ziyan Jiang

A contrastive training method turns vision-language models into versatile multimodal embedding models that outperform existing ones by an absolute 10 to 20 percent on average on a new benchmark of 36 tasks.

arxiv:2410.05160 v3 · 2024-10-07 · cs.CV · cs.AI · cs.CL

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{K7HXT4ZN3IS3EC6AAMEVNSKHRJ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 strongest claim

Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.

C2 weakest assumption

The assumption that contrastive training on the 20 MMEB training datasets produces embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.

C3 one line summary

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.

References

45 extracted · 45 resolved · 9 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219
[2] SemEval-2012 Task 6: A pilot on semantic textual similarity · 2012
[3] arXiv preprint · arXiv:2211.09260
[4] LLM2Vec: Large language models are secretly powerful text encoders
[5] SemEval-2017 Task 1: Semantic textual similarity multilingual and crosslingual focused evaluation · 2017

Cited by

24 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:38:13.046884Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f

Aliases

arxiv: 2410.05160 · arxiv_version: 2410.05160v3 · doi: 10.48550/arxiv.2410.05160 · pith_short_12: K7HXT4ZN3IS3 · pith_short_16: K7HXT4ZN3IS3EC6A · pith_short_8: K7HXT4ZN
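
The short aliases above are prefixes of the full Pith Number, and the full number itself appears to be derivable from the canonical hash: Base32-encoding the SHA-256 digest listed on this page and keeping the first 26 characters reproduces `K7HXT4ZN3IS3EC6AAMEVNSKHRJ`. This relationship is inferred from the values shown here, not taken from a published specification, so the sketch below should be read as an assumption. The function name `pith_number_from_hash` is hypothetical.

```python
import base64

# Canonical hash as listed on this record page.
CANONICAL_HASH = "57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode a hex digest (RFC 4648 alphabet) and keep a prefix.

    Assumption: this mapping is inferred from the identifiers on this page,
    not from documentation of the Pith Number scheme.
    """
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


full = pith_number_from_hash(CANONICAL_HASH)
print(full)  # K7HXT4ZN3IS3EC6AAMEVNSKHRJ

# The short aliases are plain prefixes of the full identifier.
for n in (8, 12, 16):
    print(f"pith_short_{n}: {full[:n]}")
```

If the inference holds, any alias can be checked offline against the canonical hash without contacting the service.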
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-07T16:14:05Z",
    "title_canon_sha256": "41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.05160",
    "kind": "arxiv",
    "version": 3
  }
}
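
The verification one-liner above can also be run offline against a saved copy of the record. The following is a minimal sketch of the same canonicalization (sorted keys, compact separators, UTF-8 without ASCII escaping, then SHA-256); the function name `canonical_hash` is hypothetical, and the record literal is copied from the JSON shown on this page. The printed digest is meant to be compared with the value under "Canonical hash"; a match depends on this literal being byte-for-byte equivalent to what the service signs.

```python
import hashlib
import json


def canonical_hash(record: dict) -> str:
    """Serialize a record the way the page's one-liner does and hash it."""
    canonical = json.dumps(
        record,
        sort_keys=True,          # key order must not affect the hash
        separators=(",", ":"),   # no whitespace between tokens
        ensure_ascii=False,      # keep non-ASCII characters as UTF-8
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Record copied from the "Canonical record JSON" section of this page.
record = {
    "metadata": {
        "abstract_canon_sha256": "fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0",
        "cross_cats_sorted": ["cs.AI", "cs.CL"],
        "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "primary_cat": "cs.CV",
        "submitted_at": "2024-10-07T16:14:05Z",
        "title_canon_sha256": "41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4",
    },
    "schema_version": "1.0",
    "source": {"id": "2410.05160", "kind": "arxiv", "version": 3},
}

digest = canonical_hash(record)
print(digest)  # compare with the canonical hash listed on this page
```

Sorting keys and fixing the separators is what makes the hash reproducible: two mirrors that load the same record in a different key order still serialize it to identical bytes.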