Pith Number

pith:K7HXT4ZN

pith:2024:K7HXT4ZN3IS3EC6AAMEVNSKHRJ

not attested · not anchored · not stored · refs resolved

VLM2Vec: Training Vision-Language Models for Massive Multimodal Embedding Tasks

Rui Meng, Semih Yavuz, Wenhu Chen, Xinyi Yang, Yingbo Zhou, Ziyan Jiang

A contrastive training method turns vision-language models into versatile multimodal embedding models that improve on existing models by an absolute average of 10 to 20 percent on MMEB, a new benchmark of 36 tasks.

arxiv:2410.05160 v3 · 2024-10-07 · cs.CV · cs.AI · cs.CL


Add to your LaTeX paper

\usepackage{pith}
\pithnumber{K7HXT4ZN3IS3EC6AAMEVNSKHRJ}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live · download bundle · merged state

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
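The page does not specify the merge algorithm, so the following is only a minimal sketch of what "deterministic" means here: any mirror that holds the same set of events should arrive at the same state, regardless of delivery order or duplicates. The event fields (`id`, `ts`, `field`, `value`) and the last-writer-wins rule are hypothetical and chosen for illustration; they are not taken from the Pith schema.

```python
from typing import Any


def merge_state(base: dict[str, Any], events: list[dict[str, Any]]) -> dict[str, Any]:
    """Fold events into a state in an order that does not depend on arrival order.

    Hypothetical event shape: {"id": str, "ts": str, "field": str, "value": Any}.
    """
    # Drop duplicate deliveries; events with the same id are assumed identical.
    unique = {event["id"]: event for event in events}
    # Impose a total order: timestamp first, event id as the tie-breaker.
    ordered = sorted(unique.values(), key=lambda e: (e["ts"], e["id"]))
    state = dict(base)
    for event in ordered:
        state[event["field"]] = event["value"]  # last writer wins (assumption)
    return state


# Illustrative data only, not real Pith events.
base = {"kind": "example"}
events = [
    {"id": "e1", "ts": "2026-01-01T00:00:00Z", "field": "status", "value": "draft"},
    {"id": "e2", "ts": "2026-01-02T00:00:00Z", "field": "status", "value": "live"},
]

in_order = merge_state(base, events)
shuffled = merge_state(base, [events[1], events[0], events[1]])  # reordered, with a duplicate
print(in_order)              # {'kind': 'example', 'status': 'live'}
print(in_order == shuffled)  # True
```

Sorting on a total order before folding is what makes the result independent of how a mirror received the events.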

Claims

C1 strongest claim

Our results show that VLM2Vec achieves an absolute average improvement of 10% to 20% over existing multimodal embedding models on both in-distribution and out-of-distribution datasets in MMEB. We show that VLMs are secretly strong embedding models.

C2 weakest assumption

Contrastive training on the 20 MMEB training datasets is assumed to produce embeddings that generalize to the 16 evaluation datasets (including out-of-distribution ones) without substantial overfitting or data leakage between splits.

C3 one line summary

VLM2Vec converts state-of-the-art vision-language models into universal multimodal embedders via contrastive training on the new MMEB benchmark, delivering 10-20% absolute gains over prior models on both in-distribution and out-of-distribution tasks.

References

45 extracted · 45 resolved · 9 Pith anchors

[1] Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone · arXiv:2404.14219

[2] SemEval-2012 task 6: A pilot on semantic textual similarity · 2012

[3] arXiv preprint arXiv:2211.09260

[4] LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders

[5] SemEval-2017 task 1: Semantic textual similarity multilingual and crosslingual focused evaluation · 2017

Cited by

24 papers in Pith

Rethinking Predictive Modeling for LLM Routing: When Simple kNN Beats Complex Learned Routers

FLUID: From Ephemeral IDs to Multimodal Semantic Codes for Industrial-Scale Livestreaming Recommendation

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

MetaEmbed: Scaling Multimodal Retrieval at Test-Time with Flexible Late Interaction

Receipt and verification

First computed	2026-05-17T23:38:13.046884Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f

Aliases

arxiv: 2410.05160 · arxiv_version: 2410.05160v3 · doi: 10.48550/arxiv.2410.05160 · pith_short_12: K7HXT4ZN3IS3 · pith_short_16: K7HXT4ZN3IS3EC6A · pith_short_8: K7HXT4ZN
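The short aliases are prefixes of the full Pith Number. The values on this page are also consistent with the number itself being the base32 encoding (RFC 4648 alphabet) of the canonical hash bytes, cut to 26 characters. That relationship is inferred from the values shown here, not from a published specification, so treat the sketch below as an observation rather than the official derivation. The function name `pith_from_hash` is made up for illustration.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f"
PITH_NUMBER = "K7HXT4ZN3IS3EC6AAMEVNSKHRJ"


def pith_from_hash(hex_digest: str, length: int = 26) -> str:
    """Assumed derivation: base32-encode the digest bytes and keep a prefix."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


derived = pith_from_hash(CANONICAL_HASH)
print(derived)                 # K7HXT4ZN3IS3EC6AAMEVNSKHRJ
print(derived == PITH_NUMBER)  # True

# The short aliases are the same encoding cut to 8, 12 and 16 characters.
aliases = {n: pith_from_hash(CANONICAL_HASH, n) for n in (8, 12, 16)}
print(aliases)  # {8: 'K7HXT4ZN', 12: 'K7HXT4ZN3IS3', 16: 'K7HXT4ZN3IS3EC6A'}
```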

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/K7HXT4ZN3IS3EC6AAMEVNSKHRJ \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.CL"
    ],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-07T16:14:05Z",
    "title_canon_sha256": "41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2410.05160",
    "kind": "arxiv",
    "version": 3
  }
}
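The canonicalization used by the verification command above (sorted keys, compact separators, UTF-8, then SHA-256) can also be run offline against the record as printed here. This is a sketch that assumes the JSON shown on this page is the complete canonical record; if the resolver returns additional fields, the digest will differ. The helper name `canonical_sha256` is illustrative.

```python
import hashlib
import json

# Canonical record as printed on this page.
RECORD_TEXT = """
{
  "metadata": {
    "abstract_canon_sha256": "fec1327baaf6d937bd58b1cd02c0e6490a6f95af146745fda3f018f0c2140ea0",
    "cross_cats_sorted": ["cs.AI", "cs.CL"],
    "license": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
    "primary_cat": "cs.CV",
    "submitted_at": "2024-10-07T16:14:05Z",
    "title_canon_sha256": "41d27d66a80e95ca2a37e1619bf0335b9f6ba1bf69ec247231ff3a12e23891d4"
  },
  "schema_version": "1.0",
  "source": {"id": "2410.05160", "kind": "arxiv", "version": 3}
}
"""

# Hash published on this page, used only for comparison.
EXPECTED = "57cf79f32dda25b20bc0030956c9478a46343646bb5f8893142e0cfa34d5715f"


def canonical_sha256(record: dict) -> str:
    """Serialize with sorted keys and no extra whitespace, then hash the UTF-8 bytes."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_TEXT)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted at every nesting level before hashing, the digest does not depend on the key order or whitespace of the JSON a mirror happens to serve.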