Pith Number

pith:NDJPKY32

pith:2024:NDJPKY322GAKH3VG5AOJTLCDTR

not attested · not anchored · not stored · refs resolved

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Bryan Catanzaro, Chankyu Lee, Jonathan Raiman, Mengyao Xu, Mohammad Shoeybi, Rajarshi Roy, Wei Ping

Decoder-only LLMs trained with a latent attention layer, no causal attention mask, and two-stage contrastive instruction tuning outperform BERT- and T5-based embedding models on general-purpose embedding tasks.

arxiv:2405.17428 v3 · 2024-05-27 · cs.CL · cs.AI · cs.IR · cs.LG

Add to your LaTeX paper

\usepackage{pith}
\pithnumber{NDJPKY322GAKH3VG5AOJTLCDTR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp

2 Internet Archive

3 Author claim open

4 Citations open

5 Replications open

✓ Portable graph bundle live

The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.
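The page does not describe the merge algorithm itself, so the following is only a generic illustration of what "deterministic" means here: a fold over events whose result does not depend on the order in which a mirror received them. The `Event` fields, the sort key, and the last-writer-wins rule are all assumptions made for this sketch, not the Pith event schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; the real bundle schema is not shown on this page.
    event_id: str
    timestamp: str  # ISO 8601 UTC, so string order equals time order
    field: str
    value: str


def merge(events):
    """Fold events into a state dict in a total order independent of input order."""
    # Deduplicate by id (assumes events sharing an id are identical copies).
    unique = {e.event_id: e for e in events}
    state = {}
    # Sorting by (timestamp, event_id) breaks timestamp ties deterministically.
    for e in sorted(unique.values(), key=lambda e: (e.timestamp, e.event_id)):
        state[e.field] = e.value  # last writer wins (an assumption)
    return state


# Made-up events, for illustration only.
events = [
    Event("e1", "2026-05-17T23:39:21Z", "author_claim", "open"),
    Event("e2", "2026-05-18T08:00:00Z", "author_claim", "claimed"),
    Event("e3", "2026-05-18T08:00:00Z", "citations", "open"),
]

state = merge(events)
```

Because the events are sorted before being applied, two mirrors holding the same set of events in different arrival orders compute the same state.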

Claims

C1 strongest claim

By combining the latent attention layer, removal of the causal attention mask, two-stage contrastive instruction-tuning, and curated datasets including hard negatives and synthetic data, NV-Embed-v1 and NV-Embed-v2 obtain the No.1 position on the MTEB leaderboard across 56 tasks.

C2 weakest assumption

That the reported gains stem primarily from the proposed architectural and procedural changes rather than from larger training compute, model scale, or the specific choice of public datasets alone.

C3 one-line summary

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

References

121 extracted · 121 resolved · 22 Pith anchors

[1] Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain 2019

[2] SemEval-2012 Task 6: A pilot on semantic textual similarity 2012

[6] Language models are few-shot learners 2020

[7] Efficient intent detection with dual sentence encoders 2020

[9] BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation 2023

Formal links

1 machine-checked theorem link

Cited by

36 papers in Pith

R2MED: A Benchmark for Reasoning-Driven Medical Retrieval

ScaleDoc: Scaling LLM-based Predicates over Large Document Collections

Legal Retrieval for Public Defenders

MeMo: Memory as a Model

Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

Receipt and verification

First computed	2026-05-17T23:39:21.658359Z
Builder	pith-number-builder-2026-05-17-v1
Signature	Pith Ed25519 (`pith-v1-2026-05`) · public key
Schema	pith-number/v1.0

Canonical hash

68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582

Aliases

arxiv: 2405.17428 · arxiv_version: 2405.17428v3 · doi: 10.48550/arxiv.2405.17428 · pith_short_12: NDJPKY322GAK · pith_short_16: NDJPKY322GAKH3VG · pith_short_8: NDJPKY32
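The short aliases listed here are prefixes of the full identifier, which can be checked with a few lines of Python. The second check rests on an observation from the values on this page rather than on anything the page states: the 26-character identifier coincides with the start of the RFC 4648 base32 encoding of the canonical hash. The helper names `short_aliases` and `base32_prefix` are made up for this sketch.

```python
import base64

# Values copied from this page.
CANONICAL_HASH = "68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582"
PITH_NUMBER = "NDJPKY322GAKH3VG5AOJTLCDTR"


def short_aliases(pith_number, lengths=(8, 12, 16)):
    """Return the pith_short_N aliases as prefixes of the full identifier."""
    return {f"pith_short_{n}": pith_number[:n] for n in lengths}


def base32_prefix(hex_digest, length=26):
    """Base32-encode a hex digest and keep the first `length` characters."""
    encoded = base64.b32encode(bytes.fromhex(hex_digest)).decode("ascii")
    return encoded[:length]


aliases = short_aliases(PITH_NUMBER)
print(aliases)
# Observed relationship, not a documented rule:
print(base32_prefix(CANONICAL_HASH) == PITH_NUMBER)  # True
```

If the relationship holds in general, an identifier can be sanity-checked against its canonical hash without a network call, but the resolver remains the authoritative source.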

Agent API

Resolver JSON · Graph JSON · Events JSON · Schema · Signing key

Verify this Pith Number yourself

curl -sH 'Accept: application/ld+json' https://pith.science/pith/NDJPKY322GAKH3VG5AOJTLCDTR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582

Canonical record JSON

{
  "metadata": {
    "abstract_canon_sha256": "a7f6c5b6f45a7ac779b7dfe74fc2db9c77e236c79712c3492386cf5ca706a101",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.IR",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-05-27T17:59:45Z",
    "title_canon_sha256": "9cc3182331acfb590da2194415353ff9e9f16aba89d9ab2ec357db5a469d6304"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2405.17428",
    "kind": "arxiv",
    "version": 3
  }
}
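The shell pipeline in the verification section can also be reproduced offline against the record shown here. This sketch applies the same canonicalization as that one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) and compares the digest with the canonical hash listed on this page. It assumes the JSON above is the complete canonical record; the function name `canonical_hash` is chosen for illustration.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "a7f6c5b6f45a7ac779b7dfe74fc2db9c77e236c79712c3492386cf5ca706a101",
    "cross_cats_sorted": ["cs.AI", "cs.IR", "cs.LG"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-05-27T17:59:45Z",
    "title_canon_sha256": "9cc3182331acfb590da2194415353ff9e9f16aba89d9ab2ec357db5a469d6304"
  },
  "schema_version": "1.0",
  "source": {"id": "2405.17428", "kind": "arxiv", "version": 3}
}
"""

EXPECTED = "68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582"


def canonical_hash(record):
    """SHA-256 over the key-sorted, whitespace-free JSON serialization."""
    canonical = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_hash(record)
print(digest)
print("matches expected:", digest == EXPECTED)
```

Sorting keys before hashing makes the digest independent of the key order in which a mirror happens to store the record.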