Pith Number

pith:NDJPKY32

pith:2024:NDJPKY322GAKH3VG5AOJTLCDTR
not attested · not anchored · not stored · refs resolved

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Bryan Catanzaro, Chankyu Lee, Jonathan Raiman, Mengyao Xu, Mohammad Shoeybi, Rajarshi Roy, Wei Ping

Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.

arxiv:2405.17428 v3 · 2024-05-27 · cs.CL · cs.AI · cs.IR · cs.LG

Add to your LaTeX paper
\usepackage{pith}
\pithnumber{NDJPKY322GAKH3VG5AOJTLCDTR}

Prints a linked badge after your title and injects PDF metadata. Compiles on arXiv.

Record completeness

1 Bitcoin timestamp
2 Internet Archive
3 Author claim open
4 Citations open
5 Replications open
Portable graph bundle live
The bundle contains the canonical record plus signed events. A mirror can host it anywhere and recompute the same current state with the deterministic merge algorithm.

Claims

C1 · strongest claim

By combining the latent attention layer, removal of the causal attention mask, two-stage contrastive instruction-tuning, and curated datasets including hard negatives and synthetic data, NV-Embed-v1 and NV-Embed-v2 obtain the No. 1 position on the MTEB leaderboard across 56 tasks.

C2 · weakest assumption

That the reported gains stem primarily from the proposed architectural and procedural changes rather than from larger training compute, model scale, or the specific choice of public datasets alone.

C3 · one-line summary

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

References

121 extracted · 121 resolved · 22 Pith anchors

[1] Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain 2019
[2] SemEval-2012 task 6: A pilot on semantic textual similarity 2012
[6] Language models are few-shot learners 2020
[7] Efficient intent detection with dual sentence encoders 2020
[9] Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation 2023

Formal links

1 machine-checked theorem link

Cited by

36 papers in Pith

Receipt and verification
First computed: 2026-05-17T23:39:21.658359Z
Builder: pith-number-builder-2026-05-17-v1
Signature: Pith Ed25519 (pith-v1-2026-05)
Schema: pith-number/v1.0

Canonical hash

68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582

Aliases

arxiv: 2405.17428 · arxiv_version: 2405.17428v3 · doi: 10.48550/arxiv.2405.17428 · pith_short_12: NDJPKY322GAK · pith_short_16: NDJPKY322GAKH3VG · pith_short_8: NDJPKY32
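The identifiers on this page appear to be related in a simple way: Base32-encoding the raw bytes of the canonical hash and keeping the first 26 characters reproduces the Pith Number shown above, and the pith_short_8, pith_short_12, and pith_short_16 aliases are prefixes of that number. The page does not document this derivation, so treat the sketch below as an observation about this one record rather than a specification; the function names are hypothetical.

```python
import base64

# Canonical hash as published on this page.
CANONICAL_HASH = "68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582"


def pith_number_from_hash(hex_digest: str, length: int = 26) -> str:
    """Base32-encode the digest bytes and keep the first `length` characters.

    Assumption: this mirrors how the number is derived; it is inferred from
    the values on this page, not from any documented rule.
    """
    raw = bytes.fromhex(hex_digest)
    return base64.b32encode(raw).decode("ascii")[:length]


def short_aliases(number: str) -> dict:
    """Build the short aliases as fixed-length prefixes of the full number."""
    return {f"pith_short_{n}": number[:n] for n in (8, 12, 16)}


number = pith_number_from_hash(CANONICAL_HASH)
print(number)  # NDJPKY322GAKH3VG5AOJTLCDTR
print(short_aliases(number))
```

For this record the three prefixes come out as NDJPKY32, NDJPKY322GAK, and NDJPKY322GAKH3VG, matching the alias line above.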
Agent API
Verify this Pith Number yourself
curl -sH 'Accept: application/ld+json' https://pith.science/pith/NDJPKY322GAKH3VG5AOJTLCDTR \
  | jq -c '.canonical_record' \
  | python3 -c "import sys,json,hashlib; b=json.dumps(json.loads(sys.stdin.read()), sort_keys=True, separators=(',',':'), ensure_ascii=False).encode(); print(hashlib.sha256(b).hexdigest())"
# expect: 68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582
Canonical record JSON
{
  "metadata": {
    "abstract_canon_sha256": "a7f6c5b6f45a7ac779b7dfe74fc2db9c77e236c79712c3492386cf5ca706a101",
    "cross_cats_sorted": [
      "cs.AI",
      "cs.IR",
      "cs.LG"
    ],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-05-27T17:59:45Z",
    "title_canon_sha256": "9cc3182331acfb590da2194415353ff9e9f16aba89d9ab2ec357db5a469d6304"
  },
  "schema_version": "1.0",
  "source": {
    "id": "2405.17428",
    "kind": "arxiv",
    "version": 3
  }
}
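The verification command above can also be run offline against the canonical record printed here. The following is a minimal sketch that applies the same serialization as the one-liner (sorted keys, compact separators, no ASCII escaping, UTF-8 bytes) before hashing. Whether the digest equals the published canonical hash depends on the record text matching the server's copy exactly, which has not been checked here; the names `RECORD_JSON` and `canonical_sha256` are illustrative.

```python
import hashlib
import json

# Canonical record copied from this page.
RECORD_JSON = """
{
  "metadata": {
    "abstract_canon_sha256": "a7f6c5b6f45a7ac779b7dfe74fc2db9c77e236c79712c3492386cf5ca706a101",
    "cross_cats_sorted": ["cs.AI", "cs.IR", "cs.LG"],
    "license": "http://creativecommons.org/licenses/by/4.0/",
    "primary_cat": "cs.CL",
    "submitted_at": "2024-05-27T17:59:45Z",
    "title_canon_sha256": "9cc3182331acfb590da2194415353ff9e9f16aba89d9ab2ec357db5a469d6304"
  },
  "schema_version": "1.0",
  "source": {"id": "2405.17428", "kind": "arxiv", "version": 3}
}
"""

EXPECTED = "68d2f5637ad180a3eea6e81c99ac439c63e2cd81aab074ba9ac5be8730b06582"


def canonical_sha256(record: dict) -> str:
    """Serialize with the same options as the shell one-liner, then hash."""
    payload = json.dumps(
        record, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    ).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


record = json.loads(RECORD_JSON)
digest = canonical_sha256(record)
print(digest)
print("matches published hash:", digest == EXPECTED)
```

Because keys are sorted before hashing, the digest does not depend on the key order or whitespace of the input JSON, only on its content.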