{"paper":{"title":"NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models","license":"http://creativecommons.org/licenses/by/4.0/","headline":"Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.","cross_cats":["cs.AI","cs.IR","cs.LG"],"primary_cat":"cs.CL","authors_text":"Bryan Catanzaro, Chankyu Lee, Jonathan Raiman, Mengyao Xu, Mohammad Shoeybi, Rajarshi Roy, Wei Ping","submitted_at":"2024-05-27T17:59:45Z","abstract_excerpt":"Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. 
For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"By combining the latent attention layer, removal of the causal attention mask, two-stage contrastive instruction-tuning, and curated datasets including hard negatives and synthetic data, NV-Embed-v1 and NV-Embed-v2 obtain the No.1 position on the MTEB leaderboard across 56 tasks.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the reported gains stem primarily from the proposed architectural and procedural changes rather than from larger training compute, model scale, or the specific choice of public datasets alone.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"f1ab2dcb0c51c1ec49b3c4689c6ae8304e36e3fb919d3adfc495ab4d976e4231"},"source":{"id":"2405.17428","kind":"arxiv","version":3},"verdict":{"id":"bb8cdfa5-7843-4e4a-b6e4-918cf5084467","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-14T21:10:16.373016Z","strongest_claim":"By combining the latent attention layer, removal 
of the causal attention mask, two-stage contrastive instruction-tuning, and curated datasets including hard negatives and synthetic data, NV-Embed-v1 and NV-Embed-v2 obtain the No.1 position on the MTEB leaderboard across 56 tasks.","one_line_summary":"NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the reported gains stem primarily from the proposed architectural and procedural changes rather than from larger training compute, model scale, or the specific choice of public datasets alone.","pith_extraction_headline":"Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning."},"references":{"count":121,"sample":[{"doi":"","year":2019,"title":"Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain","work_id":"e6b82b89-83d2-4785-afe4-51851d731321","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2012,"title":"SemEval-2012 task 6: A pilot on semantic textual similarity","work_id":"1d6ab256-a6da-4115-a303-7ceb21fa334f","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Language models are few-shot learners","work_id":"b5af3a68-2622-4421-b39b-b1d2fbde2d8d","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2020,"title":"Efficient intent detection with dual sentence encoders","work_id":"d963b119-021b-48ee-9acb-554b5e402977","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 
2023","work_id":"99655c36-3038-4267-abae-eb9bd7978726","ref_index":9,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":121,"snapshot_sha256":"cc1ec95389852ac9f876b6121a7610ff83fcc6a61de4400535acb6d940aeb710","internal_anchors":22},"formal_canon":{"evidence_count":1,"snapshot_sha256":"3cff1c009108c745061a6a45e3c7f15d5b37baa9e2f32f96db14ae0a05958b9f"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}