NV-Retriever: Improving text embedding models with effective hard-negative mining

Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, Even Oldridge · 2024 · arXiv 2407.15831

10 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1 · baseline 1

citation-polarity summary

background 1 · baseline 1

representative citing papers

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · novelty 7.0

LLM-based dense retrievers generalize better when instruction-tuned but pay a specialization tax when optimized for reasoning; they resist typos and corpus poisoning better than encoder-only baselines yet remain vulnerable to semantic perturbations, with larger models and certain embedding geometry,

vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

cs.IR · 2026-04-16 · conditional · novelty 7.0

vstash shows that hybrid retrieval disagreements provide a free training signal to fine-tune 33M-parameter embeddings, yielding NDCG@10 gains up to 19.5% on NFCorpus and matching some larger models on three of five BEIR datasets.

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

cs.IR · 2026-05-08 · unverdicted · novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide distinct behavioral differences among retrievers.

ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval

cs.IR · 2026-04-13 · unverdicted · novelty 6.0

ARHN refines hard-negative training data for dense retrieval by using LLMs to convert answer-containing passages into additional positives and exclude answer-containing passages from the negative set.

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · novelty 6.0

A 300M-parameter open embedding model sets new SOTA on MTEB for its size class and matches models twice as large while staying effective when compressed.

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

cs.CL · 2025-08-03 · unverdicted · novelty 6.0

SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · novelty 6.0

NV-Embed achieves first place on the MTEB leaderboard across 56 tasks by combining a latent attention layer, causal-mask removal, two-stage contrastive training, and data curation for LLM-based embedding models.

PRAGMA: Revolut Foundation Model

cs.LG · 2026-04-09 · unverdicted · novelty 5.0 · 2 refs

PRAGMA pre-trains a Transformer on heterogeneous banking events with a tailored self-supervised masked objective, yielding embeddings that support strong downstream performance on credit scoring, fraud detection, and lifetime value prediction using linear heads or light fine-tuning.

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

cs.IR · 2025-07-11 · unverdicted · novelty 5.0

Language composition in training data creates opposing effects on CLIR and mono-IR performance for Korean-English retrieval, which model merging can partially resolve.

Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com

cs.IR · 2026-05-01 · conditional · novelty 3.0

Structured negative mining with taxonomy and LLM judges improves offline category accuracy by 2.6% in IKEA search but yields no significant online engagement gains due to prevalent zero-click user behavior.

citing papers explorer

Showing 10 of 10 citing papers.

On the Robustness of LLM-Based Dense Retrievers: A Systematic Analysis of Generalizability and Stability

cs.IR · 2026-04-17 · unverdicted · none · ref 51

vstash: Local-First Hybrid Retrieval with Adaptive Fusion for LLM Agents

cs.IR · 2026-04-16 · conditional · none · ref 18

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

cs.IR · 2026-05-08 · unverdicted · none · ref 41

ARHN: Answer-Centric Relabeling of Hard Negatives with Open-Source LLMs for Dense Retrieval

cs.IR · 2026-04-13 · unverdicted · none · ref 18

EmbeddingGemma: Powerful and Lightweight Text Representations

cs.CL · 2025-09-24 · unverdicted · none · ref 3

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

cs.CL · 2025-08-03 · unverdicted · none · ref 8

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

cs.CL · 2024-05-27 · accept · none · ref 48

PRAGMA: Revolut Foundation Model

cs.LG · 2026-04-09 · unverdicted · none · ref 3 · 2 links

Improving Korean-English Cross-Lingual Retrieval: A Data-Centric Study of Language Composition and Model Merging

cs.IR · 2025-07-11 · unverdicted · none · ref 25

Negative Data Mining for Contrastive Learning in Dense Retrieval at IKEA.com

cs.IR · 2026-05-01 · conditional · none · ref 9
