arXiv: 2405.17428 · v3 · submitted 2024-05-27 · cs.CL · cs.AI · cs.IR · cs.LG

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Chankyu Lee , Rajarshi Roy , Mengyao Xu , Jonathan Raiman , Mohammad Shoeybi , Bryan Catanzaro , Wei Ping

Pith reviewed 2026-05-14 21:10 UTC · model grok-4.3

keywords: LLM embeddings, text embeddings, contrastive instruction tuning, MTEB benchmark, latent attention layer, hard negative mining, generalist embedding models, decoder-only models

The pith

Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that decoder-only large language models can be turned into strong generalist embedding models through targeted changes. A latent attention layer replaces standard pooling methods to create better vector representations. Removing the causal attention mask during training and using a two-stage contrastive instruction-tuning process with hard negatives and synthetic data further improve results. These steps together produce models that reach the top rank on the MTEB leaderboard across 56 tasks and perform well on additional out-of-domain retrieval benchmarks.

Core claim

NV-Embed achieves the No.1 position on the MTEB leaderboard across 56 tasks by incorporating a latent attention layer to obtain pooled embeddings, removing the causal attention mask of LLMs during contrastive training, introducing a two-stage contrastive instruction-tuning method that first focuses on retrieval then blends non-retrieval tasks, and utilizing hard-negative mining along with synthetic data generation from public datasets.

What carries the argument

The latent attention layer that produces pooled embeddings from LLMs, which improves accuracy over mean pooling or last-token embeddings when combined with causal mask removal and two-stage contrastive tuning.
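
The pooling step can be sketched in a few lines. The version below is a minimal NumPy illustration, not the authors' implementation: it assumes the decoder's last-layer token states act as queries over a small trainable latent array that serves as both keys and values, followed by a two-layer MLP and a masked mean over tokens. The function names, the scaling by sqrt(d), the single attention head, and every size are placeholders chosen for this example.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def latent_attention_pool(hidden, mask, latents, w1, b1, w2, b2):
    """hidden: (seq_len, d) last-layer token states (queries).
    mask: (seq_len,) with 1 for real tokens and 0 for padding.
    latents: (r, d) trainable array used as keys and values (assumed design)."""
    d = hidden.shape[-1]
    scores = hidden @ latents.T / np.sqrt(d)      # (seq_len, r)
    attended = softmax(scores) @ latents          # (seq_len, d)
    out = gelu(attended @ w1 + b1) @ w2 + b2      # per-token MLP, (seq_len, d)
    m = mask[:, None].astype(out.dtype)
    return (out * m).sum(axis=0) / m.sum()        # masked mean pooling, (d,)

# Hypothetical sizes, random weights standing in for trained parameters.
rng = np.random.default_rng(0)
seq_len, d, r, d_ff = 6, 8, 4, 16
hidden = rng.normal(size=(seq_len, d))
mask = np.array([1, 1, 1, 1, 0, 0])
latents = rng.normal(size=(r, d))
w1, b1 = rng.normal(size=(d, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d)), np.zeros(d)

embedding = latent_attention_pool(hidden, mask, latents, w1, b1, w2, b2)
print(embedding.shape)  # (8,)
```

Mean pooling corresponds to skipping the attention and MLP steps and averaging `hidden` directly, which is the baseline the paper compares against.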

If this is right

The latent attention layer consistently raises retrieval and downstream task accuracy compared with mean pooling or last <EOS> token embeddings.
Removing the causal attention mask during contrastive training improves representation learning for embedding tasks.
Two-stage contrastive instruction-tuning boosts non-retrieval task accuracy while also raising retrieval performance.
Curated hard negatives and synthetic data further increase overall embedding quality.
The resulting models reach the highest Long Doc scores and second-highest QA scores on the AIR Benchmark.
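
To make the mask-removal point concrete, the following sketch compares attention weights with and without a causal mask on a toy sequence. It is an illustration under simple assumptions (single head, self-attention over random vectors, a hypothetical `attention_weights` helper), not code from the paper.

```python
import numpy as np

def attention_weights(q, k, causal):
    """Row i holds the attention distribution of token i over all tokens."""
    n = q.shape[0]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if causal:
        # Block attention to positions that come after the current token.
        future = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))  # 4 tokens, 8 dimensions (toy values)

causal_w = attention_weights(x, x, causal=True)
bidir_w = attention_weights(x, x, causal=False)

print(np.count_nonzero(causal_w[0]))  # 1
print(np.count_nonzero(bidir_w[0]))   # 4
```

With the causal mask the first token can only attend to itself, so its representation carries no information about the rest of the input; without the mask every token sees the full sequence, which is the property the paper relies on for embedding tasks.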

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same set of changes could be applied to other decoder-only LLMs to raise their embedding performance without increasing model size.
Model compression techniques discussed in the paper may allow these high-performing embeddings to run efficiently on limited hardware.
Strong results on out-of-domain benchmarks suggest the approach could support reliable retrieval in real-world settings beyond MTEB tasks.

Load-bearing premise

The reported gains stem primarily from the proposed architectural changes, mask removal, and training stages rather than from larger model scale, extra compute, or dataset selection alone.

What would settle it

A side-by-side retraining of an equivalent LLM using only mean pooling, keeping the causal mask, and single-stage tuning on the same data and compute budget, then checking whether MTEB scores match those of NV-Embed.

Original abstract

Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize the hard-negative mining, synthetic data generation and existing public available datasets to boost the performance of embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks, demonstrating the sustained effectiveness of the proposed methods over time. It also achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB. We further provide the analysis of model compression techniques for generalist embedding models.
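
The abstract mentions contrastive training with in-batch negatives and curated hard negatives. A common way to write such an objective is an InfoNCE-style loss, sketched below in NumPy. This is an assumed formulation for illustration only: the function name, the cosine similarity, and the temperature value are placeholders, and the paper's exact loss and hyperparameters are not given in this summary.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(queries, positives, hard_negatives, temperature=0.05):
    """queries: (B, d), positives: (B, d), hard_negatives: (B, H, d).
    Each query is scored against every positive in the batch (in-batch
    negatives) plus its own H mined hard negatives."""
    q = l2_normalize(queries)
    p = l2_normalize(positives)
    n = l2_normalize(hard_negatives)
    batch = q.shape[0]
    in_batch = q @ p.T                              # (B, B), diagonal = true pairs
    hard = np.einsum("bd,bhd->bh", q, n)            # (B, H)
    logits = np.concatenate([in_batch, hard], axis=1) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative log-likelihood of the matching positive for each query.
    return float(-np.mean(np.diag(log_probs[:, :batch])))

rng = np.random.default_rng(0)
queries = rng.normal(size=(4, 16))

# Positives identical to queries, hard negatives pointing the opposite way.
aligned = contrastive_loss(queries, queries, -queries[:, None, :])
# The reverse: positives point away, hard negatives match the queries.
misaligned = contrastive_loss(queries, -queries, queries[:, None, :])
print(aligned < misaligned)  # True
```

In a two-stage schedule of the kind described, the same loss shape could be reused while the data mixture changes between stages; how the paper configures negatives in each stage is not stated here.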

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NV-Embed tops MTEB with latent attention pooling, causal mask removal, and two-stage tuning, but the gains are not cleanly separated from model scale and data volume.

NV-Embed reaches the top of the MTEB leaderboard by adding a latent attention pooling layer instead of mean pooling or last-token embeddings, dropping the causal mask during contrastive training, and running a two-stage instruction tuning process that starts with retrieval data plus hard negatives then blends in non-retrieval tasks. The paper also generates synthetic data and reports results on the AIR benchmark for out-of-domain retrieval. These changes produce measurable lifts on the 56-task MTEB suite and keep the model simple enough to reproduce. The compression analysis at the end is a practical addition for people who need smaller embeddings.

The results sit on fixed public benchmarks with external hard-negative mining, so there is no obvious circularity. The main limitation is the lack of controlled ablations that hold base model size, total tokens, and dataset volume fixed while toggling only the new pieces. Without those runs it is difficult to say how much of the leaderboard jump comes from the latent attention layer or the two-stage schedule versus simply training a larger model on more data. The methods are still worth testing because they are cheap to implement on top of existing contrastive pipelines.

This paper is aimed at groups that build or fine-tune embedding models for retrieval and RAG. It gives working recipes and leaderboard numbers that justify further experiments. I would send it to peer review because the empirical record is solid even if the exact credit assignment needs tightening.

Referee Report

1 major / 3 minor

Summary. The manuscript presents NV-Embed, techniques to train decoder-only LLMs as generalist embedding models. It introduces a latent attention pooling layer, removal of the causal attention mask during contrastive training, a two-stage contrastive instruction-tuning procedure (retrieval-focused stage with hard negatives followed by blending non-retrieval tasks), and curated datasets using hard-negative mining and synthetic data. These yield NV-Embed-v1 and v2 achieving the No.1 position on the MTEB leaderboard across 56 tasks (as of May 24 and August 30, 2024) plus strong AIR benchmark results and model compression analysis.

Significance. If the performance attribution holds, the work is significant for showing how targeted architectural and procedural changes can make LLM-based embeddings outperform prior BERT/T5 approaches on public benchmarks. The techniques are presented as simple and reproducible, with explicit ablation-friendly design choices and practical compression analysis adding deployment value. Credit is given for the reported leaderboard results and consistent improvements on fixed benchmarks.

major comments (1)

[Results section (MTEB and AIR evaluations)] The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.
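
The controlled comparison requested here amounts to a small factorial design: hold the base model, token budget, and data mixture fixed, and vary only the proposed components. The sketch below enumerates such a grid. All keys and values are hypothetical labels for illustration, not settings taken from the manuscript.

```python
from itertools import product

# Held constant across every run (placeholder labels, not real settings).
FIXED = {
    "base_model": "same-checkpoint",
    "train_tokens": "same-budget",
    "dataset": "same-mixture",
}

# Components toggled one at a time or jointly.
FACTORS = {
    "pooling": ["mean", "eos", "latent_attention"],
    "attention_mask": ["causal", "bidirectional"],
    "schedule": ["single_stage", "two_stage"],
}

def ablation_grid(fixed, factors):
    """Return one run configuration per combination of factor levels."""
    names = list(factors)
    return [
        {**fixed, **dict(zip(names, combo))}
        for combo in product(*(factors[name] for name in names))
    ]

runs = ablation_grid(FIXED, FACTORS)
print(len(runs))  # 12
```

The run with mean pooling, a causal mask, and single-stage tuning is the baseline; comparing it against the other eleven configurations on identical data and compute would separate each component's contribution from scale effects.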

minor comments (3)

[§3.1 (architecture)] The abstract and methods would benefit from an explicit equation or pseudocode defining the latent attention pooling operation and its integration with the decoder layers.
[Training procedure description] Clarify the exact blending ratios and instruction formats used in stage-2 of the contrastive tuning to improve reproducibility.
[Compression experiments] In the model compression analysis, include quantitative tables showing performance drop vs. compression ratio for each technique evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation of minor revision. We address the concern regarding the lack of fully controlled ablations below, agreeing to clarify experimental limitations and strengthen the discussion of attribution in the revised manuscript.

Referee: The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.

Authors: We acknowledge the validity of this point. Our Section 4.3 ablations incrementally add each proposed component (latent attention, mask removal, two-stage tuning) to the same base LLM and report consistent gains on MTEB, but these runs do not enforce identical total training tokens or exact data volume across every variant due to compute limits. All models share the same base scale and use overlapping data sources, with hard-negative mining and synthetic data applied uniformly. We will revise the manuscript to explicitly state this limitation, add a dedicated paragraph on potential confounding factors, and include a table summarizing training token counts per ablation run. We maintain that the techniques drive the gains, as they improve over strong same-scale baselines and align with prior embedding literature, but we agree a more controlled comparison would further strengthen the attribution.

Revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks are independently verifiable

The manuscript describes architectural changes (latent attention pooling, removal of causal mask), a two-stage contrastive training procedure, and data curation steps, then reports performance numbers on fixed public leaderboards (MTEB across 56 tasks, AIR Benchmark). No equations, uniqueness theorems, or first-principles derivations are presented that reduce to quantities fitted inside the same experiment. All claimed gains are measured against external, unchanging test sets using standard metrics; training details reference public datasets and hard-negative mining without any self-referential loop that would make the reported ranking equivalent to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard contrastive-learning assumptions and empirical validation rather than new theoretical derivations; no new physical entities or ungrounded constants are introduced.

free parameters (2)

contrastive temperature and batch size
Typical hyperparameters in contrastive embedding training that are tuned on validation data.
hard-negative selection thresholds
Curated thresholds for mining hard negatives from retrieval datasets.
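
One plausible form of such a threshold is a ceiling relative to the positive passage's score: candidates scoring too close to the positive are dropped as likely false negatives, and the highest-scoring survivors are kept. The sketch below illustrates that idea only. The function name, the `max_ratio` and `top_k` values, and the scores are made up for the example and are not the paper's settings.

```python
import numpy as np

def mine_hard_negatives(scores, positive_idx, max_ratio=0.95, top_k=2):
    """scores: (N,) similarity of one query to N candidate passages,
    as produced by some teacher retriever (assumed to exist upstream).
    Returns indices of up to top_k hard negatives."""
    ceiling = max_ratio * scores[positive_idx]
    ranked = np.argsort(-scores)  # most similar first
    kept = [
        int(i) for i in ranked
        if i != positive_idx and scores[i] < ceiling
    ]
    return kept[:top_k]

# Toy scores: index 0 is the labeled positive, index 1 is suspiciously close.
scores = np.array([0.90, 0.88, 0.70, 0.50, 0.20])
print(mine_hard_negatives(scores, positive_idx=0))  # [2, 3]
```

Lowering `max_ratio` makes the filter stricter, trading harder negatives for a lower risk of training against unlabeled positives.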

axioms (2)

domain assumption: Contrastive objectives with in-batch and hard negatives improve embedding quality
Standard premise in modern embedding model training literature.
domain assumption: Removing the causal mask during contrastive training is beneficial for bidirectional representations
Invoked to justify the architectural change for embedding tasks.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MeMo: Memory as a Model
cs.CL 2026-05 unverdicted novelty 7.0

MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
cs.LG 2026-05 unverdicted novelty 7.0

BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
cs.LG 2026-05 unverdicted novelty 7.0

SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG 2026-05 unverdicted novelty 7.0

A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
cs.LG 2026-05 unverdicted novelty 7.0

Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
cs.AI 2026-05 unverdicted novelty 7.0

SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
cs.CL 2026-05 unverdicted novelty 7.0

TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
cs.CL 2026-05 unverdicted novelty 7.0

BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
cs.CV 2026-04 unverdicted novelty 7.0

mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
Bottleneck Tokens for Unified Multimodal Retrieval
cs.LG 2026-04 unverdicted novelty 7.0

Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD 2025-07 unverdicted novelty 7.0

Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
cs.LG 2026-05 unverdicted novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
Aspect-Aware Content-Based Recommendations for Mathematical Research Papers
cs.IR 2026-05 unverdicted novelty 6.0

The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage acr...
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
cs.CL 2026-05 unverdicted novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
cs.CL 2026-04 unverdicted novelty 6.0

TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
Exploring Audio Hallucination in Egocentric Video Understanding
cs.CV 2026-04 unverdicted novelty 6.0

AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
ViLL-E: Video LLM Embeddings for Retrieval
cs.CV 2026-04 unverdicted novelty 6.0

ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service
cs.CR 2026-04 unverdicted novelty 6.0

GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and cluste...
Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders
cs.IR 2026-04 unverdicted novelty 6.0

New CMedTEB benchmark and CARE asymmetric retriever outperform symmetric models on Chinese medical retrieval tasks while preserving low latency.
Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
cs.IR 2026-04 conditional novelty 6.0

Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering
cs.IR 2026-04 conditional novelty 6.0

BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
cs.CL 2026-04 unverdicted novelty 5.0

DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
cs.CL 2026-04 unverdicted novelty 5.0

AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
cs.CL 2026-01 unverdicted novelty 4.0

Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
cs.CL 2025-06 unverdicted novelty 4.0

Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 24 Pith papers · 17 internal anchors

[1]

Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain

C.J. Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification

work page 2019
[2]

S em E val-2012 task 6: A pilot on semantic textual similarity

Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. S em E val-2012 task 6: A pilot on semantic textual similarity. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret (eds.), * SEM 2012: The First Joint Conference on Lexical and Computational Semantics -- Volume 1: Proceedings of the main conference and the ...

work page 2012
[6]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020

work page 1901
[7]

Efficient intent detection with dual sentence encoders

I \ n igo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets

work page arXiv 2020
[9]

Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

work page 2023
[13]

Sparsegpt: Massive language models can be accurately pruned in one-shot

Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp.\ 10323--10337. PMLR, 2023

work page 2023
[18]

The stanford natural language inference (snli) corpus, 2022

Stanford NLP Group et al. The stanford natural language inference (snli) corpus, 2022

work page 2022
[19]

Retrieval augmented language model pre-training

Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pp.\ 3929--3938. PMLR, 2020

work page 2020
[26]

Linq-embed-mistral: Elevating text retrieval with improved gpt data through task-specific control and quality refinement

Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy yong Sohn, and Chanyeol Choi. Linq-embed-mistral: Elevating text retrieval with improved gpt data through task-specific control and quality refinement. linq ai research blog, 2024. URL https://getlinq.com/blog/linq-embed-mistral/

work page 2024
[27]

Natural questions: a benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 0 453--466, 2019

work page 2019
[28]

Newsweeder: Learning to filter netnews

Ken Lang. Newsweeder: Learning to filter netnews. In Machine learning proceedings 1995, pp.\ 331--339. Elsevier, 1995

work page 1995
[30]

Open source strikes bread - new fluffy embeddings model, 2024 b

Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024 b . URL https://www.mixedbread.ai/blog/mxbai-embed-large-v1

work page 2024
[31]

u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 0 9459--9474, 2020

work page 2020
[32]

Paq: 65 million probably-asked questions and what you can do with them

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich K \"u ttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9: 0 1098--1115, 2021

work page 2021
[33]

Datasets: A community library for natural language processing

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario S a s ko, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugg...

work page 2021
[38]

Chat QA : Surpassing GPT -4 on conversational QA and RAG

Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Mohammad Shoeybi, and Bryan Catanzaro. Chat QA : Surpassing GPT -4 on conversational QA and RAG . arXiv preprint arXiv:2401.10225, 2024

work page arXiv 2024
[39]

Maas, Raymond E

Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

work page 2011
[40]

Tweet sentiment extraction, 2020

Wei Chen Maggie, Phil Culliton. Tweet sentiment extraction, 2020. URL https://kaggle.com/competitions/tweet-sentiment-extraction

work page 2020
[41]

Www'18 open challenge: financial opinion mining and question answering

Macedo Maia, Siegfried Handschuh, Andr \'e Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www'18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018, pp.\ 1941--1942, 2018

work page 2018
[43]

Hidden factors and hidden topics: understanding rating dimensions with review text

Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pp.\ 165--172, 2013 b

work page 2013
[44]

Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024 a

Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024 a . URL https://huggingface.co/Salesforce/SFR-Embedding-2_R

work page 2024
[45]

Sfrembedding-mistral: enhance text retrieval with transfer learning

Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024 b

work page 2024
[46]

Distributed representations of words and phrases and their compositionality

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013

work page 2013
[47]

Mixtral 8x22b

MistralAI. Mixtral 8x22b. URL https://mistral.ai/news/mixtral-8x22b/

work page
[48]

NV-Retriever: Improving text embedding models with effective hard-negative mining

Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. NV-Retriever: Improving text embedding models with effective hard-negative mining . arXiv preprint arXiv:2407.15831, 2024

work page arXiv 2024
[52]

MS MARCO : A human-generated machine reading comprehension dataset

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO : A human-generated machine reading comprehension dataset. 2016

work page 2016
[55]

New embedding models and api updates, 2024

OpenAI. New embedding models and api updates, 2024

work page 2024
[56]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022

work page 2022
[57]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

work page 2020
[59]

Stackexchange (title, body) pairs, 2021 a

Nils Reimers. Stackexchange (title, body) pairs, 2021 a . URL https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl

work page 2021
[60]

Reddit (title, body) pairs, 2021 b

Nils Reimers. Reddit (title, body) pairs, 2021 b . URL https://huggingface.co/datasets/sentence-transformers/reddit-title-body

work page 2021
[62]

The probabilistic relevance framework: Bm25 and beyond

Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval , 3 0 (4): 0 333--389, 2009

work page 2009
[65]

Stack exchange data dump, 2023

Stack-Exchange-Community. Stack exchange data dump, 2023

work page 2023
[69]

An overview of the bioasq large-scale biomedical semantic indexing and question answering competition

George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16: 0 1--28, 2015

[70]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

[71]

voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024

Voyage-AI. voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024

[72]

Retrieval of the best counterargument without prior topic knowledge

Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 241--251, 2018

[74]

Superglue: A stickier benchmark for general-purpose language understanding systems

Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

[78]

Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis

Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ Skerry-Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International conference on machine learning, pp. 5180--5189. PMLR, 2018

[82]

Miracl: A multilingual retrieval dataset covering 18 diverse languages

Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11:1114--1131, 2023

[83]

Stack Exchange Data Dump

[84]

Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement

Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement. Linq AI Research Blog, 2024

[85]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323

[86]

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 18761–18799, Vienna, Austria

A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695

[87]

TWEAC: Transformer with Extendable QA Agent Classifiers

TWEAC: Transformer with extendable QA agent classifiers. arXiv preprint arXiv:2104.07081

[88]

SparseGPT: Massive language models can be accurately pruned in one-shot

SparseGPT: Massive language models can be accurately pruned in one-shot. International Conference on Machine Learning, 2023

[89]

Making Text Embedders Few-Shot Learners

Making Text Embedders Few-Shot Learners. arXiv preprint arXiv:2409.15700

[90]

Label supervised LLaMA finetuning

Label supervised LLaMA finetuning. arXiv preprint arXiv:2310.01208

[91]

Attention is all you need

Attention is all you need. Advances in neural information processing systems

[92]

HoVer: A dataset for many-hop fact extraction and claim verification, 2020

HoVer: A dataset for many-hop fact extraction and claim verification. arXiv preprint arXiv:2011.03088, 2020

[93]

Mr. TyDi: A multi-lingual benchmark for dense retrieval

Mr. TyDi: A multi-lingual benchmark for dense retrieval. arXiv preprint arXiv:2108.08787

[94]

SciFact-open: Towards open-domain scientific claim verification

SciFact-open: Towards open-domain scientific claim verification. arXiv preprint arXiv:2210.13777

[95]

Miracl: A multilingual retrieval dataset covering 18 diverse languages

Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 2023

[96]

Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge

[97]

Superglue: A stickier benchmark for general-purpose language understanding systems

Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems

[98]

FEVER: a large-scale dataset for Fact Extraction and VERification

FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355

[99]

Improving text embeddings with large language models

Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368

[100]

Gecko: Versatile text embeddings distilled from large language models

Gecko: Versatile text embeddings distilled from large language models. arXiv preprint arXiv:2403.20327

[101]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

[102]

SFR-Embedding-Mistral: Enhance text retrieval with transfer learning

SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog

[103]

voyage-large-2-instruct: Instruction-tuned and rank 1 on MTEB

[104]

Generative representational instruction tuning

Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024

[105]

Text and code embeddings by contrastive pre-training

Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005

[106]

Language models are few-shot learners

Language models are few-shot learners. Advances in neural information processing systems

[107]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

[108]

Exploring the limits of transfer learning with a unified text-to-text transformer

Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research

[109]

SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training

SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training. 2024

[110]

Unsupervised Dense Information Retrieval with Contrastive Learning

Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118

[111]

Towards General Text Embeddings with Multi-stage Contrastive Learning

Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281

[112]

Large dual encoders are generalizable retrievers

Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899

[113]

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2023

[114]

Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng

[115]

MTEB: Massive Text Embedding Benchmark

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, et al. MTEB: Massive Text Embedding Benchmark. arXiv preprint arXiv:2210.07316

[116]

Mistral 7B

Mistral 7B. arXiv preprint arXiv:2310.06825

[117]

Gemini: A Family of Highly Capable Multimodal Models

Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

[118]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084

[119]

SimCSE: Simple Contrastive Learning of Sentence Embeddings

SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821

[120]

Distributed representations of words and phrases and their compositionality

Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems

[121]

Retrieval-augmented generation for knowledge-intensive NLP tasks

Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems

[122]

Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Mohammad Shoeybi, and Bryan Catanzaro. Chat

