pith. machine review for the scientific record.

arxiv: 2405.17428 · v3 · submitted 2024-05-27 · 💻 cs.CL · cs.AI · cs.IR · cs.LG

Recognition: 1 theorem link · Lean Theorem

NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-14 21:10 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.IR · cs.LG
keywords LLM embeddings · text embeddings · contrastive instruction tuning · MTEB benchmark · latent attention layer · hard negative mining · generalist embedding models · decoder-only models

The pith

Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that decoder-only large language models can be turned into strong generalist embedding models through targeted changes. A latent attention layer replaces standard pooling methods to create better vector representations. Removing the causal attention mask during training and using a two-stage contrastive instruction-tuning process with hard negatives and synthetic data further improve results. These steps together produce models that reach the top rank on the MTEB leaderboard across 56 tasks and perform well on additional out-of-domain retrieval benchmarks.

Core claim

NV-Embed achieves the No.1 position on the MTEB leaderboard across 56 tasks by incorporating a latent attention layer to obtain pooled embeddings, removing the causal attention mask of LLMs during contrastive training, introducing a two-stage contrastive instruction-tuning method that first focuses on retrieval then blends non-retrieval tasks, and utilizing hard-negative mining along with synthetic data generation from public datasets.

What carries the argument

The latent attention layer that produces pooled embeddings from the LLM, improving accuracy over mean pooling or last-token embeddings when combined with causal mask removal and two-stage contrastive tuning.
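
As a concrete illustration (not the authors' code), here is a minimal PyTorch sketch of latent attention pooling as described: the decoder's last-layer token states attend to a small trainable latent array, pass through an MLP, and are mean-pooled into a single vector. The hidden size, number of latents, head count, and single-module simplification are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class LatentAttentionPooling(nn.Module):
    """Hedged sketch: token hiddens query a trainable latent dictionary,
    then an MLP and mean pooling produce one embedding per input."""

    def __init__(self, hidden_dim: int = 4096, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        # Trainable latent array shared across all inputs (keys/values).
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 4 * hidden_dim),
            nn.GELU(),
            nn.Linear(4 * hidden_dim, hidden_dim),
        )

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from the decoder's last layer.
        batch = hidden_states.size(0)
        kv = self.latents.unsqueeze(0).expand(batch, -1, -1)
        # Token hidden states act as queries against the latent dictionary.
        attended, _ = self.cross_attn(query=hidden_states, key=kv, value=kv)
        attended = self.mlp(attended)
        # Mask out padding positions, then mean-pool over the sequence.
        mask = attention_mask.unsqueeze(-1).to(attended.dtype)
        pooled = (attended * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
        return pooled  # (batch, hidden_dim) embedding before normalization
```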

If this is right

  • The latent attention layer consistently raises retrieval and downstream task accuracy compared with mean pooling or last <EOS> token embeddings.
  • Removing the causal attention mask during contrastive training improves representation learning for embedding tasks.
  • Two-stage contrastive instruction-tuning boosts non-retrieval task accuracy while also raising retrieval performance (a loss sketch follows this list).
  • Curated hard negatives and synthetic data further increase overall embedding quality.
  • The resulting models reach the highest Long Doc scores and second-highest QA scores on the AIR Benchmark.
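
Below is a minimal sketch of the contrastive objective these bullets describe: an InfoNCE loss over instruction-prefixed queries with in-batch positives and curated hard negatives. The temperature, the number of hard negatives, and the stage-2 blending noted in the comments are assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F


def contrastive_loss(query_emb, pos_emb, hard_neg_emb, temperature=0.02):
    """InfoNCE over in-batch positives plus mined hard negatives.

    query_emb:    (B, D) embeddings of instruction-prefixed queries
    pos_emb:      (B, D) embeddings of positive passages
    hard_neg_emb: (B, K, D) embeddings of curated hard negatives
    """
    q = F.normalize(query_emb, dim=-1)
    p = F.normalize(pos_emb, dim=-1)
    n = F.normalize(hard_neg_emb, dim=-1)

    in_batch_scores = q @ p.T                          # (B, B); diagonal holds the positives
    hard_scores = torch.einsum("bd,bkd->bk", q, n)     # (B, K): query vs. its own hard negatives
    logits = torch.cat([in_batch_scores, hard_scores], dim=1) / temperature
    labels = torch.arange(q.size(0), device=q.device)  # positive index = diagonal position
    return F.cross_entropy(logits, labels)


# Stage 1 (assumed): retrieval-only batches optimized with the loss above.
# Stage 2 (assumed): non-retrieval instruction data is blended into the same
# objective, which the paper reports lifts both task families.
```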

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same set of changes could be applied to other decoder-only LLMs to raise their embedding performance without increasing model size.
  • Model compression techniques discussed in the paper may allow these high-performing embeddings to run efficiently on limited hardware.
  • Strong results on out-of-domain benchmarks suggest the approach could support reliable retrieval in real-world settings beyond MTEB tasks.

Load-bearing premise

The reported gains stem primarily from the proposed architectural changes, mask removal, and training stages rather than from larger model scale, extra compute, or dataset selection alone.

What would settle it

A side-by-side retraining of an equivalent LLM using only mean pooling, keeping the causal mask, and single-stage tuning on the same data and compute budget, then checking whether MTEB scores match those of NV-Embed.
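
A hedged sketch of what such a controlled grid could look like, holding the base model, data, and training budget fixed while toggling only the proposed components; every name below is an illustrative placeholder, not a configuration from the paper.

```python
# Controlled ablation sketch: vary only pooling, attention mask, and schedule.
from itertools import product

POOLING = ["mean", "last_token", "latent_attention"]
ATTENTION_MASK = ["causal", "bidirectional"]
SCHEDULE = ["single_stage", "two_stage"]

ablation_runs = [
    {
        "pooling": pool,
        "attention_mask": mask,
        "schedule": schedule,
        "base_model": "same_decoder_llm",  # identical scale for every run
        "train_tokens": "fixed_budget",    # identical token budget
        "data_mix": "fixed_recipe",        # identical data, negatives, synthetic mix
    }
    for pool, mask, schedule in product(POOLING, ATTENTION_MASK, SCHEDULE)
]

for run in ablation_runs:
    # Each configuration would be trained once and scored on the 56 MTEB tasks;
    # matching scores would undercut the attribution, diverging scores support it.
    print(run)
```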

read the original abstract

Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval datasets into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize the hard-negative mining, synthetic data generation and existing publicly available datasets to boost the performance of embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks, demonstrating the sustained effectiveness of the proposed methods over time. It also achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB. We further provide the analysis of model compression techniques for generalist embedding models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents NV-Embed, a set of techniques for training decoder-only LLMs as generalist embedding models. It introduces a latent attention pooling layer, removal of the causal attention mask during contrastive training, a two-stage contrastive instruction-tuning procedure (a retrieval-focused stage with hard negatives followed by blending in non-retrieval tasks), and curated datasets built with hard-negative mining and synthetic data. These yield NV-Embed-v1 and -v2, which achieve the No. 1 position on the MTEB leaderboard across 56 tasks (as of May 24 and August 30, 2024, respectively), along with strong AIR Benchmark results and a model compression analysis.

Significance. If the performance attribution holds, the work is significant for showing how targeted architectural and procedural changes can make LLM-based embeddings outperform prior BERT/T5 approaches on public benchmarks. The techniques are presented as simple and reproducible, with explicit ablation-friendly design choices and practical compression analysis adding deployment value. Credit is given for the reported leaderboard results and consistent improvements on fixed benchmarks.

major comments (1)
  1. [Results section (MTEB and AIR evaluations)] The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.
minor comments (3)
  1. [§3.1 (architecture)] The abstract and methods would benefit from an explicit equation or pseudocode defining the latent attention pooling operation and its integration with the decoder layers.
  2. [Training procedure description] Clarify the exact blending ratios and instruction formats used in stage-2 of the contrastive tuning to improve reproducibility.
  3. [Compression experiments] In the model compression analysis, include quantitative tables showing performance drop vs. compression ratio for each technique evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation of minor revision. We address the concern regarding the lack of fully controlled ablations below, agreeing to clarify experimental limitations and strengthen the discussion of attribution in the revised manuscript.

read point-by-point responses
  1. Referee: The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.

    Authors: We acknowledge the validity of this point. Our Section 4.3 ablations incrementally add each proposed component (latent attention, mask removal, two-stage tuning) to the same base LLM and report consistent gains on MTEB, but these runs do not enforce identical total training tokens or exact data volume across every variant due to compute limits. All models share the same base scale and use overlapping data sources, with hard-negative mining and synthetic data applied uniformly. We will revise the manuscript to explicitly state this limitation, add a dedicated paragraph on potential confounding factors, and include a table summarizing training token counts per ablation run. We maintain that the techniques drive the gains, as they improve over strong same-scale baselines and align with prior embedding literature, but we agree a more controlled comparison would further strengthen the attribution. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks are independently verifiable

full rationale

The manuscript describes architectural changes (latent attention pooling, removal of causal mask), a two-stage contrastive training procedure, and data curation steps, then reports performance numbers on fixed public leaderboards (MTEB across 56 tasks, AIR Benchmark). No equations, uniqueness theorems, or first-principles derivations are presented that reduce to quantities fitted inside the same experiment. All claimed gains are measured against external, unchanging test sets using standard metrics; training details reference public datasets and hard-negative mining without any self-referential loop that would make the reported ranking equivalent to its own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The work rests on standard contrastive-learning assumptions and empirical validation rather than new theoretical derivations; no new physical entities or ungrounded constants are introduced.

free parameters (2)
  • contrastive temperature and batch size
    Typical hyperparameters in contrastive embedding training that are tuned on validation data.
  • hard-negative selection thresholds
    Curated thresholds for mining hard negatives from retrieval datasets (a generic mining sketch follows this ledger).
axioms (2)
  • domain assumption Contrastive objectives with in-batch and hard negatives improve embedding quality
    Standard premise in modern embedding model training literature.
  • domain assumption Removing causal mask during contrastive training is beneficial for bidirectional representations
    Invoked to justify the architectural change for embedding tasks.
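
To make the hard-negative threshold parameter concrete, here is a generic positive-aware mining sketch: candidates are scored against the query, those scoring at or above a fraction of the positive's score are dropped as likely false negatives, and the top-k survivors become hard negatives. The 0.95 ceiling and the top-k value are illustrative free parameters, not the paper's values.

```python
import torch
import torch.nn.functional as F


def mine_hard_negatives(query_emb, pos_emb, candidate_embs, k=4, ceiling=0.95):
    """Generic positive-aware hard-negative mining sketch (illustrative only)."""
    q = F.normalize(query_emb, dim=-1)        # (D,)
    p = F.normalize(pos_emb, dim=-1)          # (D,)
    c = F.normalize(candidate_embs, dim=-1)   # (N, D)

    pos_score = (q * p).sum()                 # similarity of the true positive
    cand_scores = c @ q                       # (N,) candidate similarities

    keep = cand_scores < ceiling * pos_score  # filter suspected false negatives
    filtered = torch.where(keep, cand_scores, torch.full_like(cand_scores, float("-inf")))
    k = min(k, int(keep.sum().item()))
    return torch.topk(filtered, k=k).indices  # indices of mined hard negatives
```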

pith-pipeline@v0.9.0 · 5648 in / 1344 out tokens · 51567 ms · 2026-05-14T21:10:16.373016+00:00 · methodology

discussion (0)


Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MeMo: Memory as a Model

    cs.CL 2026-05 unverdicted novelty 7.0

    MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...

  2. Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment

    cs.LG 2026-05 unverdicted novelty 7.0

    BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.

  3. SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...

  4. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.

  5. Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...

  6. SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.

  7. TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding

    cs.CL 2026-05 unverdicted novelty 7.0

    TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.

  8. Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems

    cs.CL 2026-05 unverdicted novelty 7.0

    BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.

  9. mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.

  10. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  11. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    cs.SD 2025-07 unverdicted novelty 7.0

    Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

  12. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  13. Aspect-Aware Content-Based Recommendations for Mathematical Research Papers

    cs.IR 2026-05 unverdicted novelty 6.0

    The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage acr...

  14. Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

    cs.CL 2026-05 unverdicted novelty 6.0

    Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

  15. Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding

    cs.CL 2026-04 unverdicted novelty 6.0

    TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.

  16. Exploring Audio Hallucination in Egocentric Video Understanding

    cs.CV 2026-04 unverdicted novelty 6.0

    AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.

  17. ViLL-E: Video LLM Embeddings for Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.

  18. Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service

    cs.CR 2026-04 unverdicted novelty 6.0

    GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and cluste...

  19. Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders

    cs.IR 2026-04 unverdicted novelty 6.0

    New CMedTEB benchmark and CARE asymmetric retriever outperform symmetric models on Chinese medical retrieval tasks while preserving low latency.

  20. Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA

    cs.IR 2026-04 conditional novelty 6.0

    Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.

  21. BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering

    cs.IR 2026-04 conditional novelty 6.0

    BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.

  22. DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining

    cs.CL 2026-04 unverdicted novelty 5.0

    DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.

  23. AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce

    cs.CL 2026-04 unverdicted novelty 5.0

    AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.

  24. Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking

    cs.CL 2026-01 unverdicted novelty 4.0

    Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.

  25. Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models

    cs.CL 2025-06 unverdicted novelty 4.0

    Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...

Reference graph

Works this paper leans on

121 extracted references · 121 canonical work pages · cited by 24 Pith papers · 16 internal anchors

  1. [1]

    Jigsaw unintended bias in toxicity classification, 2019

    C.J. Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification

  2. [2]

    SemEval-2012 task 6: A pilot on semantic textual similarity

    Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. SemEval-2012 task 6: A pilot on semantic textual similarity. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret (eds.), *SEM 2012: The First Joint Conference on Lexical and Computational Semantics -- Volume 1: Proceedings of the main conference and the ...

  3. [6]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 1877--1901, 2020

  4. [7]

    Efficient intent detection with dual sentence encoders

    Iñigo Casanueva, Tadas Temčinas, Daniela Gerz, Matthew Henderson, and Ivan Vulić. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets

  5. [9]

    BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

  6. [13]

    SparseGPT: Massive language models can be accurately pruned in one-shot

    Elias Frantar and Dan Alistarh. SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp. 10323--10337. PMLR, 2023

  7. [18]

    The stanford natural language inference (snli) corpus, 2022

    Stanford NLP Group et al. The stanford natural language inference (snli) corpus, 2022

  8. [19]

    Retrieval augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pp. 3929--3938. PMLR, 2020

  9. [26]

    Linq-embed-mistral: Elevating text retrieval with improved gpt data through task-specific control and quality refinement

    Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy-yong Sohn, and Chanyeol Choi. Linq-Embed-Mistral: Elevating text retrieval with improved GPT data through task-specific control and quality refinement. Linq AI Research Blog, 2024. URL https://getlinq.com/blog/linq-embed-mistral/

  10. [27]

    Natural questions: a benchmark for question answering research

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 453--466, 2019

  11. [28]

    Newsweeder: Learning to filter netnews

    Ken Lang. Newsweeder: Learning to filter netnews. In Machine learning proceedings 1995, pp. 331--339. Elsevier, 1995

  12. [30]

    Open source strikes bread - new fluffy embeddings model, 2024b

    Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024b. URL https://www.mixedbread.ai/blog/mxbai-embed-large-v1

  13. [31]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 9459--9474, 2020

  14. [32]

    Paq: 65 million probably-asked questions and what you can do with them

    Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9: 1098--1115, 2021

  15. [33]

    Datasets: A community library for natural language processing

    Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugg...

  16. [38]

    ChatQA: Surpassing GPT-4 on conversational QA and RAG

    Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Mohammad Shoeybi, and Bryan Catanzaro. ChatQA: Surpassing GPT-4 on conversational QA and RAG. arXiv preprint arXiv:2401.10225, 2024

  17. [39]

    Learning word vectors for sentiment analysis

    Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp. 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...

  18. [40]

    Tweet sentiment extraction, 2020

    Wei Chen Maggie, Phil Culliton. Tweet sentiment extraction, 2020. URL https://kaggle.com/competitions/tweet-sentiment-extraction

  19. [41]

    Www'18 open challenge: financial opinion mining and question answering

    Macedo Maia, Siegfried Handschuh, André Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www'18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018, pp. 1941--1942, 2018

  20. [43]

    Hidden factors and hidden topics: understanding rating dimensions with review text

    Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pp. 165--172, 2013b

  21. [44]

    SFR-Embedding-2: Advanced text embedding with multi-stage training, 2024a

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-2: Advanced text embedding with multi-stage training, 2024a. URL https://huggingface.co/Salesforce/SFR-Embedding-2_R

  22. [45]

    SFR-Embedding-Mistral: Enhance text retrieval with transfer learning

    Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024b

  23. [46]

    Distributed representations of words and phrases and their compositionality

    Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013

  24. [47]

    Mixtral 8x22b

    MistralAI. Mixtral 8x22b. URL https://mistral.ai/news/mixtral-8x22b/

  25. [48]

    NV-Retriever: Improving text embedding models with effective hard-negative mining

    Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. NV-Retriever: Improving text embedding models with effective hard-negative mining. arXiv preprint arXiv:2407.15831, 2024

  26. [52]

    MS MARCO: A human-generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human-generated machine reading comprehension dataset. 2016

  27. [55]

    New embedding models and api updates, 2024

    OpenAI. New embedding models and api updates, 2024

  28. [56]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022

  29. [57]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67, 2020

  30. [59]

    Stackexchange (title, body) pairs, 2021a

    Nils Reimers. Stackexchange (title, body) pairs, 2021a. URL https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl

  31. [60]

    Reddit (title, body) pairs, 2021b

    Nils Reimers. Reddit (title, body) pairs, 2021b. URL https://huggingface.co/datasets/sentence-transformers/reddit-title-body

  32. [62]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval, 3(4): 333--389, 2009

  33. [65]

    Stack exchange data dump, 2023

    Stack-Exchange-Community. Stack exchange data dump, 2023

  34. [69]

    An overview of the bioasq large-scale biomedical semantic indexing and question answering competition

    George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16: 1--28, 2015

  35. [70]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017

  36. [71]

    voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024

    Voyage-AI. voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024

  37. [72]

    Retrieval of the best counterargument without prior topic knowledge

    Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 241--251, 2018

  38. [74]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

  39. [78]

    Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis

    Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International conference on machine learning, pp. 5180--5189. PMLR, 2018

  40. [82]

    Miracl: A multilingual retrieval dataset covering 18 diverse languages

    Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11: 1114--1131, 2023

  41. [83]

    Stack Exchange Data Dump

  42. [84]

    Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement

    Linq-Embed-Mistral: Elevating text retrieval with improved GPT data through task-specific control and quality refinement. Linq AI Research Blog, 2024

  43. [85]

    GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

    GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323

  44. [86]

    A simple and effective pruning approach for large language models

    A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695, 2023

  45. [87]

    TWEAC: Transformer with extendable QA agent classifiers

    TWEAC: transformer with extendable QA agent classifiers. arXiv preprint arXiv:2104.07081

  46. [88]

    SparseGPT: Massive language models can be accurately pruned in one-shot

    SparseGPT: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, 2023

  47. [89]

    Making Text Embedders Few-Shot Learners

    Making text embedders few-shot learners. arXiv preprint arXiv:2409.15700

  48. [90]

    Label supervised LLaMA finetuning

    Label supervised llama finetuning. arXiv preprint arXiv:2310.01208

  49. [91]

    Attention is all you need

    Attention is all you need. Advances in neural information processing systems, 30, 2017

  50. [92]

    HoVer: A dataset for many-hop fact extraction and claim verification

    HoVer: A dataset for many-hop fact extraction and claim verification. arXiv preprint arXiv:2011.03088, 2020

  51. [93]

    Mr. TyDi: A multi-lingual benchmark for dense retrieval

    Mr. TyDi: A multi-lingual benchmark for dense retrieval. arXiv preprint arXiv:2108.08787

  52. [94]

    SciFact-open: Towards open-domain scientific claim verification

    SciFact-open: Towards open-domain scientific claim verification. arXiv preprint arXiv:2210.13777

  53. [95]

    Miracl: A multilingual retrieval dataset covering 18 diverse languages

    Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11, 2023

  54. [96]

    Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge

  55. [97]

    SuperGLUE: A stickier benchmark for general-purpose language understanding systems

    Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019

  56. [98]

    FEVER: a large-scale dataset for Fact Extraction and VERification

    FEVER: a large-scale dataset for fact extraction and VERification. arXiv preprint arXiv:1803.05355

  57. [99]

    Improving text embeddings with large language models

    Improving text embeddings with large language models. arXiv preprint arXiv:2401.00368

  58. [100]

    Gecko: Versatile text embeddings distilled from large language models

    Gecko: Versatile text embeddings distilled from large language models. arXiv preprint arXiv:2403.20327

  59. [101]

    Text Embeddings by Weakly-Supervised Contrastive Pre-training

    Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533

  60. [102]

    SFR-Embedding-Mistral: Enhance text retrieval with transfer learning

    SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog

  61. [103]

    voyage-large-2-instruct: Instruction-tuned and rank 1 on MTEB

  62. [104]

    Generative representational instruction tuning

    Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024

  63. [105]

    Text and code embeddings by contrastive pre-training

    Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005

  64. [106]

    Language models are few-shot learners

    Language models are few-shot learners. Advances in neural information processing systems, 33, 2020

  65. [107]

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

    BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805

  66. [108]

    Exploring the limits of transfer learning with a unified text-to-text transformer

    Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140): 1--67, 2020

  67. [109]

    SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training

    SFR-Embedding-2: Advanced text embedding with multi-stage training, 2024

  68. [110]

    Unsupervised Dense Information Retrieval with Contrastive Learning

    Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118

  69. [111]

    Towards General Text Embeddings with Multi-stage Contrastive Learning

    Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281

  70. [112]

    Large dual encoders are generalizable retrievers

    Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899

  71. [113]

    BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation

    BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023

  72. [114]

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng

  73. [115]

    MTEB: Massive Text Embedding Benchmark

    Niklas Muennighoff, Nouamane Tazi, Loïc Magne, et al. MTEB: Massive text embedding benchmark. arXiv preprint arXiv:2210.07316

  74. [116]

    Mistral 7B

    Mistral 7B. arXiv preprint arXiv:2310.06825

  75. [117]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805

  76. [118]

    Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

    Sentence-BERT: Sentence embeddings using siamese BERT-networks. arXiv preprint arXiv:1908.10084

  77. [119]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

    SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821

  78. [120]

    Distributed representations of words and phrases and their compositionality

    Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013

  79. [121]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33, 2020

  80. [122]

    Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Mohammad Shoeybi, and Bryan Catanzaro. Chat...

Showing first 80 references.