NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Pith reviewed 2026-05-14 21:10 UTC · model grok-4.3
The pith
Decoder-only LLMs outperform BERT and T5 embedding models on general tasks by using a latent attention layer, removing causal masks, and applying two-stage contrastive instruction tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
NV-Embed-v1 and NV-Embed-v2 reached the No. 1 position on the MTEB leaderboard across 56 tasks (as of May 24 and August 30, 2024, respectively) by combining four ingredients: a latent attention layer that produces pooled embeddings; removal of the LLM's causal attention mask during contrastive training; a two-stage contrastive instruction-tuning method that first focuses on retrieval and then blends in non-retrieval tasks; and hard-negative mining together with synthetic data generation and existing public datasets.
What carries the argument
A latent attention layer that produces pooled embeddings from the LLM. According to the paper, it improves accuracy over mean pooling or last-token embeddings, and it is used together with causal mask removal and two-stage contrastive tuning.
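To make the pooling comparison concrete, here is a minimal pure-Python sketch of the three pooling options named above. It assumes a simplified, single-head form of latent attention in which each token vector acts as a query over a small latent array (keys and values), followed by mean pooling. The function names, the random toy inputs, and the omission of any further components (multiple heads, a feed-forward block, trained weights) are illustrative assumptions, not the paper's implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(hidden):
    """Baseline: average the token vectors dimension by dimension."""
    n = len(hidden)
    return [sum(col) / n for col in zip(*hidden)]


def last_token_pool(hidden):
    """Baseline: use the final (<EOS>) token vector as the embedding."""
    return list(hidden[-1])


def latent_attention_pool(hidden, latents):
    """Simplified latent attention (an assumption, single head, no MLP):
    each token vector attends over the latent array (keys = values),
    and the attended outputs are mean-pooled into one embedding."""
    d = len(latents[0])
    attended = []
    for q in hidden:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in latents]
        weights = softmax(scores)
        out = [sum(w * v[j] for w, v in zip(weights, latents)) for j in range(d)]
        attended.append(out)
    return mean_pool(attended)


# Toy inputs (hypothetical): 5 token vectors of dimension 4, 3 latent vectors.
rng = random.Random(0)
hidden = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
latents = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(3)]
embedding = latent_attention_pool(hidden, latents)
```

In this simplified form the pooled vector is always a weighted mixture of the latent vectors. In a trained model the latent array would be learned jointly with the rest of the network.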
If this is right
- The latent attention layer consistently raises retrieval and downstream task accuracy compared with mean pooling or last <EOS> token embeddings.
- Removing the causal attention mask during contrastive training improves representation learning for embedding tasks.
- Two-stage contrastive instruction-tuning boosts non-retrieval task accuracy while also raising retrieval performance.
- Curated hard negatives and synthetic data further increase overall embedding quality.
- The resulting models reach the highest Long Doc scores and second-highest QA scores on the AIR Benchmark.
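The mask-removal point in the list above can be illustrated with a small sketch. It only builds the two attention masks and counts how many positions each token can see. The helper names are hypothetical, and nothing here reflects how the paper's training code toggles the mask.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (only j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """With the causal mask removed, every token may attend to every token."""
    return [[True] * n for _ in range(n)]


def visible_counts(mask):
    """Number of positions each token can attend to."""
    return [sum(row) for row in mask]


print(visible_counts(causal_mask(4)))         # [1, 2, 3, 4]
print(visible_counts(bidirectional_mask(4)))  # [4, 4, 4, 4]
```

Under the causal mask, early tokens see almost none of the sequence, which matters for mean-style pooling because every token's vector contributes to the final embedding.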
Where Pith is reading between the lines
- The same set of changes could be applied to other decoder-only LLMs to raise their embedding performance without increasing model size.
- Model compression techniques discussed in the paper may allow these high-performing embeddings to run efficiently on limited hardware.
- Strong results on out-of-domain benchmarks suggest the approach could support reliable retrieval in real-world settings beyond MTEB tasks.
Load-bearing premise
The reported gains stem primarily from the proposed architectural changes, mask removal, and training stages rather than from larger model scale, extra compute, or dataset selection alone.
What would settle it
A side-by-side retraining of an equivalent LLM using only mean pooling, keeping the causal mask, and single-stage tuning on the same data and compute budget, then checking whether MTEB scores match those of NV-Embed.
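One way to picture such a controlled comparison is as a full factorial grid over the three design choices, with the base model, token budget, and dataset held fixed. The sketch below only enumerates run configurations. Every name and value in it (`ablation_grid`, `"base-llm"`, the token count, the dataset label) is a hypothetical placeholder and not something taken from the paper.

```python
from itertools import product

# The three toggled design choices; the baseline option is listed first.
POOLING = ("mean", "latent_attention")
MASK = ("causal", "bidirectional")
STAGES = ("single_stage", "two_stage")


def ablation_grid(base_model, train_tokens, dataset):
    """Enumerate all 2 x 2 x 2 runs that share model, budget, and data."""
    runs = []
    for pooling, mask, stages in product(POOLING, MASK, STAGES):
        runs.append({
            "base_model": base_model,      # held fixed across runs
            "train_tokens": train_tokens,  # held fixed across runs
            "dataset": dataset,            # held fixed across runs
            "pooling": pooling,
            "mask": mask,
            "stages": stages,
        })
    return runs


# Placeholder values, for illustration only.
runs = ablation_grid("base-llm", 1_000_000, "same-data-mix")
for run in runs:
    print(run["pooling"], run["mask"], run["stages"])
```

The first entry is the all-baseline run described above (mean pooling, causal mask, single-stage tuning) and the last is the full configuration. The six runs in between isolate each component's contribution.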
Original abstract
Decoder-only LLM-based embedding models are beginning to outperform BERT or T5-based embedding models in general-purpose text embedding tasks, including dense vector-based retrieval. In this work, we introduce NV-Embed, incorporating architectural designs, training procedures, and curated datasets to significantly enhance the performance of LLM as a versatile embedding model, while maintaining its simplicity and reproducibility. For model architecture, we propose a latent attention layer to obtain pooled embeddings, which consistently improves retrieval and downstream task accuracy compared to mean pooling or using the last <EOS> token embedding from LLMs. To enhance representation learning, we remove the causal attention mask of LLMs during contrastive training. For training algorithm, we introduce a two-stage contrastive instruction-tuning method. It first applies contrastive training with instructions on retrieval datasets, utilizing in-batch negatives and curated hard negative examples. At stage-2, it blends various non-retrieval into instruction tuning, which not only enhances non-retrieval task accuracy but also improves retrieval performance. For training data, we utilize the hard-negative mining, synthetic data generation and existing public available datasets to boost the performance of embedding model. By combining these techniques, our NV-Embed-v1 and NV-Embed-v2 models obtained the No.1 position on the MTEB leaderboard (as of May 24 and August 30, 2024, respectively) across 56 tasks, demonstrating the sustained effectiveness of the proposed methods over time. It also achieved the highest scores in the Long Doc section and the second-highest scores in the QA section of the AIR Benchmark, which covers a range of out-of-domain information retrieval topics beyond those in MTEB. We further provide the analysis of model compression techniques for generalist embedding models.
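The abstract's contrastive training with in-batch negatives and curated hard negatives can be sketched as an InfoNCE-style loss. This is a generic formulation under stated assumptions: embeddings are unit-normalized, similarity is the dot product, and the temperature of 0.05 is an illustrative default. The abstract specifies neither the exact loss nor its hyperparameters, and the function name is hypothetical.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(queries, positives, hard_negatives, temperature=0.05):
    """InfoNCE-style loss (generic sketch, not the paper's exact objective).

    For query i the positive is positives[i]; the negatives are the other
    positives in the batch (in-batch negatives) plus hard_negatives[i].
    """
    total = 0.0
    for i, q in enumerate(queries):
        candidates = positives + hard_negatives[i]
        logits = [dot(q, c) / temperature for c in candidates]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the positive
    return total / len(queries)


# Toy unit vectors (hypothetical data).
queries = [[1.0, 0.0], [0.0, 1.0]]
hard_negs = [[[0.6, 0.8]], [[0.8, 0.6]]]
aligned = contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]], hard_negs)
misaligned = contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]], hard_negs)
```

When each query matches its own positive, the loss is close to zero. Swapping the positives makes it large, which is the signal that pulls matching pairs together and pushes negatives apart.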
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents NV-Embed, a set of techniques for training decoder-only LLMs as generalist embedding models. It introduces a latent attention pooling layer, removal of the causal attention mask during contrastive training, a two-stage contrastive instruction-tuning procedure (a retrieval-focused stage with hard negatives, followed by a stage that blends in non-retrieval tasks), and curated datasets built with hard-negative mining and synthetic data. With these, NV-Embed-v1 and NV-Embed-v2 reached the No. 1 position on the MTEB leaderboard across 56 tasks (as of May 24 and August 30, 2024, respectively). The manuscript also reports strong AIR Benchmark results and an analysis of model compression.
Significance. If the performance attribution holds, the work is significant for showing how targeted architectural and procedural changes can make LLM-based embeddings outperform prior BERT/T5 approaches on public benchmarks. The techniques are presented as simple and reproducible, with explicit ablation-friendly design choices and practical compression analysis adding deployment value. Credit is given for the reported leaderboard results and consistent improvements on fixed benchmarks.
major comments (1)
- [Results section (MTEB and AIR evaluations)] The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.
minor comments (3)
- [§3.1 (architecture)] The abstract and methods would benefit from an explicit equation or pseudocode defining the latent attention pooling operation and its integration with the decoder layers.
- [Training procedure description] Clarify the exact blending ratios and instruction formats used in stage-2 of the contrastive tuning to improve reproducibility.
- [Compression experiments] In the model compression analysis, include quantitative tables showing performance drop vs. compression ratio for each technique evaluated.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation of minor revision. We address the concern regarding the lack of fully controlled ablations below, agreeing to clarify experimental limitations and strengthen the discussion of attribution in the revised manuscript.
Point-by-point responses
-
Referee: The central claim attributes the #1 MTEB ranking to the combination of latent attention, causal mask removal, two-stage tuning, and curated data. However, the manuscript lacks controlled ablations that fix base LLM scale, total training tokens, and data volume while toggling only the proposed components (e.g., mean pooling + causal mask vs. latent attention + no mask on identical runs). This leaves open whether gains arise primarily from the new techniques or from scale/compute/dataset volume differences, as noted in the stress-test concern.
Authors: We acknowledge the validity of this point. Our Section 4.3 ablations incrementally add each proposed component (latent attention, mask removal, two-stage tuning) to the same base LLM and report consistent gains on MTEB, but these runs do not enforce identical total training tokens or exact data volume across every variant due to compute limits. All models share the same base scale and use overlapping data sources, with hard-negative mining and synthetic data applied uniformly. We will revise the manuscript to explicitly state this limitation, add a dedicated paragraph on potential confounding factors, and include a table summarizing training token counts per ablation run. We maintain that the techniques drive the gains, as they improve over strong same-scale baselines and align with prior embedding literature, but we agree a more controlled comparison would further strengthen the attribution.
Revision: partial
Circularity Check
No circularity: empirical results on external benchmarks are independently verifiable
Full rationale
The manuscript describes architectural changes (latent attention pooling, removal of causal mask), a two-stage contrastive training procedure, and data curation steps, then reports performance numbers on fixed public leaderboards (MTEB across 56 tasks, AIR Benchmark). No equations, uniqueness theorems, or first-principles derivations are presented that reduce to quantities fitted inside the same experiment. All claimed gains are measured against external, unchanging test sets using standard metrics; training details reference public datasets and hard-negative mining without any self-referential loop that would make the reported ranking equivalent to its own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (2)
- contrastive temperature and batch size
- hard-negative selection thresholds
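To show what a hard-negative selection threshold might look like in practice, here is a small sketch of one common style of rule: discard candidates that score too close to the positive (likely false negatives), then keep the highest-scoring remainder. The ledger above does not say which rule the paper uses, so the function name, `max_ratio=0.95`, and `top_k=2` are illustrative assumptions.

```python
def select_hard_negatives(positive_score, candidates, max_ratio=0.95, top_k=2):
    """Pick hard negatives for one query (hypothetical rule, not the paper's).

    candidates: list of (doc_id, similarity_score) pairs.
    Candidates scoring above max_ratio * positive_score are dropped as
    likely false negatives; the top_k highest-scoring survivors are kept.
    """
    ceiling = max_ratio * positive_score
    kept = [(doc_id, score) for doc_id, score in candidates if score <= ceiling]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in kept[:top_k]]


# Toy scores (hypothetical): the positive passage scores 0.80.
candidates = [("a", 0.79), ("b", 0.70), ("c", 0.30), ("d", 0.75)]
print(select_hard_negatives(0.80, candidates))  # ['d', 'b']
```

Candidate "a" is excluded because it scores almost as high as the positive. Both the ratio and the number of negatives kept are free parameters of this kind of pipeline.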
axioms (2)
- domain assumption: Contrastive objectives with in-batch and hard negatives improve embedding quality
- domain assumption: Removing the causal mask during contrastive training is beneficial for bidirectional representations
Forward citations
Cited by 25 Pith papers
-
MeMo: Memory as a Model
MeMo encodes new knowledge into a separate memory model for frozen LLMs, achieving strong performance on BrowseComp-Plus, NarrativeQA, and MuSiQue while capturing cross-document relationships and remaining robust to r...
-
Beyond Binary: Reframing GUI Critique as Continuous Semantic Alignment
BBCritic uses contrastive learning to align GUI actions in a continuous affordance space, outperforming larger binary critic models on a new four-level hierarchical benchmark while enabling zero-shot transfer.
-
SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
A softmax-weighted centroid of the local top-K documents interpolated with the query improves nDCG@10 for frozen embedding models across seven families on held-out BEIR data.
-
Test-Time Compute for Dense Retrieval: Agentic Program Generation with Frozen Embedding Models
Agentic program search over frozen embedding APIs yields a parameter-free inference algebra—a softmax-weighted centroid of top-K documents interpolated with the query—that lifts nDCG@10 across seven model families on ...
-
SkillRet: A Large-Scale Benchmark for Skill Retrieval in LLM Agents
SkillRet benchmark shows fine-tuned retrievers improve NDCG@10 by 13+ points over prior models on large-scale skill retrieval for LLM agents.
-
TabEmbed: Benchmarking and Learning Generalist Embeddings for Tabular Understanding
TabEmbed is the first generalist embedding model for tabular data that unifies classification and retrieval in one space via contrastive learning and outperforms text embedding models on the new TabBench benchmark.
-
Rethinking Reasoning-Intensive Retrieval: Evaluating and Advancing Retrievers in Agentic Search Systems
BRIGHT-Pro and RTriever-Synth advance reasoning-intensive retrieval by adding multi-aspect evidence evaluation and aspect-decomposed synthetic training, with the fine-tuned RTriever-4B showing gains over its base model.
-
mEOL: Training-Free Instruction-Guided Multimodal Embedder for Vector Graphics and Image Retrieval
mEOL creates aligned embeddings for text, images, and SVGs using instruction-guided MLLM one-word summaries and semantic SVG rewriting, outperforming baselines on a new text-to-SVG retrieval benchmark.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
-
Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.
-
Aspect-Aware Content-Based Recommendations for Mathematical Research Papers
The authors introduce aspect-aware datasets GoldRiM and SilverRiM for math papers and AchGNN, a heterogeneous GNN that outperforms prior methods by jointly modeling textual semantics, citations, and author lineage acr...
-
Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.
-
Reliable Answers for Recurring Questions: Boosting Text-to-SQL Accuracy with Template Constrained Decoding
TeCoD improves Text-to-SQL execution accuracy by up to 36% over in-context learning and cuts latency 2.2x on matched queries by extracting templates from historical pairs and enforcing them with constrained decoding.
-
Exploring Audio Hallucination in Egocentric Video Understanding
AV-LLMs hallucinate audio from visuals in egocentric videos, scoring only 27.3% accuracy on foreground sounds and 39.5% on background sounds in a 1000-question evaluation.
-
ViLL-E: Video LLM Embeddings for Retrieval
ViLL-E introduces a dynamic embedding mechanism and joint contrastive-generative training for VideoLLMs, delivering up to 7% gains in temporal localization and 4% in video retrieval while enabling new zero-shot capabilities.
-
Geometry-Aware Localized Watermarking for Copyright Protection in Embedding-as-a-Service
GeoMark decouples local watermark triggering from centralized ownership attribution using geometry-separated anchors and adaptive neighborhoods to improve robustness against paraphrasing, dimension changes, and cluste...
-
Benchmarking and Enabling Efficient Chinese Medical Retrieval via Asymmetric Encoders
New CMedTEB benchmark and CARE asymmetric retriever outperform symmetric models on Chinese medical retrieval tasks while preserving low latency.
-
Regime-Conditional Retrieval: Theory and a Transferable Router for Two-Hop QA
Two-hop QA retrieval performance depends on whether the hop-2 entity is in the question or bridge passage, and a simple predicate-based router trained on one dataset transfers to improve R@5 on others.
-
BridgeRAG: Training-Free Bridge-Conditioned Retrieval for Multi-Hop Question Answering
BridgeRAG improves training-free multi-hop retrieval by using a bridge-conditioned LLM scorer to rank evidence chains, achieving new best R@5 scores on MuSiQue, 2WikiMultiHopQA, and HotpotQA.
-
DeepImagine: Learning Biomedical Reasoning via Successive Counterfactual Imagining
DeepImagine trains LLMs on counterfactual pairs from clinical trials using supervised fine-tuning and reinforcement learning to improve outcome prediction by approximating causal mechanisms.
-
AFMRL: Attribute-Enhanced Fine-Grained Multi-Modal Representation Learning in E-commerce
AFMRL uses MLLM-generated attributes in attribute-guided contrastive learning and retrieval-aware reinforcement to achieve SOTA fine-grained multimodal retrieval on e-commerce datasets.
-
Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Qwen3-VL-Embedding-8B achieves state-of-the-art performance with a 77.8 overall score on the MMEB-V2 multimodal embedding benchmark.
-
Qwen3 Embedding: Advancing Text Embedding and Reranking Through Foundation Models
Qwen3 Embedding models in 0.6B-8B sizes achieve state-of-the-art results on MTEB and retrieval tasks including code, cross-lingual, and multilingual retrieval through unsupervised pre-training, supervised fine-tuning,...
Reference graph
Works this paper leans on
-
[1]
Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain
C.J. Adams, Daniel Borkan, Jeffrey Sorensen, Lucas Dixon, Lucy Vasserman, and Nithum Thain. Jigsaw unintended bias in toxicity classification, 2019. URL https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification
work page 2019
-
[2]
S em E val-2012 task 6: A pilot on semantic textual similarity
Eneko Agirre, Daniel Cer, Mona Diab, and Aitor Gonzalez-Agirre. S em E val-2012 task 6: A pilot on semantic textual similarity. In Eneko Agirre, Johan Bos, Mona Diab, Suresh Manandhar, Yuval Marton, and Deniz Yuret (eds.), * SEM 2012: The First Joint Conference on Lexical and Computational Semantics -- Volume 1: Proceedings of the main conference and the ...
work page 2012
-
[6]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33: 0 1877--1901, 2020
work page 1901
-
[7]
Efficient intent detection with dual sentence encoders
I \ n igo Casanueva, Tadas Temcinas, Daniela Gerz, Matthew Henderson, and Ivan Vulic. Efficient intent detection with dual sentence encoders. In Proceedings of the 2nd Workshop on NLP for ConvAI - ACL 2020, mar 2020. URL https://arxiv.org/abs/2003.04807. Data available at https://github.com/PolyAI-LDN/task-specific-datasets
-
[9]
Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. Bge m3-embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation, 2023
work page 2023
-
[13]
Sparsegpt: Massive language models can be accurately pruned in one-shot
Elias Frantar and Dan Alistarh. Sparsegpt: Massive language models can be accurately pruned in one-shot. In International Conference on Machine Learning, pp.\ 10323--10337. PMLR, 2023
work page 2023
-
[18]
The stanford natural language inference (snli) corpus, 2022
Stanford NLP Group et al. The stanford natural language inference (snli) corpus, 2022
work page 2022
-
[19]
Retrieval augmented language model pre-training
Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Mingwei Chang. Retrieval augmented language model pre-training. In International conference on machine learning, pp.\ 3929--3938. PMLR, 2020
work page 2020
-
[26]
Junseong Kim, Seolhwa Lee, Jihoon Kwon, Sangmo Gu, Yejin Kim, Minkyung Cho, Jy yong Sohn, and Chanyeol Choi. Linq-embed-mistral: Elevating text retrieval with improved gpt data through task-specific control and quality refinement. linq ai research blog, 2024. URL https://getlinq.com/blog/linq-embed-mistral/
work page 2024
-
[27]
Natural questions: a benchmark for question answering research
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7: 0 453--466, 2019
work page 2019
-
[28]
Newsweeder: Learning to filter netnews
Ken Lang. Newsweeder: Learning to filter netnews. In Machine learning proceedings 1995, pp.\ 331--339. Elsevier, 1995
work page 1995
-
[30]
Open source strikes bread - new fluffy embeddings model, 2024 b
Sean Lee, Aamir Shakir, Darius Koenig, and Julius Lipp. Open source strikes bread - new fluffy embeddings model, 2024 b . URL https://www.mixedbread.ai/blog/mxbai-embed-large-v1
work page 2024
-
[31]
u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \
Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33: 0 9459--9474, 2020
work page 2020
-
[32]
Paq: 65 million probably-asked questions and what you can do with them
Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich K \"u ttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you can do with them. Transactions of the Association for Computational Linguistics, 9: 0 1098--1115, 2021
work page 2021
-
[33]
Datasets: A community library for natural language processing
Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario S a s ko, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugg...
work page 2021
-
[38]
Chat QA : Surpassing GPT -4 on conversational QA and RAG
Zihan Liu, Wei Ping, Rajarshi Roy, Peng Xu, Mohammad Shoeybi, and Bryan Catanzaro. Chat QA : Surpassing GPT -4 on conversational QA and RAG . arXiv preprint arXiv:2401.10225, 2024
-
[39]
Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. Learning word vectors for sentiment analysis. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 142--150, Portland, Oregon, USA, June 2011. Association for Computational Linguistics. UR...
work page 2011
-
[40]
Tweet sentiment extraction, 2020
Wei Chen Maggie, Phil Culliton. Tweet sentiment extraction, 2020. URL https://kaggle.com/competitions/tweet-sentiment-extraction
work page 2020
-
[41]
Www'18 open challenge: financial opinion mining and question answering
Macedo Maia, Siegfried Handschuh, Andr \'e Freitas, Brian Davis, Ross McDermott, Manel Zarrouk, and Alexandra Balahur. Www'18 open challenge: financial opinion mining and question answering. In Companion proceedings of the the web conference 2018, pp.\ 1941--1942, 2018
work page 2018
-
[43]
Hidden factors and hidden topics: understanding rating dimensions with review text
Julian McAuley and Jure Leskovec. Hidden factors and hidden topics: understanding rating dimensions with review text. In Proceedings of the 7th ACM conference on Recommender systems, pp.\ 165--172, 2013 b
work page 2013
-
[44]
Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024 a
Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfr-embedding-2: Advanced text embedding with multi-stage training, 2024 a . URL https://huggingface.co/Salesforce/SFR-Embedding-2_R
work page 2024
-
[45]
Sfrembedding-mistral: enhance text retrieval with transfer learning
Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024 b
work page 2024
-
[46]
Distributed representations of words and phrases and their compositionality
Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 2013
work page 2013
- [47]
-
[48]
NV-Retriever: Improving text embedding models with effective hard-negative mining
Gabriel de Souza P Moreira, Radek Osmulski, Mengyao Xu, Ronay Ak, Benedikt Schifferer, and Even Oldridge. NV-Retriever: Improving text embedding models with effective hard-negative mining . arXiv preprint arXiv:2407.15831, 2024
-
[52]
MS MARCO : A human-generated machine reading comprehension dataset
Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO : A human-generated machine reading comprehension dataset. 2016
work page 2016
-
[55]
New embedding models and api updates, 2024
OpenAI. New embedding models and api updates, 2024
work page 2024
-
[56]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 2022
work page 2022
-
[57]
Exploring the limits of transfer learning with a unified text-to-text transformer
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020
work page 2020
-
[59]
Stackexchange (title, body) pairs, 2021 a
Nils Reimers. Stackexchange (title, body) pairs, 2021 a . URL https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_title_body_jsonl
work page 2021
-
[60]
Reddit (title, body) pairs, 2021 b
Nils Reimers. Reddit (title, body) pairs, 2021 b . URL https://huggingface.co/datasets/sentence-transformers/reddit-title-body
work page 2021
-
[62]
The probabilistic relevance framework: Bm25 and beyond
Stephen Robertson, Hugo Zaragoza, et al. The probabilistic relevance framework: Bm25 and beyond. Foundations and Trends in Information Retrieval , 3 0 (4): 0 333--389, 2009
work page 2009
-
[65]
Stack exchange data dump, 2023
Stack-Exchange-Community. Stack exchange data dump, 2023
work page 2023
-
[69]
George Tsatsaronis, Georgios Balikas, Prodromos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R Alvers, Dirk Weissenborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, et al. An overview of the bioasq large-scale biomedical semantic indexing and question answering competition. BMC bioinformatics, 16: 0 1--28, 2015
work page 2015
-
[70]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, ukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
work page 2017
-
[71]
voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024
Voyage-AI. voyage-large-2-instruct: Instruction-tuned and rank 1 on mteb, 2024
work page 2024
-
[72]
Retrieval of the best counterargument without prior topic knowledge
Henning Wachsmuth, Shahbaz Syed, and Benno Stein. Retrieval of the best counterargument without prior topic knowledge. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 241--251, 2018
work page 2018
-
[74]
Superglue: A stickier benchmark for general-purpose language understanding systems
Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32, 2019
work page 2019
-
[78]
Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis
Yuxuan Wang, Daisy Stanton, Yu Zhang, RJ-Skerry Ryan, Eric Battenberg, Joel Shor, Ying Xiao, Ye Jia, Fei Ren, and Rif A Saurous. Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis. In International conference on machine learning, pp.\ 5180--5189. PMLR, 2018
work page 2018
-
[82]
Miracl: A multilingual retrieval dataset covering 18 diverse languages
Xinyu Zhang, Nandan Thakur, Odunayo Ogundepo, Ehsan Kamalloo, David Alfonso-Hermelo, Xiaoguang Li, Qun Liu, Mehdi Rezagholizadeh, and Jimmy Lin. Miracl: A multilingual retrieval dataset covering 18 diverse languages. Transactions of the Association for Computational Linguistics, 11: 0 1114--1131, 2023
work page 2023
-
[83]
Stack Exchange Data Dump , author=
-
[84]
Linq AI Research Blog , author=
Linq-Embed-Mistral: Elevating Text Retrieval with Improved GPT Data Through Task-Specific Control and Quality Refinement. Linq AI Research Blog , author=. 2024 , url=
work page 2024
-
[85]
GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
Gptq: Accurate post-training quantization for generative pre-trained transformers , author=. arXiv preprint arXiv:2210.17323 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[86]
A simple and effective pruning approach for large language models , author=. arXiv preprint arXiv:2306.11695 , year=
-
[87]
arXiv preprint arXiv:2104.07081 , year=
TWEAC: transformer with extendable QA agent classifiers , author=. arXiv preprint arXiv:2104.07081 , year=
-
[88]
International Conference on Machine Learning , pages=
Sparsegpt: Massive language models can be accurately pruned in one-shot , author=. International Conference on Machine Learning , pages=. 2023 , organization=
work page 2023
-
[89]
arXiv preprint arXiv:2409.15700 , year=
Making Text Embedders Few-Shot Learners , author=. arXiv preprint arXiv:2409.15700 , year=
-
[90]
arXiv preprint arXiv:2310.01208 , year=
Label supervised llama finetuning , author=. arXiv preprint arXiv:2310.01208 , year=
-
[91]
Advances in neural information processing systems , volume=
Attention is all you need , author=. Advances in neural information processing systems , volume=
-
[92]
Hover: A dataset for many-hop fact extraction and claim verification, 2020
HoVer: A dataset for many-hop fact extraction and claim verification , author=. arXiv preprint arXiv:2011.03088 , year=
-
[93]
TyDi: A multi-lingual benchmark for dense retrieval , author=
Mr. TyDi: A multi-lingual benchmark for dense retrieval , author=. arXiv preprint arXiv:2108.08787 , year=
-
[94]
arXiv preprint arXiv:2210.13777 , year=
SciFact-open: Towards open-domain scientific claim verification , author=. arXiv preprint arXiv:2210.13777 , year=
-
[95]
Transactions of the Association for Computational Linguistics , volume=
Miracl: A multilingual retrieval dataset covering 18 diverse languages , author=. Transactions of the Association for Computational Linguistics , volume=. 2023 , publisher=
work page 2023
-
[96]
Moreira, Gabriel de Souza P and Osmulski, Radek and Xu, Mengyao and Ak, Ronay and Schifferer, Benedikt and Oldridge, Even , journal=
-
[97]
Advances in neural information processing systems , volume=
Superglue: A stickier benchmark for general-purpose language understanding systems , author=. Advances in neural information processing systems , volume=
-
[98]
FEVER: a large-scale dataset for Fact Extraction and VERification
FEVER: a large-scale dataset for fact extraction and VERification , author=. arXiv preprint arXiv:1803.05355 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[99]
arXiv preprint arXiv:2401.00368
Improving text embeddings with large language models , author=. arXiv preprint arXiv:2401.00368 , year=
-
[100]
arXiv preprint arXiv:2403.20327 , year=
Gecko: Versatile text embeddings distilled from large language models , author=. arXiv preprint arXiv:2403.20327 , year=
-
[101]
Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533.
-
[102]
SFR-Embedding-Mistral: Enhance text retrieval with transfer learning. Salesforce AI Research Blog.
-
[103]
voyage-large-2-instruct: Instruction-tuned and rank 1 on MTEB.
-
[104]
Generative representational instruction tuning. arXiv preprint arXiv:2402.09906, 2024.
-
[105]
Text and code embeddings by contrastive pre-training. arXiv preprint arXiv:2201.10005.
-
[106]
Language models are few-shot learners. Advances in Neural Information Processing Systems.
-
[107]
BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
-
[108]
Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research.
-
[109]
SFR-Embedding-2: Advanced Text Embedding with Multi-stage Training. 2024.
-
[110]
Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.
-
[111]
Towards general text embeddings with multi-stage contrastive learning. arXiv preprint arXiv:2308.03281.
-
[112]
Large dual encoders are generalizable retrievers. arXiv preprint arXiv:2112.07899.
-
[113]
BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. 2023.
-
[114]
Nguyen, Tri and Rosenberg, Mir and Song, Xia and Gao, Jianfeng and Tiwary, Saurabh and Majumder, Rangan and Deng, Li.
-
[115]
MTEB: Massive Text Embedding Benchmark. Muennighoff, Niklas and Tazi, Nouamane and Magne, Lo. arXiv preprint arXiv:2210.07316.
-
[116]
Mistral 7B. arXiv preprint arXiv:2310.06825.
-
[117]
Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
-
[118]
Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.
-
[119]
SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.
-
[120]
Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems.
-
[121]
Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems.
-
[122]
Liu, Zihan and Ping, Wei and Roy, Rajarshi and Xu, Peng and Shoeybi, Mohammad and Catanzaro, Bryan. Chat