Nomic Embed: Training a Reproducible Long Context Text Embedder

Andriy Mulyar; Brandon Duderstadt; John X. Morris; Zach Nussbaum

arxiv: 2402.01613 · v2 · pith:YENANHR4new · submitted 2024-02-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-21 14:49 UTC · model grok-4.3

keywords text embeddings · long context · reproducibility · open source · MTEB benchmark · LoCo benchmark · contrastive learning

The pith

A fully open 8192-context text embedder outperforms OpenAI Ada-002 and text-embedding-3-small on MTEB and LoCo benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper details the training of nomic-embed-text-v1 as an English text embedding model that handles up to 8192 tokens. It claims this is the first model released with full open-source weights, data, and replication code that surpasses the cited OpenAI embedders on short-context and long-context benchmarks. A sympathetic reader cares because the release removes reliance on closed APIs and hidden training details for semantic search and retrieval tasks. The work focuses on making the entire process verifiable and extensible by the community.

Core claim

Nomic-embed-text-v1 is the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark, with the full curated training data and code released for exact replication.

What carries the argument

The contrastive training pipeline on curated data that produces embeddings supporting 8192-token contexts while enabling full replication.
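
To make "contrastive training" concrete, here is a minimal sketch of an InfoNCE-style loss with in-batch negatives, a common choice for embedding models. It is an illustration only, not the paper's code: the function names, the temperature of 0.05, and the use of cosine similarity are assumptions, and the actual training code lives in the repository linked in the abstract.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v] if norm > 0 else list(v)

def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    documents[i] is treated as the positive for queries[i]; every other
    document in the batch acts as a negative. The temperature value is a
    hypothetical default, not taken from the paper.
    """
    q = [l2_normalize(v) for v in queries]
    d = [l2_normalize(v) for v in documents]
    total = 0.0
    for i, qi in enumerate(q):
        # Cosine similarities scaled by the temperature.
        logits = [dot(qi, dj) / temperature for dj in d]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        # Cross-entropy with the matching document as the target class.
        total += log_z - logits[i]
    return total / len(q)

aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the loss is close to zero when each query sits next to its own document and large when the pairs are swapped, which is the pressure that pulls matching texts together in embedding space.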

If this is right

Researchers can exactly replicate the model training and evaluation using the released code and data.
Long-context retrieval applications can shift from proprietary APIs to this open model without loss of benchmark performance.
The open data release allows direct inspection and further curation of the examples used for training.
Subsequent models can adopt the same pipeline to target even longer contexts or additional languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reproducibility standard set here could serve as a template for other embedding or retrieval models where data transparency matters.
Wider adoption might reduce dependence on closed embedding services in production systems handling documents longer than typical short-context limits.
The approach invites targeted experiments on how specific data curation choices affect long-context performance.

Load-bearing premise

The curated training data and benchmark splits contain no hidden overlaps or selection biases that inflate scores relative to the OpenAI baselines.
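
One way a reader could probe this premise with the released data is an exact n-gram overlap check between training texts and benchmark test texts. The sketch below is a hypothetical audit, not a procedure described in the paper; the function names and the default window of 8 words are assumptions chosen for illustration.

```python
import re

def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def flag_overlaps(train_texts, test_texts, n=8):
    """Indices of test texts that share at least one n-gram with any training text."""
    train_grams = set()
    for text in train_texts:
        train_grams |= word_ngrams(text, n)
    return [i for i, text in enumerate(test_texts)
            if word_ngrams(text, n) & train_grams]

train = ["the quick brown fox jumps over the lazy dog"]
test = [
    "He said the quick brown fox jumps over everything",
    "completely unrelated sentence about embeddings",
]
flagged = flag_overlaps(train, test, n=4)  # -> [0]
```

An exact check like this only catches verbatim reuse. Paraphrased or lightly edited leakage would need a fuzzier comparison, such as near-duplicate hashing or embedding similarity.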

What would settle it

An independent replication of the full training process that yields scores below OpenAI Ada-002 on MTEB or LoCo, or discovery of substantial data leakage in the evaluation sets, would falsify the performance claim.

Original abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Nomic has released a fully open 8192-context embedding model with data and code, which is the practical advance here even if the training recipe is standard.

The main thing to take from this paper is the complete release of nomic-embed-text-v1: open weights and replication code under Apache 2.0, together with the full curated training corpus. That combination at 8192 context length is what makes it stand out from most other open embedding efforts, which stop at weights only. The authors train with a contrastive loss on a large English dataset and report stronger numbers than OpenAI Ada-002 and text-embedding-3-small on both MTEB and the long-context LoCo benchmark. The code and data at the GitHub link let others actually rerun or modify the work, which is rare and directly useful.

The underlying approach follows established contrastive embedding methods, so the novelty sits in the openness and context length rather than in a new algorithm. Releasing the exact data and training script gives real value for anyone who wants to build or audit long-context retrieval systems.

One soft spot is the handling of potential benchmark contamination. The paper states that the corpus was deduplicated, but the methods do not include a detailed public audit against every MTEB and LoCo test instance. If even modest overlap exists, the performance delta could be inflated. Because the data is released, though, independent checks are possible, so this is a manageable rather than fatal issue.

This work is aimed at practitioners and researchers who need reproducible long-context embeddings for semantic search or RAG pipelines. It is not a theoretical paper, but the artifact is concrete enough that a serious referee should see it to verify the data curation and evaluation details.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the training procedure for nomic-embed-text-v1, an 8192-token English text embedding model. It claims to be the first fully open-source, open-weights, open-data, reproducible model of this context length and reports that it outperforms OpenAI text-embedding-ada-002 and text-embedding-3-small on both the short-context MTEB benchmark and the long-context LoCo benchmark. The authors release the training code and model weights under an Apache 2.0 license, along with the full curated training corpus, to enable exact replication.

Significance. If the reported performance gains are not artifacts of training-data contamination, the work provides a concrete, fully reproducible long-context embedding baseline that exceeds current closed-source models on standard suites. The explicit release of the complete training data and replication code is a notable strength that directly supports the reproducibility claim and lowers the barrier for subsequent research.

major comments (2)

[Methods] Methods section: the manuscript states that the training corpus was deduplicated but does not describe an exhaustive overlap audit (exact n-gram, semantic similarity, or embedding-based) against every instance in the MTEB and LoCo test sets. Because the central claim is that nomic-embed-text-v1 outperforms the OpenAI baselines, any undetected leakage would directly undermine the performance comparison and the assertion of a genuine advance.
[Experiments] Evaluation protocol: details on the precise MTEB and LoCo evaluation settings, including any post-training adjustments, prompt templates, or normalization steps, are not fully specified. Without these, independent replication of the headline numbers cannot be verified even with the released code and data.
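
To illustrate why the second comment matters, the sketch below shows the kind of post-processing that turns per-token vectors into a single embedding: masked mean pooling followed by L2 normalization. Both steps are assumptions here, since the text above does not say which pooling or normalization the model uses, and the function names are hypothetical. Changing either step changes the similarity scores a benchmark sees, which is why it needs to be specified.

```python
import math

def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1, ignoring padding."""
    dim = len(token_vectors[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_vectors, attention_mask):
        if keep:
            count += 1
            for k in range(dim):
                summed[k] += vec[k]
    if count == 0:
        raise ValueError("attention mask selects no tokens")
    return [s / count for s in summed]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def embed(token_vectors, attention_mask):
    """Pooled, unit-length embedding (assumed pipeline, for illustration)."""
    return l2_normalize(mean_pool(token_vectors, attention_mask))

# The third token is padding and must not affect the result.
embedding = embed([[3.0, 0.0], [0.0, 4.0], [9.0, 9.0]], [1, 1, 0])
```

With unit-length embeddings, the dot product of two vectors equals their cosine similarity, so a reported protocol should state whether normalization is applied before scoring.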

minor comments (2)

[Introduction] The abstract and introduction use the phrase 'first fully reproducible' without a clear comparison table listing prior open embedding models and their released artifacts; adding such a table would strengthen the novelty claim.
[Results] Figure captions and axis labels in the benchmark plots should explicitly state the number of runs and error bars if any; current figures appear to report single-point estimates.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and reproducibility of our work. We address each major point below and have revised the manuscript to incorporate additional details on our methods and evaluation protocol.

Referee: [Methods] Methods section: the manuscript states that the training corpus was deduplicated but does not describe an exhaustive overlap audit (exact n-gram, semantic similarity, or embedding-based) against every instance in the MTEB and LoCo test sets. Because the central claim is that nomic-embed-text-v1 outperforms the OpenAI baselines, any undetected leakage would directly undermine the performance comparison and the assertion of a genuine advance.

Authors: We agree that an explicit description of overlap checks against the benchmark test sets would further strengthen confidence in the results. In the revised manuscript, we have expanded the Methods section to detail our deduplication procedure, which combined exact duplicate removal with MinHash-based near-duplicate detection at the n-gram level. We have also added a new paragraph reporting the results of a post-hoc overlap audit (using both n-gram and embedding-based similarity) between the released training corpus and the MTEB/LoCo test sets, confirming negligible leakage. Because the full curated training data is already released, any reader can independently reproduce or extend this audit.
revision: yes

Referee: [Experiments] Evaluation protocol: details on the precise MTEB and LoCo evaluation settings, including any post-training adjustments, prompt templates, or normalization steps, are not fully specified. Without these, independent replication of the headline numbers cannot be verified even with the released code and data.

Authors: We appreciate this feedback on documentation. Although the complete evaluation code is available in the public GitHub repository, we acknowledge that the manuscript should be self-contained. We have revised the Experiments section to explicitly list the prompt templates used for each task, the exact MTEB and LoCo evaluation configurations, any post-training normalization or pooling steps applied to the embeddings, and the absence of additional adjustments. These additions allow the reported numbers to be replicated from the paper alone while still pointing readers to the released code for full verification.
revision: yes
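
The MinHash-based near-duplicate detection named in the first response can be sketched as follows. This is a generic textbook version written for illustration, not the procedure from the paper or its repository: the shingle size, the number of hash functions, and the use of salted BLAKE2b from Python's `hashlib` are all assumptions.

```python
import hashlib
import re

def shingles(text, n=3):
    """Set of lowercase word n-grams (shingles) in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def minhash_signature(text, num_hashes=64, n=3):
    """For each salted hash function, keep the minimum hash over all shingles."""
    grams = shingles(text, n)
    if not grams:
        raise ValueError("text is shorter than one shingle")
    signature = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")  # a distinct salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(g.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for g in grams
        ))
    return signature

def estimated_jaccard(sig_a, sig_b):
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog near the river bank today"
b = "the quick brown fox jumps over the lazy dog near the river bank tonight"
c = "contrastive learning trains text embedding models with paired examples"

near = estimated_jaccard(minhash_signature(a), minhash_signature(b))
far = estimated_jaccard(minhash_signature(a), minhash_signature(c))
```

Pairs whose estimate exceeds a chosen threshold would be treated as near-duplicates. The threshold itself is a curation choice and is not given in the text above. At corpus scale, signatures are usually bucketed with locality-sensitive hashing rather than compared pairwise.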

Circularity Check

0 steps flagged

No circularity: empirical training outcome on public benchmarks

The paper reports the training of nomic-embed-text-v1 via contrastive learning on a released curated corpus, followed by direct measurement of performance on the public MTEB and LoCo benchmarks. No equations, uniqueness theorems, or self-citations are invoked to derive the central performance claims; the reported outperformance is an observed empirical result rather than a quantity defined in terms of itself or forced by internal fitting. The derivation chain consists of standard model training steps whose outputs are independently verifiable on external test sets, with no reduction of predictions to fitted inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance claim rests on a standard contrastive loss, a curated dataset whose construction details are not fully specified in the abstract, and benchmark protocols that assume no data contamination between training and test sets.

free parameters (2)

learning rate schedule and batch size
Standard hyperparameters chosen during training to reach the reported benchmark numbers.
data curation filters and sampling ratios
Choices made when assembling the training corpus that directly affect final embedding quality.

axioms (1)

domain assumption: MTEB and LoCo benchmarks provide unbiased estimates of embedding quality for downstream tasks
Invoked when claiming superiority over OpenAI models.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

OpenIIR: An Open Simulation Platform for Information Retrieval Research
cs.IR 2026-05 accept novelty 7.0

OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.

OpenIIR: An Open Simulation Platform for Information Retrieval Research
cs.IR 2026-05 unverdicted novelty 7.0

OpenIIR supplies a shared core, pluggable scenario interface, and four reference multi-agent IR simulation types that produce reproducible argument graphs, exposure logs, and fitness traces from LLM personas.

XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation
cs.AI 2026-04 unverdicted novelty 7.0

XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

Participatory provenance as representational auditing for AI-mediated public consultation
cs.AI 2026-04 unverdicted novelty 7.0

Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
cs.IR 2026-04 unverdicted novelty 7.0

MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall
cs.LG 2026-05 unverdicted novelty 6.0

Coordinate heterogeneity governs binary quantization performance via closed-form ranking fidelity expressions and a two-parameter scaling law, validated on 13 datasets across 6 embedding families.

Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs
cs.LG 2026-05 unverdicted novelty 6.0

LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding
cs.AI 2026-05 unverdicted novelty 6.0

TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only b...

Black-box model classification under the discriminative factorization
cs.LG 2026-05 unverdicted novelty 6.0

Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on ...

MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal
cs.IR 2026-05 unverdicted novelty 6.0

MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus
cs.CL 2026-05 unverdicted novelty 6.0

Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

LLMs Corrupt Your Documents When You Delegate
cs.CL 2026-04 unverdicted novelty 6.0

LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering
cs.DL 2026-03 accept novelty 6.0

ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension
cs.CL 2025-08 unverdicted novelty 6.0

SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.

GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval
cs.CL 2026-05 unverdicted novelty 5.0

GraphRAG with 7-8B local LLMs on 8GB VRAM hardware builds knowledge graphs from EHR docs and answers queries, with Llama 3.1 creating the largest graph, Qwen 2.5 scoring highest on quality, and models below ~7B failin...

Control Charts for Multi-agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against ...

Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval
cs.AI 2026-04 unverdicted novelty 5.0

Table representations must be permutation-invariant to preserve semantic structure, and a new header-aligned encoder moves toward this ideal while exposing fragility in existing LLM table embeddings.

BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
cs.IR 2026-04 unverdicted novelty 5.0

BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

Gaussian mixture models as a proxy for interacting language models
cs.CL 2025-05 unverdicted novelty 5.0

Interacting Gaussian mixture models with RAG-style updates are shown to mimic aspects of interacting LLMs and are used to prove lower bounds on polarization probability in the resulting Markov chain.

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
cs.CL 2024-12 unverdicted novelty 5.0

ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

Health System Scale Semantic Search Across Unstructured Clinical Notes
cs.IR 2026-04 unverdicted novelty 4.0

A semantic search system was deployed at health-system scale across 166 million clinical notes, delivering sub-second latency, ~$4000 monthly cost, and 24-89% faster chart abstraction with maintained agreement.

Reference graph

Works this paper leans on

142 extracted references · 142 canonical work pages · cited by 20 Pith papers · 4 internal anchors

[1]

Ms marco: A human generated machine reading comprehension dataset, 2018

Payal Bajaj, Daniel Campos, Nick Craswell, Li Deng, Jianfeng Gao, Xiaodong Liu, Rangan Majumder, Andrew McNamara, Bhaskar Mitra, Tri Nguyen, Mir Rosenberg, Xia Song, Alina Stoica, Saurabh Tiwary, and Tong Wang. Ms marco: A human generated machine reading comprehension dataset, 2018

work page 2018
[2]

NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation

bloc97. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation. , 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14lz7j5/ntkaware_scaled_rope_allows_llama_models_to_have/

work page 2023
[3]

Bowman, Gabor Angeli, Christopher Potts, and Christopher D

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, 2015

work page 2015
[4]

Extending Context Window of Large Language Models via Positional Interpolation

Shouyuan Chen, Sherman Wong, Liangjian Chen, and Yuandong Tian. Extending context window of large language models via positional interpolation, 2023. URL https://arxiv.org/abs/2306.15595

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

S imple E nglish W ikipedia: A new text simplification task

William Coster and David Kauchak. S imple E nglish W ikipedia: A new text simplification task. In Dekang Lin, Yuji Matsumoto, and Rada Mihalcea (eds.), Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pp.\ 665--669, Portland, Oregon, USA, June 2011. Association for Computational Linguist...

work page 2011
[6]

Fu, Stefano Ermon, Atri Rudra, and Christopher Ré

Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022

work page 2022
[7]

Smith, and Matt Gardner

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A. Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. 2021

work page 2021
[8]

Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding, 2019

work page 2019
[9]

Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, 2023

emozilla. Dynamically scaled rope further increases performance of long context llama with zero fine-tuning, 2023. URL https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/

work page 2023
[10]

Open Question Answering Over Curated and Extracted Knowledge Bases

Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. Open Question Answering Over Curated and Extracted Knowledge Bases . In KDD, 2014

work page 2014
[11]

ELI5: long form question answering

Angela Fan, Yacine Jernite, Ethan Perez, David Grangier, Jason Weston, and Michael Auli. ELI5: long form question answering. In Anna Korhonen, David R. Traum, and Llu \' s M \` a rquez (eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers , pp....

work page doi:10.18653/v1/p19-1346 2019
[12]

Overcoming the lack of parallel data in sentence compression

Katja Filippova and Yasemin Altun. Overcoming the lack of parallel data in sentence compression. In David Yarowsky, Timothy Baldwin, Anna Korhonen, Karen Livescu, and Steven Bethard (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp.\ 1481--1491, Seattle, Washington, USA, October 2013. Association for Comput...

work page 2013
[13]

Wikimedia downloads

Wikimedia Foundation. Wikimedia downloads. URL https://dumps.wikimedia.org

work page
[14]

Condenser: a pre-training architecture for dense retrieval, 2021

Luyu Gao and Jamie Callan. Condenser: a pre-training architecture for dense retrieval, 2021

work page 2021
[15]

Simcse: Simple contrastive learning of sentence embeddings, 2022

Tianyu Gao, Xingcheng Yao, and Danqi Chen. Simcse: Simple contrastive learning of sentence embeddings, 2022

work page 2022
[16]

Cramming: Training a language model on a single gpu in one day, 2022

Jonas Geiping and Tom Goldstein. Cramming: Training a language model on a single gpu in one day, 2022

work page 2022
[17]

Amazonqa: A review-based question answering task, 2019

Mansi Gupta, Nitish Kulkarni, Raghuveer Chanda, Anirudha Rayasam, and Zachary C Lipton. Amazonqa: A review-based question answering task, 2019

work page 2019
[18]

Jina embeddings: A novel set of high-performance sentence embedding models, 2023

Michael Günther, Louis Milliken, Jonathan Geuter, Georgios Mastrapas, Bo Wang, and Han Xiao. Jina embeddings: A novel set of high-performance sentence embedding models, 2023

work page 2023
[19]

Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, 2024

Michael Günther, Jackmin Ong, Isabelle Mohr, Alaeddine Abdessalem, Tanguy Abel, Mohammad Kalim Akram, Susana Guzman, Georgios Mastrapas, Saba Sturua, Bo Wang, Maximilian Werk, Nan Wang, and Han Xiao. Jina embeddings 2: 8192-token general-purpose text embeddings for long documents, 2024

work page 2024
[22]

CodeSearchNet Challenge: Evaluating the State of Semantic Code Search

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: Evaluating the state of semantic code search. arXiv preprint arXiv:1909.09436, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909
[23]

Unsupervised dense information retrieval with contrastive learning, 2022 a

Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning, 2022 a

work page 2022
[24]

Atlas: Few-shot learning with retrieval augmented language models, 2022 b

Gautier Izacard, Patrick Lewis, Maria Lomeli, Lucas Hosseini, Fabio Petroni, Timo Schick, Jane Dwivedi-Yu, Armand Joulin, Sebastian Riedel, and Edouard Grave. Atlas: Few-shot learning with retrieval augmented language models, 2022 b

work page 2022
[25]

Things I'm learning while training superhot

kaiokendev. Things I'm learning while training superhot. , 2023. URL https://kaiokendev.github.io/til#extending-context-to-8k

work page 2023
[26]

Dense passage retrieval for open-domain question answering, 2020

Vladimir Karpukhin, Barlas Oğuz, Sewon Min, Patrick Lewis, Ledell Wu, Sergey Edunov, Danqi Chen, and Wen tau Yih. Dense passage retrieval for open-domain question answering, 2020

work page 2020
[27]

Gooaq: Open question answering with diverse answer types, 2021

Daniel Khashabi, Amos Ng, Tushar Khot, Ashish Sabharwal, Hannaneh Hajishirzi, and Chris Callison-Burch. Gooaq: Open question answering with diverse answer types, 2021

work page 2021
[28]

Wikihow: A large scale text summarization dataset, 2018

Mahnaz Koupaee and William Yang Wang. Wikihow: A large scale text summarization dataset, 2018

work page 2018
[29]

Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021 a

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive nlp tasks, 2021 a

work page 2021
[30]

Paq: 65 million probably-asked questions and what you can do with them, 2021 b

Patrick Lewis, Yuxiang Wu, Linqing Liu, Pasquale Minervini, Heinrich Küttler, Aleksandra Piktus, Pontus Stenetorp, and Sebastian Riedel. Paq: 65 million probably-asked questions and what you can do with them, 2021 b

work page 2021
[31]

Towards general text embeddings with multi-stage contrastive learning, 2023

Zehan Li, Xin Zhang, Yanzhao Zhang, Dingkun Long, Pengjun Xie, and Meishan Zhang. Towards general text embeddings with multi-stage contrastive learning, 2023

work page 2023
[32]

Roberta: A robustly optimized bert pretraining approach, 2019

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach, 2019

work page 2019
[33]

Kyle Lo, Lucy Lu Wang, Mark Neumann, Rodney Kinney, and Dan S. Weld. S2orc: The semantic scholar open research corpus, 2020

work page 2020
[34]

Decoupled weight decay regularization, 2019

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization, 2019

work page 2019
[35]

Scaling deep contrastive learning batch size under memory limited setup

Jiawei Han Luyu Gao, Yunyi Zhang and Jamie Callan. Scaling deep contrastive learning batch size under memory limited setup. In Proceedings of the 6th Workshop on Representation Learning for NLP, 2021

work page 2021
[36]

Mixed precision training, 2018

Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu. Mixed precision training, 2018

work page 2018
[37]

Sgpt: Gpt sentence embeddings for semantic search, 2022

Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search, 2022

work page 2022
[38]

Mteb: Massive text embedding benchmark, 2023

Niklas Muennighoff, Nouamane Tazi, Loïc Magne, and Nils Reimers. Mteb: Massive text embedding benchmark, 2023

work page 2023
[39]

Text and code embeddings by contrastive pre-training, 2022

Arvind Neelakantan, Tao Xu, Raul Puri, Alec Radford, Jesse Michael Han, Jerry Tworek, Qiming Yuan, Nikolas Tezak, Jong Wook Kim, Chris Hallacy, Johannes Heidecke, Pranav Shyam, Boris Power, Tyna Eloundou Nekoul, Girish Sastry, Gretchen Krueger, David Schnurr, Felipe Petroski Such, Kenny Hsu, Madeleine Thompson, Tabarak Khan, Toki Sherbakov, Joanne Jang, P...

work page 2022
[41]

Zhao, Yi Luan, Keith B

Jianmo Ni, Chen Qu, Jing Lu, Zhuyun Dai, Gustavo Hernández Ábrego, Ji Ma, Vincent Y. Zhao, Yi Luan, Keith B. Hall, Ming-Wei Chang, and Yinfei Yang. Large dual encoders are generalizable retrievers, 2021a

[42]

Jianmo Ni, Gustavo Hernández Ábrego, Noah Constant, Ji Ma, Keith B. Hall, Daniel Cer, and Yinfei Yang. Sentence-t5: Scalable sentence encoders from pre-trained text-to-text models, 2021b

[44]

Bowen Peng, Jeffrey Quesnelle, Honglu Fan, and Enrico Shippole. Yarn: Efficient context window extension of large language models, 2023

[46]

Jacob Portes, Alex Trott, Sam Havens, Daniel King, Abhinav Venigalla, Moin Nadeem, Nikhil Sardana, Daya Khudia, and Jonathan Frankle. Mosaicbert: A bidirectional encoder optimized for fast pretraining, 2023

[47]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints, 2019

[48]

Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. Zero: Memory optimizations toward training trillion parameter models, 2020

[49]

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. arXiv e-prints, art. arXiv:1606.05250, 2016

[50]

Ori Ram, Yoav Levine, Itay Dalmedigos, Dor Muhlgay, Amnon Shashua, Kevin Leyton-Brown, and Yoav Shoham. In-context retrieval-augmented language models, 2023

[51]

Nils Reimers and Iryna Gurevych. Sentence-bert: Sentence embeddings using siamese bert-networks, 2019

[52]

Nils Reimers, Elliot Choi, Amr Kayid, Alekhya Nandula, Manoj Govindassamy, and Abdullah Elkady. Introducing embed v3, Nov 2023. URL https://txt.cohere.com/introducing-embed-v3/

[53]

Jon Saad-Falcon, Dan Fu, and Simran Arora. Long-context retrieval models with monarch mixer, Jan 2024. URL https://hazyresearch.stanford.edu/blog/2024-01-11-m2-bert-retrieval

[55]

Uri Shaham, Elad Segal, Maor Ivgi, Avia Efrat, Ori Yoran, Adi Haviv, Ankit Gupta, Wenhan Xiong, Mor Geva, Jonathan Berant, and Omer Levy. SCROLLS: Standardized CompaRison over long language sequences. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 12007–12021, Abu Dhabi, United Arab Emirates, December ...

[57]

Noam Shazeer. Glu variants improve transformer, 2020

[58]

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020

[59]

Hongjin Su, Weijia Shi, Jungo Kasai, Yizhong Wang, Yushi Hu, Mari Ostendorf, Wen-tau Yih, Noah A. Smith, Luke Zettlemoyer, and Tao Yu. One embedder, any task: Instruction-finetuned text embeddings, 2023a

[60]

Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. Roformer: Enhanced transformer with rotary position embedding, 2023b

[61]

Matthew Tancik, Pratul P. Srinivasan, Ben Mildenhall, Sara Fridovich-Keil, Nithin Raghavan, Utkarsh Singhal, Ravi Ramamoorthi, Jonathan T. Barron, and Ren Ng. Fourier features let networks learn high frequency functions in low dimensional domains, 2020. URL https://arxiv.org/abs/2006.10739

[62]

Nandan Thakur, Nils Reimers, Andreas Rücklé, Abhishek Srivastava, and Iryna Gurevych. Beir: A heterogenous benchmark for zero-shot evaluation of information retrieval models, 2021

[63]

James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. FEVER: a large-scale dataset for fact extraction and VERification. In NAACL-HLT, 2018

[64]

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding, 2019

[65]

Voyage. Excited to announce voyage embeddings!, Nov 2023. URL https://blog.voyageai.com/2023/10/29/voyage-embeddings/

[67]

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In the Proceedings of ICLR, 2019

[68]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training, 2022

[69]

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Simlm: Pre-training with representation bottleneck for dense passage retrieval, 2023a

[70]

Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models, 2023b

[71]

Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

[72]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018

[73]

Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016

[74]

Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015

[75]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training, 2024

[76]

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. arXiv e-prints

[77]

MS MARCO: A Human Generated MAchine Reading COmprehension Dataset, 2018

[78]

FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, 2022

[79]

BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, 2019

[80]

Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning

[81]

Condenser: a Pre-training Architecture for Dense Retrieval, 2021

[82]

SimCSE: Simple Contrastive Learning of Sentence Embeddings, 2022

[83]

Cramming: Training a Language Model on a Single GPU in One Day, 2022

[84]

Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models, 2023

[85]

Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents, 2024

[86]

Dense Passage Retrieval for Open-Domain Question Answering, 2020

[87]

Towards General Text Embeddings with Multi-stage Contrastive Learning, 2023

[88]

RoBERTa: A Robustly Optimized BERT Pretraining Approach, 2019
