
arxiv: 2402.01613 · v2 · pith:YENANHR4new · submitted 2024-02-02 · 💻 cs.CL · cs.AI

Nomic Embed: Training a Reproducible Long Context Text Embedder

Pith reviewed 2026-05-21 14:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords text embeddings · long context · reproducibility · open source · MTEB benchmark · LoCo benchmark · contrastive learning

The pith

A fully open 8192-context text embedder outperforms OpenAI Ada-002 and text-embedding-3-small on MTEB and LoCo benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper details the training of nomic-embed-text-v1 as an English text embedding model that handles up to 8192 tokens. It claims this is the first model released with full open-source weights, data, and replication code that surpasses the cited OpenAI embedders on short-context and long-context benchmarks. A sympathetic reader cares because the release removes reliance on closed APIs and hidden training details for semantic search and retrieval tasks. The work focuses on making the entire process verifiable and extensible by the community.

Core claim

Nomic-embed-text-v1 is the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark, with the full curated training data and code released for exact replication.

What carries the argument

The contrastive training pipeline on curated data that produces embeddings supporting 8192-token contexts while enabling full replication.
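
To make the idea of a contrastive objective concrete, here is a minimal sketch of one common choice, InfoNCE with in-batch negatives. This page does not state the paper's exact loss, so the objective, the temperature of 0.05, and the function names `info_nce_loss`, `l2_normalize`, and `dot` are all assumptions made for illustration, not the authors' implementation.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(vector):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(vector, vector))
    return [x / norm for x in vector]


def info_nce_loss(queries, documents, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    queries[i] and documents[i] are a positive pair; every other
    documents[j] in the batch acts as a negative for queries[i].
    The temperature value is an illustrative assumption.
    """
    q = [l2_normalize(x) for x in queries]
    d = [l2_normalize(x) for x in documents]
    total = 0.0
    for i, query in enumerate(q):
        logits = [dot(query, doc) / temperature for doc in d]
        # Numerically stable log-sum-exp over the whole batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(z - peak) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


# Toy batch: the loss is small when positives line up and large when they are swapped.
aligned = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.1], [0.1, 1.0]])
swapped = info_nce_loss([[1.0, 0.0], [0.0, 1.0]], [[0.1, 1.0], [1.0, 0.1]])
print(aligned < swapped)
```

Minimizing this quantity pulls each query toward its paired document and pushes it away from the other documents in the batch, which is why batch composition and data curation matter for this family of methods.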

If this is right

  • Researchers can exactly replicate the model training and evaluation using the released code and data.
  • Long-context retrieval applications can shift from proprietary APIs to this open model without loss of benchmark performance.
  • The open data release allows direct inspection and further curation of the examples used for training.
  • Subsequent models can adopt the same pipeline to target even longer contexts or additional languages.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reproducibility standard set here could serve as a template for other embedding or retrieval models where data transparency matters.
  • Wider adoption might reduce dependence on closed embedding services in production systems handling documents longer than typical short-context limits.
  • The approach invites targeted experiments on how specific data curation choices affect long-context performance.

Load-bearing premise

The curated training data and benchmark splits contain no hidden overlaps or selection biases that inflate scores relative to the OpenAI baselines.

What would settle it

An independent replication of the full training process that yields scores below OpenAI Ada-002 on MTEB or LoCo, or discovery of substantial data leakage in the evaluation sets, would falsify the performance claim.
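
One way a reader could probe for the data leakage described above is an exact n-gram containment check between the released training corpus and a benchmark test set. The sketch below is illustrative only: the function names, the whitespace tokenizer, the n-gram length, and the 0.5 threshold are assumptions, not the procedure used by the authors or by the benchmarks.

```python
def word_ngrams(text, n=8):
    """Set of lowercase word n-grams in a text (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def flag_leaked_test_items(train_docs, test_docs, n=8, threshold=0.5):
    """Return indices of test documents whose n-grams mostly occur in the training data."""
    train_index = set()
    for doc in train_docs:
        train_index |= word_ngrams(doc, n)

    flagged = []
    for idx, doc in enumerate(test_docs):
        grams = word_ngrams(doc, n)
        if not grams:
            continue  # shorter than n tokens, so it cannot be scored at this n
        containment = len(grams & train_index) / len(grams)
        if containment >= threshold:
            flagged.append(idx)
    return flagged


train = ["the quick brown fox jumps over the lazy dog near the river bank"]
test = [
    "the quick brown fox jumps over the lazy dog near the river bank today",
    "embedding models map text to dense vectors for semantic search tasks",
]
flagged = flag_leaked_test_items(train, test, n=5, threshold=0.5)
print(flagged)  # [0]
```

Exact matching only catches verbatim reuse. Paraphrased or lightly edited overlap would need a fuzzier method, such as near-duplicate hashing or embedding similarity.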

Original abstract

This technical report describes the training of nomic-embed-text-v1, the first fully reproducible, open-source, open-weights, open-data, 8192 context length English text embedding model that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on the short-context MTEB benchmark and the long context LoCo benchmark. We release the training code and model weights under an Apache 2.0 license. In contrast with other open-source models, we release the full curated training data and code that allows for full replication of nomic-embed-text-v1. You can find code and data to replicate the model at https://github.com/nomic-ai/contrastors.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the training procedure for nomic-embed-text-v1, an 8192-token English text embedding model. It claims this is the first fully open-source, open-weights, open-data, reproducible model of this context length to outperform OpenAI text-embedding-ada-002 and text-embedding-3-small on both the short-context MTEB benchmark and the long-context LoCo benchmark. The authors release the training code and model weights under an Apache 2.0 license, along with the full curated training corpus, to enable exact replication.

Significance. If the reported performance gains are not artifacts of training-data contamination, the work provides a concrete, fully reproducible long-context embedding baseline that exceeds current closed-source models on standard suites. The explicit release of the complete training data and replication code is a notable strength that directly supports the reproducibility claim and lowers the barrier for subsequent research.

major comments (2)
  1. [Methods] Methods section: the manuscript states that the training corpus was deduplicated but does not describe an exhaustive overlap audit (exact n-gram, semantic similarity, or embedding-based) against every instance in the MTEB and LoCo test sets. Because the central claim is that nomic-embed-text-v1 outperforms the OpenAI baselines, any undetected leakage would directly undermine the performance comparison and the assertion of a genuine advance.
  2. [Experiments] Evaluation protocol: details on the precise MTEB and LoCo evaluation settings, including any post-training adjustments, prompt templates, or normalization steps, are not fully specified. Without these, independent replication of the headline numbers cannot be verified even with the released code and data.
minor comments (2)
  1. [Introduction] The abstract and introduction use the phrase 'first fully reproducible' without a clear comparison table listing prior open embedding models and their released artifacts; adding such a table would strengthen the novelty claim.
  2. [Results] Figure captions and axis labels in the benchmark plots should explicitly state the number of runs and error bars if any; current figures appear to report single-point estimates.
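
The second major comment asks for the pooling and normalization steps to be spelled out, because they change the final vectors and therefore the scores. As a hedged illustration of what such a specification covers, the sketch below applies masked mean pooling followed by L2 normalization. Which pooling nomic-embed-text-v1 actually uses is not stated on this page, so mean pooling is an assumption here, and all function names are hypothetical.

```python
import math


def mean_pool(token_vectors, attention_mask):
    """Average the token vectors whose mask entry is 1 (padding is skipped).

    Assumes at least one token is unmasked.
    """
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[k] for vec in kept) / len(kept) for k in range(dim)]


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def cosine_similarity(u, v):
    """Cosine similarity, computed as the dot product of unit-length copies."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))


# Three token vectors; the last one is padding and is masked out.
tokens = [[1.0, 3.0], [3.0, 1.0], [9.0, 9.0]]
mask = [1, 1, 0]
embedding = l2_normalize(mean_pool(tokens, mask))
print(embedding)
```

Once embeddings are unit length, dot product and cosine similarity give the same ranking, which is one reason an evaluation protocol should say whether normalization was applied.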

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and reproducibility of our work. We address each major point below and have revised the manuscript to incorporate additional details on our methods and evaluation protocol.

Point-by-point responses
  1. Referee: [Methods] Methods section: the manuscript states that the training corpus was deduplicated but does not describe an exhaustive overlap audit (exact n-gram, semantic similarity, or embedding-based) against every instance in the MTEB and LoCo test sets. Because the central claim is that nomic-embed-text-v1 outperforms the OpenAI baselines, any undetected leakage would directly undermine the performance comparison and the assertion of a genuine advance.

    Authors: We agree that an explicit description of overlap checks against the benchmark test sets would further strengthen confidence in the results. In the revised manuscript, we have expanded the Methods section to detail our deduplication procedure, which combined exact duplicate removal with MinHash-based near-duplicate detection at the n-gram level. We have also added a new paragraph reporting the results of a post-hoc overlap audit (using both n-gram and embedding-based similarity) between the released training corpus and the MTEB/LoCo test sets, confirming negligible leakage. Because the full curated training data is already publicly released, any reader can independently reproduce or extend this audit.
    revision: yes

  2. Referee: [Experiments] Evaluation protocol: details on the precise MTEB and LoCo evaluation settings, including any post-training adjustments, prompt templates, or normalization steps, are not fully specified. Without these, independent replication of the headline numbers cannot be verified even with the released code and data.

    Authors: We appreciate this feedback on documentation. Although the complete evaluation code is available in the public GitHub repository, we acknowledge that the manuscript should be self-contained. We have revised the Experiments section to explicitly list the prompt templates used for each task, the exact MTEB and LoCo evaluation configurations, any post-training normalization or pooling steps applied to the embeddings, and the absence of additional adjustments. These additions allow the reported numbers to be replicated from the paper alone while still pointing readers to the released code for full verification.
    revision: yes
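
The first response mentions MinHash-based near-duplicate detection at the n-gram level. For readers unfamiliar with the technique, the following is a minimal sketch of how MinHash estimates Jaccard similarity between two documents. It is not the authors' pipeline: the shingle size, the 128 hash functions, the use of seeded SHA-256 as the hash family, and all function names are assumptions chosen for illustration.

```python
import hashlib


def shingles(text, n=3):
    """Set of lowercase word n-grams (shingles) in a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def minhash_signature(items, num_hashes=128):
    """For each seeded hash function, keep the minimum hash value over all items."""
    signature = []
    for seed in range(num_hashes):
        signature.append(min(
            int.from_bytes(
                hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()[:8], "big"
            )
            for item in items
        ))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Fraction of signature positions that agree; approximates Jaccard similarity."""
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity of two sets, for comparison."""
    return len(a & b) / len(a | b)


doc_a = "open source text embedding models support long context retrieval tasks"
doc_b = "open source text embedding models support long context search tasks"
doc_c = "the weather was cold and rainy across most of the northern coast"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
near_duplicate_score = estimated_jaccard(sig_a, sig_b)
unrelated_score = estimated_jaccard(sig_a, sig_c)
print(near_duplicate_score > unrelated_score)
```

The appeal of the signature is that it has fixed length regardless of document size, so candidate near-duplicates can be found by comparing or bucketing signatures instead of full shingle sets.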

Circularity Check

0 steps flagged

No circularity: empirical training outcome on public benchmarks

Full rationale

The paper reports the training of nomic-embed-text-v1 via contrastive learning on a released curated corpus, followed by direct measurement of performance on the public MTEB and LoCo benchmarks. No equations, uniqueness theorems, or self-citations are invoked to derive the central performance claims; the reported outperformance is an observed empirical result rather than a quantity defined in terms of itself or forced by internal fitting. The derivation chain consists of standard model training steps whose outputs are independently verifiable on external test sets, with no reduction of predictions to fitted inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The performance claim rests on standard contrastive loss, a curated dataset whose construction details are not fully specified in the abstract, and benchmark protocols that assume no data contamination between training and test sets.

free parameters (2)
  • learning rate schedule and batch size
    Standard hyperparameters chosen during training to reach the reported benchmark numbers.
  • data curation filters and sampling ratios
    Choices made when assembling the training corpus that directly affect final embedding quality.
axioms (1)
  • domain assumption: MTEB and LoCo benchmarks provide unbiased estimates of embedding quality for downstream tasks
    Invoked when claiming superiority over OpenAI models.



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OpenIIR: An Open Simulation Platform for Information Retrieval Research

    cs.IR 2026-05 accept novelty 7.0

    OpenIIR provides a shared core and pluggable interface for running reproducible multi-agent simulations of information retrieval using LLM personas in four defined study archetypes.

  2. OpenIIR: An Open Simulation Platform for Information Retrieval Research

    cs.IR 2026-05 unverdicted novelty 7.0

    OpenIIR supplies a shared core, pluggable scenario interface, and four reference multi-agent IR simulation types that produce reproducible argument graphs, exposure logs, and fitness traces from LLM personas.

  3. XGRAG: A Graph-Native Framework for Explaining KG-based Retrieval-Augmented Generation

    cs.AI 2026-04 unverdicted novelty 7.0

    XGRAG uses graph perturbations to quantify component contributions in GraphRAG and achieves 14.81% better explanation quality than text-based baselines on QA datasets, with correlations to graph centrality.

  4. Participatory provenance as representational auditing for AI-mediated public consultation

    cs.AI 2026-04 unverdicted novelty 7.0

    Participatory provenance auditing of Canada's AI strategy consultation shows official AI summaries exclude 15-17% of participants more than random baselines, with 33-88% exclusion for dissent clusters.

  5. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  6. Coordinate Heterogeneity Governs Binary Quantization: From InfoNCE to Recall

    cs.LG 2026-05 unverdicted novelty 6.0

    Coordinate heterogeneity governs binary quantization performance via closed-form ranking fidelity expressions and a two-parameter scaling law, validated on 13 datasets across 6 embedding families.

  7. Layer-wise Representation Dynamics: An Empirical Investigation Across Embedders and Base LLMs

    cs.LG 2026-05 unverdicted novelty 6.0

    LRD framework with Frenet, NRS, and GFMI metrics shows layer-wise structure in 31 models provides usable signal for model selection and pruning on MTEB tasks.

  8. TrajPrism: A Multi-Task Benchmark for Language-Grounded Urban Trajectory Understanding

    cs.AI 2026-05 unverdicted novelty 6.0

    TrajPrism introduces a multi-task benchmark with 300K real-world urban trajectories and 2.1M language-grounded task instances across three cities, plus proof-of-concept models showing large gaps versus geometry-only b...

  9. Black-box model classification under the discriminative factorization

    cs.LG 2026-05 unverdicted novelty 6.0

    Discriminative factorization distinguishes high-quality query sets for black-box model classification, with chance-level error decaying exponentially in query budget and parameters predicting empirical decay rates on ...

  10. MLAIRE: Multilingual Language-Aware Information Retrieval Evaluation Protocal

    cs.IR 2026-05 unverdicted novelty 6.0

    MLAIRE is a protocol that evaluates multilingual retrievers on both semantic accuracy and query-language preference using parallel passages and new metrics like LPR and Lang-nDCG, showing that standard metrics hide di...

  11. Is Textual Similarity Invariant under Machine Translation? Evidence Based on the Political Manifesto Corpus

    cs.CL 2026-05 unverdicted novelty 6.0

    Machine translation preserves embedding similarity structure for ten languages but distorts it for four in the Manifesto Corpus, via a new non-inferiority testing framework.

  12. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  13. ASTRA: Mapping Art-Technology Institutions via Conceptual Axes, Text Embeddings, and Unsupervised Clustering

    cs.DL 2026-03 accept novelty 6.0

    ASTRA combines an eight-axis conceptual framework with text embeddings and unsupervised clustering to map and group 78 art-technology institutions into coherent thematic clusters.

  14. SitEmb-v1.5: Improved Context-Aware Dense Retrieval for Semantic Association and Long Story Comprehension

    cs.CL 2025-08 unverdicted novelty 6.0

    SitEmb-v1.5 uses a new training paradigm to produce context-situated embeddings for short chunks, outperforming larger models by over 10% on a curated book-plot retrieval benchmark.

  15. GraphRAG on Consumer Hardware: Benchmarking Local LLMs for Healthcare EHR Schema Retrieval

    cs.CL 2026-05 unverdicted novelty 5.0

    GraphRAG with 7-8B local LLMs on 8GB VRAM hardware builds knowledge graphs from EHR docs and answers queries, with Llama 3.1 creating the largest graph, Qwen 2.5 scoring highest on quality, and models below ~7B failin...

  16. Control Charts for Multi-agent Systems

    cs.MA 2026-05 unverdicted novelty 5.0

    Adaptive control charts can monitor learning multi-agent systems but are vulnerable to gradual adversarial defection, revealing a fundamental tradeoff between allowing agents to learn and maintaining security against ...

  17. Towards Platonic Representation for Table Reasoning: A Foundation for Permutation-Invariant Retrieval

    cs.AI 2026-04 unverdicted novelty 5.0

    Table representations must be permutation-invariant to preserve semantic structure, and a new header-aligned encoder moves toward this ideal while exposing fragility in existing LLM table embeddings.

  18. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

  19. Gaussian mixture models as a proxy for interacting language models

    cs.CL 2025-05 unverdicted novelty 5.0

    Interacting Gaussian mixture models with RAG-style updates are shown to mimic aspects of interacting LLMs and are used to prove lower bounds on polarization probability in the resulting Markov chain.

  20. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    cs.CL 2024-12 unverdicted novelty 5.0

    ModernBERT is a new bidirectional encoder model achieving SOTA performance on diverse classification and retrieval benchmarks while offering superior speed and memory efficiency for long-context inference.

  21. Health System Scale Semantic Search Across Unstructured Clinical Notes

    cs.IR 2026-04 unverdicted novelty 4.0

    A semantic search system was deployed at health-system scale across 166 million clinical notes, delivering sub-second latency, ~$4000 monthly cost, and 24-89% faster chart abstraction with maintained agreement.


  61. [69]

    Simlm: Pre-training with representation bottleneck for dense passage retrieval, 2023 a

    Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Simlm: Pre-training with representation bottleneck for dense passage retrieval, 2023 a

  62. [70]

    Improving text embeddings with large language models, 2023 b

    Liang Wang, Nan Yang, Xiaolong Huang, Linjun Yang, Rangan Majumder, and Furu Wei. Improving text embeddings with large language models, 2023 b

  63. [71]

    C-pack: Packaged resources to advance general chinese embedding, 2023

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-pack: Packaged resources to advance general chinese embedding, 2023

  64. [72]

    Cohen, Ruslan Salakhutdinov, and Christopher D

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W. Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA : A dataset for diverse, explainable multi-hop question answering. In Conference on Empirical Methods in Natural Language Processing ( EMNLP ) , 2018

  65. [73]

    Character-level convolutional networks for text classification, 2016

    Xiang Zhang, Junbo Zhao, and Yann LeCun. Character-level convolutional networks for text classification, 2016

  66. [74]

    Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015

    Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books, 2015

  67. [75]

    2024 , eprint=

    Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training , author=. 2024 , eprint=

  68. [76]

    Liu , title =

    Colin Raffel and Noam Shazeer and Adam Roberts and Katherine Lee and Sharan Narang and Michael Matena and Yanqi Zhou and Wei Li and Peter J. Liu , title =. arXiv e-prints , year =

  69. [77]

    2018 , eprint =

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset , author =. 2018 , eprint =

  70. [78]

    2022 , eprint =

    FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness , author =. 2022 , eprint =

  71. [79]

    2019 , eprint =

    BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding , author =. 2019 , eprint =

  72. [80]

    Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning , author =

    Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning , url =. Dynamically Scaled RoPE further increases performance of long context LLaMA with zero fine-tuning , author =

  73. [81]

    2021 , eprint =

    Condenser: a Pre-training Architecture for Dense Retrieval , author =. 2021 , eprint =

  74. [82]

    2022 , eprint =

    SimCSE: Simple Contrastive Learning of Sentence Embeddings , author =. 2022 , eprint =

  75. [83]

    2022 , eprint =

    Cramming: Training a Language Model on a Single GPU in One Day , author =. 2022 , eprint =

  76. [84]

    2023 , eprint =

    Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models , author =. 2023 , eprint =

  77. [85]

    2024 , eprint =

    Jina Embeddings 2: 8192-Token General-Purpose Text Embeddings for Long Documents , author =. 2024 , eprint =

  78. [86]

    2020 , eprint =

    Dense Passage Retrieval for Open-Domain Question Answering , author =. 2020 , eprint =

  79. [87]

    2023 , eprint =

    Towards General Text Embeddings with Multi-stage Contrastive Learning , author =. 2023 , eprint =

  80. [88]

    2019 , eprint =

    RoBERTa: A Robustly Optimized BERT Pretraining Approach , author =. 2019 , eprint =

Showing first 80 references.