
arxiv: 2412.13663 · v2 · pith:3S5Y6A5Lnew · submitted 2024-12-18 · 💻 cs.CL · cs.AI

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Pith reviewed 2026-05-20 17:36 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords: encoder-only transformer, bidirectional encoder, long context, model efficiency, retrieval, classification, ModernBERT

The pith

ModernBERT updates bidirectional encoders with modern optimizations, 2 trillion tokens of training, and native 8192-token context to deliver higher accuracy together with faster, more memory-efficient inference than prior encoder-only models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ModernBERT as an encoder-only transformer that incorporates recent architectural and training improvements into the BERT lineage. The models are trained on 2 trillion tokens and support sequences up to 8192 tokens without truncation. They reach state-of-the-art scores on many classification benchmarks and on both single-vector and multi-vector retrieval across domains that include code. At the same time, they run faster and use less memory than earlier encoders on common GPUs. A reader would care because encoder-only models remain the backbone of many production retrieval and classification systems, so an upgrade that improves both quality and cost directly affects deployed pipelines.

Core claim

ModernBERT brings modern model optimizations to encoder-only transformers. After training on 2 trillion tokens with a native 8192-token sequence length, it produces state-of-the-art results on diverse classification tasks and on single- and multi-vector retrieval across multiple domains, including code. The paper also claims it is the most speed- and memory-efficient encoder and that it is designed for inference on common GPUs.

What carries the argument

The ModernBERT encoder, which integrates contemporary transformer optimizations with long native context support and large-scale pretraining to improve the performance-efficiency frontier of bidirectional models.

If this is right

  • Encoder-only pipelines for retrieval and classification can adopt longer contexts without custom truncation or chunking.
  • Production systems gain both higher accuracy and lower inference latency on standard hardware.
  • Multi-vector retrieval benefits from the same architecture that improves single-vector scores.
  • Code-domain retrieval tasks see measurable lifts without switching to decoder-only models.
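
To make the single-vector versus multi-vector distinction concrete, here is a minimal sketch in plain Python. It assumes cosine similarity for the single-vector case and ColBERT-style late interaction (the sum, over query tokens, of each token's best match in the document) for the multi-vector case. The page does not say which scoring functions the paper uses, so both choices are assumptions, and the toy vectors are hand-made rather than model outputs.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """One pooled embedding per text: the score is a single similarity."""
    return cosine(query_vec, doc_vec)


def multi_vector_score(query_vecs: List[Vector], doc_vecs: List[Vector]) -> float:
    """Late interaction (MaxSim): for each query token embedding, take its
    best-matching document token embedding, then sum those maxima."""
    return sum(max(cosine(q, d) for d in doc_vecs) for q in query_vecs)


# Toy embeddings for illustration only (not produced by any model).
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_tokens = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]

print(single_vector_score([1.0, 0.0], [2.0, 0.0]))  # 1.0
print(multi_vector_score(query_tokens, doc_tokens))  # 2.0
```

The multi-vector score keeps one embedding per token, so it costs more storage than a pooled vector but can match individual query terms against different parts of a document.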

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Older BERT-style models may be replaceable in many pipelines by a single updated encoder rather than by larger decoder-only alternatives.
  • The efficiency improvements could allow higher throughput or smaller hardware footprints for the same workload.
  • Native long-context handling may reduce the need for separate retrieval-augmented or chunking strategies in downstream applications.
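
As a rough illustration of the chunking point, the sketch below counts how many overlapping windows a document needs under a given context limit. The 8192-token limit comes from the paper; the 512-token limit stands in for the original BERT context length, and the document length, overlap value, and `num_windows` helper are hypothetical choices made for this example.

```python
import math


def num_windows(n_tokens: int, max_len: int, overlap: int = 0) -> int:
    """Number of sliding windows of size max_len needed to cover n_tokens,
    where consecutive windows share `overlap` tokens."""
    if not 0 <= overlap < max_len:
        raise ValueError("overlap must satisfy 0 <= overlap < max_len")
    if n_tokens <= 0:
        return 0
    if n_tokens <= max_len:
        return 1
    stride = max_len - overlap
    # One initial window, plus enough strides to cover the remainder.
    return 1 + math.ceil((n_tokens - max_len) / stride)


doc_len = 6000  # hypothetical document length in tokens

print(num_windows(doc_len, max_len=512, overlap=64))   # 14
print(num_windows(doc_len, max_len=8192, overlap=64))  # 1
```

With the shorter window, every document of this size turns into many encoder calls plus a step to merge the per-chunk outputs; with the longer window it fits in a single pass.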

Load-bearing premise

The combination of modern optimizations and training on 2 trillion tokens at native 8192 length is what produces the reported gains in accuracy and efficiency over previous encoder-only models.

What would settle it

The claim would be undercut if benchmarking ModernBERT against recent encoder baselines on the same classification and retrieval suites, at comparable model sizes, showed no improvement in accuracy, speed, or memory use.
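
One way to phrase this test is as a Pareto-dominance check: a model is a Pareto improvement over a baseline if it is no worse on every axis and strictly better on at least one. The sketch below assumes three axes (accuracy, latency, memory). The `ModelResult` type and every number in it are placeholders invented for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelResult:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_gb: float   # lower is better


def dominates(a: ModelResult, b: ModelResult) -> bool:
    """True if `a` is no worse than `b` on every axis and strictly better on one."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.memory_gb <= b.memory_gb
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.memory_gb < b.memory_gb
    )
    return no_worse and strictly_better


# Placeholder values, chosen only to exercise the function.
candidate = ModelResult("candidate", accuracy=0.90, latency_ms=10.0, memory_gb=2.0)
baseline = ModelResult("baseline", accuracy=0.88, latency_ms=12.0, memory_gb=2.0)

print(dominates(candidate, baseline))
print(dominates(baseline, candidate))
```

If the candidate is better on accuracy but worse on latency, neither model dominates, and the comparison becomes a trade-off rather than a Pareto improvement.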

Original abstract

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces ModernBERT, a modernized encoder-only transformer that incorporates recent architectural optimizations (e.g., attention variants, normalization, positional encodings). It is pretrained on 2 trillion tokens with a native maximum sequence length of 8192. The central claims are state-of-the-art results across diverse classification tasks and both single- and multi-vector retrieval (including code domains), together with superior speed and memory efficiency for inference on common GPUs relative to prior encoder-only models.

Significance. If the empirical results and efficiency measurements hold under scrutiny, this would constitute a meaningful Pareto improvement for encoder-only models, which remain central to production retrieval and classification pipelines. The native long-context support combined with claimed efficiency gains could enable broader adoption for longer-document tasks without post-processing or truncation.

major comments (2)
  1. [§4] §4 (Experiments) and associated tables: the manuscript reports SOTA rankings but supplies no ablations that isolate the contribution of each modern optimization from the effects of scale (2T tokens, native 8192 length). This is load-bearing for the central attribution claim; without such controls it is impossible to determine whether the reported gains arise from the described architecture or from training compute and data volume alone.
  2. [§4.1] §4.1 and evaluation tables: no error bars, variance estimates, or explicit confirmation that benchmark protocols (e.g., retrieval metrics, task selection) exactly match those used in the cited prior encoder baselines. Small protocol differences can alter SOTA status and therefore undermine the Pareto-improvement assertion.
minor comments (1)
  1. [Figures 3-5] Figure captions and axis labels in the efficiency plots could more explicitly state the hardware (GPU model, batch size) and measurement methodology to allow direct reproduction.
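
The kind of measurement record the minor comment asks for could look like the sketch below: a small timing harness that stores the hardware label and batch size next to the timings. This is a generic illustration, not the paper's benchmarking code. The `benchmark` helper, its parameters, and the stand-in workload are all hypothetical, and the device label is supplied by the caller rather than detected.

```python
import platform
import statistics
import time
from typing import Callable, Dict, List


def benchmark(
    fn: Callable[[], object],
    *,
    warmup: int = 3,
    repeats: int = 10,
    batch_size: int = 1,
    device: str = "unspecified",
) -> Dict[str, object]:
    """Time `fn` and return timings together with the run's metadata."""
    # Warm-up runs are discarded so one-off setup costs do not skew results.
    for _ in range(warmup):
        fn()

    timings_ms: List[float] = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        # Note: for GPU work, the device must be synchronized before reading
        # the clock (e.g. torch.cuda.synchronize() in PyTorch); omitted here
        # because this stand-in workload runs on the CPU.
        timings_ms.append((time.perf_counter() - start) * 1000.0)

    return {
        "device": device,
        "batch_size": batch_size,
        "python": platform.python_version(),
        "warmup": warmup,
        "repeats": repeats,
        "median_ms": statistics.median(timings_ms),
        "min_ms": min(timings_ms),
        "max_ms": max(timings_ms),
    }


# Stand-in workload; a real run would call the encoder's forward pass.
report = benchmark(lambda: sum(i * i for i in range(10_000)),
                   batch_size=8, device="cpu")
print(report)
```

Reporting the median alongside the minimum and maximum gives a reader some sense of run-to-run spread without assuming a particular distribution.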

Simulated Author's Rebuttal

2 responses · 0 unresolved

We are grateful to the referee for the insightful review of our manuscript introducing ModernBERT. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to make.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments) and associated tables: the manuscript reports SOTA rankings but supplies no ablations that isolate the contribution of each modern optimization from the effects of scale (2T tokens, native 8192 length). This is load-bearing for the central attribution claim; without such controls it is impossible to determine whether the reported gains arise from the described architecture or from training compute and data volume alone.

    Authors: We agree that ablations isolating the contributions of individual optimizations would strengthen attribution of the gains. However, the computational cost of training multiple full-scale models on 2 trillion tokens renders comprehensive ablations infeasible. In the revised manuscript we will add a dedicated discussion of this limitation, include supporting evidence from smaller-scale ablation experiments on key components (such as attention variants and normalization), and reference prior literature demonstrating the benefits of these optimizations. We will also clarify that the reported Pareto improvements are measured end-to-end against prior encoder models.

    Revision: partial

  2. Referee: [§4.1] §4.1 and evaluation tables: no error bars, variance estimates, or explicit confirmation that benchmark protocols (e.g., retrieval metrics, task selection) exactly match those used in the cited prior encoder baselines. Small protocol differences can alter SOTA status and therefore undermine the Pareto-improvement assertion.

    Authors: We thank the referee for highlighting the need for rigorous protocol documentation and statistical reporting. In the revised version we will expand the experimental setup section to explicitly confirm that all benchmark protocols, metrics, and task selections match those used in the cited baselines, with direct references to the original evaluation papers or leaderboards. Where multiple runs were performed we will report error bars; for the primary large-scale results we will note the standard practice of single-run evaluation at this scale while discussing observed metric stability.

    Revision: yes
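
For the error bars discussed in the second response, a minimal sketch of the reporting is shown below: mean and sample standard deviation over repeated runs, plus a crude check of whether a gap between two models is larger than the run-to-run spread. The threshold `k`, both helper functions, and all scores are assumptions made for illustration; none of the numbers come from the paper, and the check is a heuristic, not a significance test.

```python
import statistics
from typing import Sequence, Tuple


def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation over per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)


def gap_exceeds_noise(a: Sequence[float], b: Sequence[float], k: float = 2.0) -> bool:
    """Heuristic: is the gap between means larger than k times the larger
    of the two standard deviations? Not a formal hypothesis test."""
    mean_a, std_a = mean_and_std(a)
    mean_b, std_b = mean_and_std(b)
    return abs(mean_a - mean_b) > k * max(std_a, std_b)


# Made-up per-seed scores for two hypothetical models.
model_a = [80.0, 81.0, 82.0]
model_b = [70.0, 71.0, 72.0]

mean_a, std_a = mean_and_std(model_a)
print(f"model_a: {mean_a:.1f} +/- {std_a:.1f}")
print(gap_exceeds_noise(model_a, model_b))
```

A single-run result has no variance estimate at all, which is why the function refuses inputs with fewer than two scores.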

Circularity Check

0 steps flagged

No significant circularity; claims rest on empirical pretraining and evaluation

Full rationale

The paper introduces ModernBERT via architectural optimizations and large-scale training (2T tokens at native 8192 length), then reports downstream results on classification and retrieval benchmarks. No derivation chain, uniqueness theorem, or fitted parameter is presented that reduces by construction to the inputs; performance claims are externally falsifiable via the stated evaluations rather than self-referential. This is the standard non-circular outcome for an empirical model paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical model-training paper. No mathematical axioms, free parameters, or invented entities are described in the abstract; standard transformer components and large-scale pretraining are assumed from prior literature.



Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Who Owns This Agent? Tracing AI Agents Back to Their Owners

    cs.CR 2026-05 unverdicted novelty 8.0

    A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.

  2. Is She Even Relevant? When BERT Ignores Explicit Gender Cues

    cs.CL 2026-05 conditional novelty 7.0

    A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.

  3. ProteinJEPA: Latent prediction complements protein language models

    cs.LG 2026-05 unverdicted novelty 7.0

    Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.

  4. HyperTransport: Amortized Conditioning of T2I Generative Models

    cs.LG 2026-05 unverdicted novelty 7.0

    HyperTransport amortizes activation steering for T2I models via a hypernetwork that predicts intervention parameters from CLIP embeddings, delivering 3600-7000x speedup and matching per-concept baselines on 167 unseen...

  5. NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus

    cs.CL 2026-04 unverdicted novelty 7.0

    NorBERTo, a ModernBERT encoder trained on the largest open Portuguese corpus of 331B tokens, reports top encoder results on several PLUE and ASSIN 2 tasks.

  6. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  7. RetroMotion: Retrocausal Motion Forecasting Models are Instructable

    cs.CV 2025-05 unverdicted novelty 7.0

    Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Se...

  8. HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools

    cs.CL 2026-05 unverdicted novelty 6.0

    HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up...

  9. GLiGuard: Schema-Conditioned Classification for LLM Safeguard

    cs.CL 2026-05 unverdicted novelty 6.0

    GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.

  10. Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation

    cs.LG 2026-04 unverdicted novelty 6.0

    Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.

  11. Rag Performance Prediction for Question Answering

    cs.CL 2026-04 unverdicted novelty 6.0

    A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.

  12. Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution

    cs.CL 2025-12 unverdicted novelty 6.0

    Explanation biases in feature attribution methods are systematic products of lexical and positional preferences, with observed trade-offs across models and higher bias in anomalous explanations.

  13. Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

    cs.CV 2025-08 unverdicted novelty 6.0

    PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.

  14. Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records

    cs.LG 2025-07 unverdicted novelty 6.0

    AACE is an annotation-assisted method for causal policy learning from multimodal EHRs that outperforms risk-based and representation-based baselines on synthetic, semi-synthetic, and real datasets.

  15. Should We Still Pretrain Encoders with Masked Language Modeling?

    cs.CL 2025-07 accept novelty 6.0

    Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...

  16. Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

    cs.CL 2026-05 conditional novelty 5.0

    Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.

  17. Efficient Listwise Reranking with Compressed Document Representations

    cs.IR 2026-04 unverdicted novelty 5.0

    RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.

  18. Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding

    cs.CL 2026-04 unverdicted novelty 5.0

    Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.

  19. m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

    cs.CL 2026-05 unverdicted novelty 4.0

    m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.

  20. Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks

    cs.CR 2026-05 unverdicted novelty 4.0

    A two-stage GNN-plus-ModernBERT framework detects social engineering attacks in email networks by first filtering structural anomalies at 86% recall and then verifying content to reach over 92% precision on augmented ...

  21. Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

    cs.CL 2026-03 unverdicted novelty 4.0

    Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.
