Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference
Pith reviewed 2026-05-20 17:36 UTC · model grok-4.3
The pith
ModernBERT updates bidirectional encoders with modern optimizations, 2 trillion tokens of training, and native 8192-token context to deliver better accuracy plus faster and lighter inference than prior models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ModernBERT brings modern model optimizations to encoder-only transformers and, after training on 2 trillion tokens with a native 8192 sequence length, produces state-of-the-art results on diverse classification tasks plus single- and multi-vector retrieval on multiple domains including code, while also being the fastest and most memory-efficient encoder designed for inference on common GPUs.
What carries the argument
The ModernBERT encoder, which integrates contemporary transformer optimizations with long native context support and large-scale pretraining to improve the performance-efficiency frontier of bidirectional models.
If this is right
- Encoder-only pipelines for retrieval and classification can adopt longer contexts without custom truncation or chunking.
- Production systems gain both higher accuracy and lower inference latency on standard hardware.
- Multi-vector retrieval benefits from the same architecture that improves single-vector scores.
- Code-domain retrieval tasks see measurable lifts without switching to decoder-only models.
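To make the first point above concrete, here is a minimal sketch of the chunking that a shorter context window forces. The 512-token window is an assumption standing in for an older BERT-style encoder (this page does not state it); 8192 is the native length reported above. The function name `num_chunks` and its `stride` parameter are hypothetical, chosen only for illustration.

```python
import math
from typing import Optional


def num_chunks(doc_tokens: int, window: int, stride: Optional[int] = None) -> int:
    """Number of windows needed to cover a document of doc_tokens tokens.

    stride is how far each window advances; it defaults to window
    (no overlap). Both names are illustrative, not taken from the paper.
    """
    if doc_tokens <= 0:
        return 0
    stride = window if stride is None else stride
    if not 0 < stride <= window:
        raise ValueError("stride must be in (0, window]")
    if doc_tokens <= window:
        return 1
    # The first window covers `window` tokens; each extra window adds `stride`.
    return 1 + math.ceil((doc_tokens - window) / stride)


if __name__ == "__main__":
    doc = 6000  # hypothetical document length in tokens
    print(num_chunks(doc, 512))        # 12 non-overlapping chunks
    print(num_chunks(doc, 512, 256))   # 23 chunks with 50% overlap
    print(num_chunks(doc, 8192))       # 1: fits in a single pass
```

Under these assumptions, a 6000-token document needs a dozen or more separate encoder passes plus some rule for merging their outputs, whereas an 8192-token window handles it in one.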
Where Pith is reading between the lines
- Older BERT-style models may be replaceable in many pipelines by a single updated encoder rather than by larger decoder-only alternatives.
- The efficiency improvements could allow higher throughput or smaller hardware footprints for the same workload.
- Native long-context handling may reduce the need for separate retrieval-augmented or chunking strategies in downstream applications.
Load-bearing premise
The combination of modern optimizations and training on 2 trillion tokens at native 8192 length is what produces the reported gains in accuracy and efficiency over previous encoder-only models.
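One way to probe this premise is a leave-one-out ablation grid: train the full configuration, then retrain with each optimization switched off in turn while holding data and compute fixed. The sketch below only enumerates such configurations. The component names are hypothetical labels loosely based on the categories the referee report mentions (attention variants, normalization, positional encodings); they are not the paper's actual settings.

```python
from typing import Dict, List


def leave_one_out(full: Dict[str, bool]) -> List[Dict[str, bool]]:
    """Return the full configuration followed by one variant per component,
    each with exactly that component disabled."""
    configs = [dict(full)]
    for name in full:
        variant = dict(full)
        variant[name] = False
        configs.append(variant)
    return configs


if __name__ == "__main__":
    # Hypothetical toggles; the real model's options may differ.
    full_config = {
        "modern_attention": True,
        "modern_normalization": True,
        "modern_positional_encoding": True,
    }
    for cfg in leave_one_out(full_config):
        disabled = [k for k, v in cfg.items() if not v] or ["none"]
        print("disabled:", ", ".join(disabled))
```

The grid grows linearly with the number of components, which is why such ablations are usually run at a smaller scale than the headline model.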
What would settle it
Benchmarking ModernBERT against recent encoder baselines on the same classification and retrieval suites, at comparable model sizes, and finding no improvement in accuracy, speed, or memory use would refute the claim.
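The test described above amounts to a Pareto-dominance check across accuracy, latency, and memory. A minimal sketch follows, assuming each system is summarized by one number per axis. The `Result` class, the `dominates` function, and all values are placeholders for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Result:
    name: str
    accuracy: float    # higher is better
    latency_ms: float  # lower is better
    memory_gb: float   # lower is better


def dominates(a: Result, b: Result) -> bool:
    """True if a is at least as good as b on every axis and strictly
    better on at least one (a Pareto improvement over b)."""
    no_worse = (
        a.accuracy >= b.accuracy
        and a.latency_ms <= b.latency_ms
        and a.memory_gb <= b.memory_gb
    )
    strictly_better = (
        a.accuracy > b.accuracy
        or a.latency_ms < b.latency_ms
        or a.memory_gb < b.memory_gb
    )
    return no_worse and strictly_better


if __name__ == "__main__":
    # Placeholder numbers only; substitute real benchmark output.
    candidate = Result("candidate", accuracy=0.90, latency_ms=10.0, memory_gb=2.0)
    baseline = Result("baseline", accuracy=0.85, latency_ms=12.0, memory_gb=2.0)
    print(dominates(candidate, baseline))  # True
    print(dominates(baseline, candidate))  # False
```

If the candidate fails to dominate any baseline of comparable size, the Pareto-improvement claim does not hold for that comparison.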
Original abstract
Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ModernBERT, a modernized encoder-only transformer that incorporates recent architectural optimizations (e.g., attention variants, normalization, positional encodings). It is pretrained on 2 trillion tokens with a native maximum sequence length of 8192. The central claims are state-of-the-art results across diverse classification tasks and both single- and multi-vector retrieval (including code domains), together with superior speed and memory efficiency for inference on common GPUs relative to prior encoder-only models.
Significance. If the empirical results and efficiency measurements hold under scrutiny, this would constitute a meaningful Pareto improvement for encoder-only models, which remain central to production retrieval and classification pipelines. The native long-context support combined with claimed efficiency gains could enable broader adoption for longer-document tasks without post-processing or truncation.
Major comments (2)
- [§4] §4 (Experiments) and associated tables: the manuscript reports SOTA rankings but supplies no ablations that isolate the contribution of each modern optimization from the effects of scale (2T tokens, native 8192 length). This is load-bearing for the central attribution claim; without such controls it is impossible to determine whether the reported gains arise from the described architecture or from training compute and data volume alone.
- [§4.1] §4.1 and evaluation tables: no error bars, variance estimates, or explicit confirmation that benchmark protocols (e.g., retrieval metrics, task selection) exactly match those used in the cited prior encoder baselines. Small protocol differences can alter SOTA status and therefore undermine the Pareto-improvement assertion.
Minor comments (1)
- [Figures 3-5] Figure captions and axis labels in the efficiency plots could more explicitly state the hardware (GPU model, batch size) and measurement methodology to allow direct reproduction.
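As an illustration of the reporting this comment asks for, the sketch below times a callable and returns the measurement together with the settings needed to reproduce it. It uses only the Python standard library. The `benchmark` function and its field names are hypothetical, and the `hardware` label is free text supplied by the caller; nothing here detects the GPU.

```python
import statistics
import time
from typing import Any, Callable, Dict


def benchmark(
    fn: Callable[[], Any],
    *,
    hardware: str,
    batch_size: int,
    seq_len: int,
    warmup: int = 3,
    repeats: int = 10,
) -> Dict[str, Any]:
    """Time fn() and return latency statistics alongside the setup metadata.

    On a GPU, fn must block until the device has finished its work;
    otherwise only the kernel launch is timed.
    """
    for _ in range(warmup):  # untimed runs to absorb one-off startup costs
        fn()
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {
        "hardware": hardware,
        "batch_size": batch_size,
        "seq_len": seq_len,
        "repeats": repeats,
        "median_s": statistics.median(timings),
        "min_s": min(timings),
        "max_s": max(timings),
    }


if __name__ == "__main__":
    def fake_forward() -> int:
        # Stand-in workload; replace with a real model forward pass.
        return sum(i * i for i in range(10_000))

    print(benchmark(fake_forward, hardware="placeholder-cpu", batch_size=8, seq_len=512))
```

Printing the metadata next to every number keeps a figure caption and its underlying measurement from drifting apart.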
Simulated Author's Rebuttal
We are grateful to the referee for the insightful review of our manuscript introducing ModernBERT. Below, we provide point-by-point responses to the major comments and indicate the revisions we plan to make.
-
Referee: [§4] §4 (Experiments) and associated tables: the manuscript reports SOTA rankings but supplies no ablations that isolate the contribution of each modern optimization from the effects of scale (2T tokens, native 8192 length). This is load-bearing for the central attribution claim; without such controls it is impossible to determine whether the reported gains arise from the described architecture or from training compute and data volume alone.
Authors: We agree that ablations isolating the contributions of individual optimizations would strengthen attribution of the gains. However, the computational cost of training multiple full-scale models on 2 trillion tokens renders comprehensive ablations infeasible. In the revised manuscript we will add a dedicated discussion of this limitation, include supporting evidence from smaller-scale ablation experiments on key components (such as attention variants and normalization), and reference prior literature demonstrating the benefits of these optimizations. We will also clarify that the reported Pareto improvements are measured end-to-end against prior encoder models.
Revision: partial
-
Referee: [§4.1] §4.1 and evaluation tables: no error bars, variance estimates, or explicit confirmation that benchmark protocols (e.g., retrieval metrics, task selection) exactly match those used in the cited prior encoder baselines. Small protocol differences can alter SOTA status and therefore undermine the Pareto-improvement assertion.
Authors: We thank the referee for highlighting the need for rigorous protocol documentation and statistical reporting. In the revised version we will expand the experimental setup section to explicitly confirm that all benchmark protocols, metrics, and task selections match those used in the cited baselines, with direct references to the original evaluation papers or leaderboards. Where multiple runs were performed we will report error bars; for the primary large-scale results we will note the standard practice of single-run evaluation at this scale while discussing observed metric stability.
Revision: yes
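The error-bar reporting discussed in this exchange can be sketched as a mean and sample standard deviation over per-seed scores, with single-run results flagged explicitly. The helper names and the scores below are placeholders, not values from the paper.

```python
import statistics
from typing import Optional, Sequence, Tuple


def summarize(scores: Sequence[float]) -> Tuple[float, Optional[float]]:
    """Return (mean, sample standard deviation) for per-seed scores.

    The standard deviation is None for a single run, because spread
    cannot be estimated from one number.
    """
    if not scores:
        raise ValueError("need at least one score")
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else None
    return mean, std


def format_result(scores: Sequence[float]) -> str:
    """Render scores as 'mean +/- std (n=...)', or mark a single run."""
    mean, std = summarize(scores)
    if std is None:
        return f"{mean:.3f} (single run, no error bar)"
    return f"{mean:.3f} +/- {std:.3f} (n={len(scores)})"


if __name__ == "__main__":
    # Placeholder scores from three hypothetical seeds.
    print(format_result([0.80, 0.82, 0.81]))  # 0.810 +/- 0.010 (n=3)
    print(format_result([0.50]))              # 0.500 (single run, no error bar)
```

Marking single-run numbers in the table itself lets a reader tell at a glance which rankings could flip under seed variance.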
Circularity Check
No significant circularity; claims rest on empirical pretraining and evaluation
The paper introduces ModernBERT via architectural optimizations and large-scale training (2T tokens at native 8192 length), then reports downstream results on classification and retrieval benchmarks. No derivation chain, uniqueness theorem, or fitted parameter is presented that reduces by construction to the inputs; performance claims are externally falsifiable via the stated evaluations rather than self-referential. This is the standard non-circular outcome for an empirical model paper.
Forward citations
Cited by 21 Pith papers
-
Who Owns This Agent? Tracing AI Agents Back to Their Owners
A canary injection protocol for linking observed AI agent behavior to the responsible account at the hosting vendor, with robust variants for adversarial filtering.
-
Is She Even Relevant? When BERT Ignores Explicit Gender Cues
A Dutch BERT model encodes gender linearly by epoch 20 but does not dynamically update its representations when explicit female cues contradict learned stereotypical associations in short sentence templates.
-
ProteinJEPA: Latent prediction complements protein language models
Masked-position MLM plus JEPA latent prediction outperforms MLM-only pretraining on 10-11 of 16 downstream tasks for 35M-150M protein models while JEPA alone fails.
-
HyperTransport: Amortized Conditioning of T2I Generative Models
HyperTransport amortizes activation steering for T2I models via a hypernetwork that predicts intervention parameters from CLIP embeddings, delivering 3600-7000x speedup and matching per-concept baselines on 167 unseen...
-
NorBERTo: A ModernBERT Model Trained for Portuguese with 331 Billion Tokens Corpus
NorBERTo, a ModernBERT encoder trained on the largest open Portuguese corpus of 331B tokens, reports top encoder results on several PLUE and ASSIN 2 tasks.
-
Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings
Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...
-
RetroMotion: Retrocausal Motion Forecasting Models are Instructable
Retrocausal transformer decomposes multi-agent motion forecasts into marginals and pairwise joints, models uncertainty with compressed exponentials, achieves strong Waymo results, generalizes to Argoverse 2 and V2X-Se...
-
HyDRA: Hybrid Dynamic Routing Architecture for Heterogeneous LLM Pools
HyDRA routes queries to cost-effective LLMs by predicting multi-dimensional capability requirements with a multi-head encoder and applying shortfall matching against configuration-defined model profiles, delivering up...
-
GLiGuard: Schema-Conditioned Classification for LLM Safeguard
GLiGuard is a compact schema-conditioned bidirectional encoder that matches 7B-27B guard models on safety benchmarks while delivering up to 16x higher throughput and 17x lower latency.
-
Do Synthetic Trajectories Reflect Real Reward Hacking? A Systematic Study on Monitoring In-the-Wild Hacking in Code Generation
Synthetic reward hacking data does not capture natural hacking behaviors in code generation RL, causing monitors trained on it to generalize poorly compared to those trained on in-the-wild trajectories.
-
Rag Performance Prediction for Question Answering
A novel supervised predictor modeling semantic relationships among question, retrieved passages, and generated answer best forecasts when RAG improves QA performance.
-
Explanation Bias is a Product: Revealing the Hidden Lexical and Position Preferences in Post-Hoc Feature Attribution
Explanation biases in feature attribution methods are systematic products of lexical and positional preferences, with observed trade-offs across models and higher bias in anomalous explanations.
-
Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
PMSR progressively constructs structured reasoning trajectories with dual-scope queries and compositional reasoning to improve knowledge acquisition and answer accuracy in knowledge-intensive VQA.
-
Annotation-Assisted Learning of Treatment Policies From Multimodal Electronic Health Records
AACE is an annotation-assisted method for causal policy learning from multimodal EHRs that outperforms risk-based and representation-based baselines on synthetic, semi-synthetic, and real datasets.
-
Should We Still Pretrain Encoders with Masked Language Modeling?
Controlled ablations of 38 models find MLM superior to CLM on representation benchmarks while CLM offers better data efficiency and stability; a biphasic CLM-then-MLM schedule is optimal under fixed compute and improv...
-
Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning
Fine-tuned transformers with multi-task learning recover substantial wording-derived signal for item difficulty at small sample sizes typical in applied testing.
-
Efficient Listwise Reranking with Compressed Document Representations
RRK compresses documents to multi-token embeddings for efficient listwise reranking, enabling an 8B model to achieve 3x-18x speedups over smaller models with comparable or better effectiveness.
-
Commonsense Knowledge with Negation: A Resource to Enhance Negation Understanding
Augmenting commonsense knowledge corpora with negation produces over 2M new triples that benefit LLM negation understanding when used for pre-training.
-
m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder
m3BERT uses a three-stage Matryoshka pretraining approach on a bidirectional encoder to support variable embedding sizes while outperforming prior models on large-scale retrieval tasks.
-
Filter-then-Verify: A Multiphase GNN and ModernBERT Framework for Social Engineering Detection in Email Networks
A two-stage GNN-plus-ModernBERT framework detects social engineering attacks in email networks by first filtering structural anomalies at 86% recall and then verifying content to reach over 92% precision on augmented ...
-
Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
Zero-shot GPT-OSS detects depression from 1,108 primary care encounter transcripts with AUPRC 0.51 and AUROC 0.77, with meaningful signals in the first 128 patient tokens and added value from dyadic mirroring.