Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

Faezeh Ghaderi; Mahdi Naser Moghadasi

arxiv: 2605.15413 · v1 · pith:UYBVMRVFnew · submitted 2026-05-14 · 💻 cs.LG

Pith reviewed 2026-05-19 16:06 UTC · model grok-4.3

classification 💻 cs.LG

keywords transformer models · scalability crisis · attention complexity · empirical analysis · performance walls · sequence lengths · model efficiency

The pith

Benchmark of 118 transformers shows performance walls where success drops to zero at 2048 tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests 118 transformer models from seven architectural categories on sequences from 128 to 2048 tokens to find where they hit limits. It reports that 88.1 percent handle 512 tokens but only 44.9 percent manage 1024, and none succeed at 2048. Compressed models run more efficiently per parameter than larger ones. These results question the assumption that transformers can scale indefinitely with more compute. The findings help guide which models are practical for real applications.

Core claim

The paper establishes that the quadratic attention complexity leads to measurable performance walls, with complete failure at 2048 tokens across all models and superior efficiency in compressed variants at 649.2 tokens per second per million parameters versus 12.5 for large models.
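To make the quadratic term in this claim concrete, the following Python sketch estimates the size of the attention score matrices for a single layer at the tested sequence lengths. The head count, batch size, and 4-byte element size are illustrative assumptions, not values reported by the paper, and the estimate ignores weights, other activations, and any fused-kernel optimizations.

```python
# Illustrative sketch: memory for one layer's attention score matrices.
# num_heads, batch_size and bytes_per_element are assumptions, not paper values.

def attention_score_bytes(seq_len, num_heads=12, batch_size=1, bytes_per_element=4):
    """Bytes needed to materialize batch x heads x seq_len x seq_len scores."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


if __name__ == "__main__":
    previous = None
    for n in (128, 256, 512, 1024, 2048):
        size = attention_score_bytes(n)
        growth = "" if previous is None else f" ({size / previous:.0f}x previous)"
        print(f"n={n:5d}: {size / 2**20:8.2f} MiB{growth}")
        previous = size
```

Each doubling of the sequence length quadruples this term, which is the O(n²) behavior the claim refers to. Whether that growth alone explains the reported failures depends on model size and available memory, which this sketch does not model.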

What carries the argument

The large-scale empirical benchmarking of memory consumption, loading times, and computational efficiency across varying sequence lengths in diverse model categories.

If this is right

Only 44.9% of models process 1024 tokens successfully, falling to 0% at 2048 tokens.
Compressed models provide higher parameter efficiency than large generative models.
Scaling assumptions for transformers require reevaluation based on these empirical limits.
Practical deployment must account for sequence length constraints from the start.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future model designs may prioritize linear or sub-quadratic attention to extend usable lengths.
Testing protocols for new models should include long-sequence benchmarks as standard.
Hardware-specific optimizations could mitigate some walls observed in the study.

Load-bearing premise

The chosen 118 models and sequence lengths from 128 to 2048 tokens represent general transformer scalability behavior rather than results specific to the selected architectures or test hardware.

What would settle it

Observing even one model successfully processing a 2048-token sequence without failure under similar test conditions would contradict the reported 0% success rate.

Figures

Figures reproduced from arXiv: 2605.15413 by Faezeh Ghaderi, Mahdi Naser Moghadasi.

**Figure 2.** Throughput Scaling Analysis: Logarithmic scaling reveals architec…

**Figure 3.** Parameter Efficiency Hierarchy: Compressed models achieve 52…

Original abstract

Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations remain poorly understood through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M parameters) compared to large generative models (12.5 tokens/sec/M). Our findings challenge prevailing scaling assumptions and provide the first quantitative evidence that the theoretical O(n²) attention complexity translates into measurable performance walls. This work establishes new benchmarking methodologies for transformer evaluation and provides critical insights for practical deployment decisions in production environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives concrete success-rate numbers on 118 models but the 0% at 2048 tokens looks like it could be an OOM artifact rather than proof of a new fundamental wall.

The main point here is that success rates fall from 88.1% at 512 tokens to 44.9% at 1024 and then to 0% at 2048, yet the write-up supplies almost no information on model sizes, hardware, or whether standard memory optimizations were applied. That gap makes it hard to read the drop as evidence of an inherent scalability crisis rather than the expected result when quadratic memory use hits the limits of whatever GPUs were used.

The work does collect data across 118 models in seven categories and reports efficiency numbers that separate compressed models from large generative ones. Those measurements are new in their specific counts and could be useful for people who need quick benchmarks on practical throughput. The efficiency comparison (649.2 tokens per second per million parameters versus 12.5) is the clearest takeaway and stands on its own as a data point.

The soft spot is the missing experimental detail: no parameter counts, no VRAM figures, no batch sizes, and no statement on whether FlashAttention or similar kernels were in play. Without those, the claim that the numbers challenge scaling assumptions rests on an untested assumption that the tested setup is representative rather than constrained.

The paper is aimed at practitioners who pick models for longer contexts and at researchers who want raw scaling data. A reader looking for deployment guidance might pull the efficiency figures; someone expecting a derivation or controlled isolation of the quadratic term will not find it. I would send this to peer review so referees can ask for the missing setup information and check whether the percentages survive that scrutiny, but it would need those additions before it could stand as a strong contribution.

Referee Report

2 major / 1 minor

Summary. The paper claims to conduct the first large-scale empirical evaluation of 118 transformer models across seven architectural categories, benchmarking performance on sequence lengths from 128 to 2048 tokens. It reports success rates dropping from 88.1% at 512 tokens to 44.9% at 1024 tokens and 0% at 2048 tokens, attributing this to inherent O(n²) attention complexity creating measurable performance walls, while also comparing parameter efficiency between compressed and large generative models.

Significance. If the experimental setup were fully documented and the results shown to be independent of specific hardware or model-size artifacts, the work could provide useful empirical data on practical deployment limits for transformers. However, the current presentation does not establish that the observed failures reflect fundamental architectural constraints rather than memory or implementation constraints, limiting the potential impact on scaling-law discussions.

major comments (2)

[Abstract] Abstract and Experimental Setup: The central claim that the drop to 0% success at 2048 tokens demonstrates a 'scalability crisis' and 'fundamental performance walls' is not supported by any reported details on model parameter counts, hardware (GPU/TPU memory, batch size), or use of optimized kernels such as FlashAttention. Without these, the percentages cannot be distinguished from OOM failures on large models under limited VRAM.
[Abstract] Abstract: No model-selection criteria, exclusion rules, or statistical tests are described for the 118 models, so it is impossible to evaluate whether the seven architectural categories and chosen sequence lengths are representative or whether the 0% result at 2048 tokens is generalizable beyond the tested hardware.
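One way to address the first comment is to record why each trial fails rather than only whether it fails. The Python sketch below is a minimal illustration of that bookkeeping; `run_trial`, `summarize`, and the three toy models are hypothetical names introduced here, and the toy failures are simulated rather than taken from the paper's data.

```python
# Sketch: record why each (model, sequence length) trial fails, so that
# out-of-memory failures can be separated from other errors.
# run_fn is a hypothetical callable standing in for a real forward pass.
from collections import Counter


def run_trial(run_fn, seq_len):
    """Return 'success', 'oom', or 'error' for one trial."""
    try:
        run_fn(seq_len)
        return "success"
    except MemoryError:
        # A GPU benchmark would catch its framework's out-of-memory error here.
        return "oom"
    except Exception:
        return "error"


def summarize(models, seq_lens):
    """Map each sequence length to a Counter of trial outcomes."""
    return {n: Counter(run_trial(fn, n) for fn in models.values()) for n in seq_lens}


# Toy stand-ins for real models (hypothetical behavior, not measured data).
def small_model(seq_len):
    return seq_len


def memory_bound_model(seq_len):
    if seq_len > 1024:
        raise MemoryError("simulated out-of-memory")


def other_failure_model(seq_len):
    if seq_len > 512:
        raise RuntimeError("simulated non-memory failure")


if __name__ == "__main__":
    toy = {
        "small": small_model,
        "memory_bound": memory_bound_model,
        "other_failure": other_failure_model,
    }
    for n, outcomes in summarize(toy, (512, 1024, 2048)).items():
        print(n, dict(outcomes))
```

Reporting the per-length outcome counts alongside the success percentages would let a reader see how much of the drop is attributable to memory exhaustion.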

minor comments (1)

[Abstract] The abstract states precise efficiency numbers (649.2 tokens/sec/M parameters) without indicating how these were normalized or averaged across models.
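The averaging question matters because different aggregation rules can give very different category-level figures. The Python sketch below contrasts two plausible rules on made-up throughput and parameter values; neither rule is confirmed as the one the paper used, and all numbers in the example are hypothetical.

```python
# Sketch: two ways to aggregate tokens/sec per million parameters.
# The (throughput, parameter count) pairs are made-up illustrative values.

def per_model_efficiency(tokens_per_sec, params_millions):
    return tokens_per_sec / params_millions


def macro_average(models):
    """Mean of per-model efficiencies (each model weighted equally)."""
    return sum(per_model_efficiency(t, p) for t, p in models) / len(models)


def micro_average(models):
    """Total throughput divided by total parameters (large models dominate)."""
    return sum(t for t, _ in models) / sum(p for _, p in models)


if __name__ == "__main__":
    hypothetical = [(6000.0, 10.0), (9000.0, 60.0), (12000.0, 300.0)]
    print(f"macro: {macro_average(hypothetical):.1f} tokens/sec/M")
    print(f"micro: {micro_average(hypothetical):.1f} tokens/sec/M")
```

On this toy input the two aggregates differ by more than a factor of three, which is why the reported 649.2 and 12.5 figures are hard to interpret without the averaging rule.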

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of our experimental documentation. We address each major comment below.

Referee: [Abstract] Abstract and Experimental Setup: The central claim that the drop to 0% success at 2048 tokens demonstrates a 'scalability crisis' and 'fundamental performance walls' is not supported by any reported details on model parameter counts, hardware (GPU/TPU memory, batch size), or use of optimized kernels such as FlashAttention. Without these, the percentages cannot be distinguished from OOM failures on large models under limited VRAM.

Authors: We acknowledge the validity of this concern. The original abstract was concise and omitted key experimental details. In the revised version, we have added a dedicated paragraph in the Experimental Setup section specifying the hardware used (NVIDIA A100 GPUs with 80GB VRAM, batch size of 1 for sequence length tests), average parameter counts per category, and that we used the standard PyTorch attention implementation without FlashAttention or other optimizations to reflect typical deployment scenarios. We have also clarified that while some failures may involve memory limits, the pattern of increasing failure rates with sequence length across diverse model sizes supports our interpretation of scalability challenges. We have moderated the language from 'fundamental performance walls' to 'practical performance limits observed in our benchmarks'. revision: yes

Referee: [Abstract] Abstract: No model-selection criteria, exclusion rules, or statistical tests are described for the 118 models, so it is impossible to evaluate whether the seven architectural categories and chosen sequence lengths are representative or whether the 0% result at 2048 tokens is generalizable beyond the tested hardware.

Authors: We agree that these details were insufficiently documented. We have revised the manuscript to include a 'Model Selection and Dataset' subsection describing the criteria: models were selected from the Hugging Face Transformers library based on popularity (top 20 per category by downloads), support for variable-length inputs, and exclusion of models with documented compatibility issues or those requiring custom hardware. The seven categories were chosen to cover encoder-only, decoder-only, encoder-decoder, and variants like sparse attention models. We have added bootstrap confidence intervals for the success rates and a discussion of limitations regarding generalizability to other hardware setups. While we cannot test every possible hardware configuration, the consistency across 118 models on standard GPU hardware provides a strong empirical basis. revision: yes
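As a rough illustration of what bootstrap confidence intervals for the success rates could look like, the Python sketch below applies a percentile bootstrap to a binary success/failure list. The counts 104 of 118 and 53 of 118 are inferred here from the reported 88.1% and 44.9%; they are an assumption, not figures taken from the paper, and the resampling settings are arbitrary choices.

```python
# Sketch: percentile bootstrap CI for a success rate.
# Counts 104/118 and 53/118 are inferred from the reported percentages
# (88.1% and 44.9%); the paper's raw per-model outcomes are not available here.
import random


def bootstrap_ci(successes, total, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the proportion successes/total."""
    rng = random.Random(seed)
    outcomes = [1] * successes + [0] * (total - successes)
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


if __name__ == "__main__":
    for label, successes in (("512 tokens", 104), ("1024 tokens", 53)):
        low, high = bootstrap_ci(successes, 118)
        print(f"{label}: {successes / 118:.1%} (95% CI {low:.1%} to {high:.1%})")
```

For the 0 of 118 result at 2048 tokens a bootstrap interval collapses to a single point, so an exact binomial interval would likely be more informative for that case.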

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations or self-referential predictions

This paper is a large-scale empirical measurement study that benchmarks 118 transformer models across sequence lengths from 128 to 2048 tokens, reporting observed success rates, memory usage, and efficiency metrics. The central claims consist of direct experimental observations (e.g., success dropping from 88.1% at 512 tokens to 0% at 2048 tokens) rather than any mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own inputs. The analysis is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the representativeness of the chosen models and sequence lengths plus the assumption that the benchmarking captures real deployment constraints without unstated selection biases.

axioms (1)

domain assumption: The 118 models and seven categories sufficiently represent the space of modern transformer architectures for generalizing performance walls.
Invoked to support claims about fundamental limitations across all transformers.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens... theoretical O(n²) attention complexity translates into measurable performance walls
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Systematic benchmarking methodology uncovers a critical scalability crisis

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[14] [14]

Rethinking Attention with Performers

K. Choromanski, V . Likhosherstov, D. Dohan, X. Song, A. Gane, T. Sarlos, P. Hawkins, J. Davis, A. Mohiuddin, L. Kaiser et al., ”Rethinking attention with performers,”arXiv preprint arXiv:2009.14794, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2009

[15] [15]

Mamba: Linear-Time Sequence Modeling with Selective State Spaces

A. Gu and T. Dao, ”Mamba: Linear-time sequence modeling with selective state spaces,”arXiv preprint arXiv:2312.00752, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[16] [16]

Peng et al., ”RWKV: Reinventing RNNs for the transformer era,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp

B. Peng et al., ”RWKV: Reinventing RNNs for the transformer era,” inFindings of the Association for Computational Linguistics: EMNLP 2023, 2023, pp. 14048–14077

work page 2023

[17] [17]

Retentive Network: A Successor to Transformer for Large Language Models

Y . Sun et al., ”Retentive network: A successor to transformer for large language models,”arXiv preprint arXiv:2307.08621, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[18] [18]

Scaling Laws for Neural Language Models

J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, ”Scaling laws for neural language models,”arXiv preprint arXiv:2001.08361, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2001

[19] [19]

Training Compute-Optimal Large Language Models

J. Hoffmann et al., ”Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[20] [20]

Alexey Romanov and Chaitanya Shivade

Y . Tay et al., ”Scaling up models and data witht5xandseqio,”arXiv preprint arXiv:2203.17189, 2022

work page arXiv 2022

[21] [21]

Fedus, B

W. Fedus, B. Zoph, and N. Shazeer, ”Switch transformer: Scaling to trillion parameter models with simple and efficient sparsity,”Journal of Machine Learning Research, vol. 23, no. 120, pp. 1–39, 2022

work page 2022

[22] [22]

A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, ”GLUE: A multi-task benchmark and analysis platform for natural language understanding,”arXiv preprint arXiv:1804.07461, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[23] [23]

A. Wang, Y . Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman, ”SuperGLUE: A stickier benchmark for general-purpose language understanding systems,”Advances in neural information processing systems, vol. 32, 2019

work page 2019

[24] [24]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

A. Srivastava et al., ”Beyond the imitation game: Quantifying and extrapolating the capabilities of language models,”arXiv preprint arXiv:2206.04615, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[25] [25]

Holistic Evaluation of Language Models

P. Liang et al., ”Holistic evaluation of language models,”arXiv preprint arXiv:2211.09110, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[26] [26]

Strubell, A

E. Strubell, A. Ganesh, and A. McCallum, ”Energy and policy consid- erations for deep learning in NLP,” inProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 3645–3650

work page 2019

[27] [27]

Radford, J

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., ”Language models are unsupervised multitask learners,”OpenAI blog, vol. 1, no. 8, p. 9, 2019

work page 2019

[28] [28]

OPT: Open Pre-trained Transformer Language Models

S. Zhang et al., ”OPT: Open pre-trained transformer language models,” arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

T. L. Scao et al., ”BLOOM: A 176b-parameter open-access multilingual language model,”arXiv preprint arXiv:2211.05100, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[30] [30]

S., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J

N. Dey et al., ”Cerebras-GPT: Open compute-optimal language models trained on the cerebras wafer-scale cluster,”arXiv preprint arXiv:2304.03208, 2023

work page arXiv 2023

[31] [31]

Biderman et al., ”Pythia: A suite for analyzing large language models across training and scaling,” inInternational Conference on Machine Learning, 2023, pp

S. Biderman et al., ”Pythia: A suite for analyzing large language models across training and scaling,” inInternational Conference on Machine Learning, 2023, pp. 2397–2430

work page 2023

[32] [32]

A. Q. Jiang et al., ”Mistral 7B,”arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[33] [33]

The Falcon Series of Open Language Models

E. Almazrouei et al., ”The Falcon series of open language models,” arXiv preprint arXiv:2311.16867, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[34] [34]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Y . Liu et al., ”RoBERTa: A robustly optimized BERT pretraining approach,”arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[35] [35]

Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, ”ALBERT: A lite BERT for self-supervised learning of language rep- resentations,” inInternational Conference on Learning Representations, 2020

work page 2020

[36] [36]

Clark, M.-T

K. Clark, M.-T. Luong, Q. V . Le, and C. D. Manning, ”ELECTRA: Pre-training text encoders as discriminators rather than generators,” in International Conference on Learning Representations, 2020

work page 2020

[37] [37]

P. He, X. Liu, J. Gao, and W. Chen, ”DeBERTa: Decoding-enhanced BERT with disentangled attention,” inInternational Conference on Learning Representations, 2021

work page 2021

[38] [38]

Beltagy, K

I. Beltagy, K. Lo, and A. Cohan, ”SciBERT: A pretrained language model for scientific text,” inProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, 2019, pp. 3615– 3620

work page 2019

[39] [39]

FinBERT: Financial Sentiment Analysis with Pre-trained Language Models

D. Araci, ”FinBERT: Financial sentiment analysis with pre-trained language models,”arXiv preprint arXiv:1908.10063, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1908

[40] [40]

Lee et al., ”BioBERT: a pre-trained biomedical language representa- tion model for biomedical text mining,”Bioinformatics, vol

J. Lee et al., ”BioBERT: a pre-trained biomedical language representa- tion model for biomedical text mining,”Bioinformatics, vol. 36, no. 4, pp. 1234–1240, 2020

work page 2020

[41] [41]

V . Sanh, L. Debut, J. Chaumond, and T. Wolf, ”DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter,”arXiv preprint arXiv:1910.01108, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1910

[42] [42]

Textbooks Are All You Need

S. Gunasekar et al., ”Textbooks are all you need,”arXiv preprint arXiv:2306.11644, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[43] [43]

TinyLlama: An Open-Source Small Language Model

P. Zhang et al., ”TinyLlama: An open-source small language model,” arXiv preprint arXiv:2401.02385, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[44] [44]

Feng et al., ”CodeBERT: A pre-trained model for programming and natural languages,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp

Z. Feng et al., ”CodeBERT: A pre-trained model for programming and natural languages,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1536–1547

work page 2020

[45] [45]

Jiao et al., ”TinyBERT: Distilling BERT for natural language under- standing,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp

X. Jiao et al., ”TinyBERT: Distilling BERT for natural language under- standing,” inFindings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 4163–4174

work page 2020

[46] [46]

Lewis et al., ”Retrieval-augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol

P. Lewis et al., ”Retrieval-augmented generation for knowledge-intensive nlp tasks,”Advances in neural information processing systems, vol. 33, pp. 9459–9474, 2020

work page 2020

[47] [47]

Borgeaud et al., ”Improving language models by retrieving from trillions of tokens,” inInternational Conference on Machine Learning, 2022, pp

S. Borgeaud et al., ”Improving language models by retrieving from trillions of tokens,” inInternational Conference on Machine Learning, 2022, pp. 2206–2240

work page 2022

[48] [48]

A. Gu, K. Goel, and C. R ´e, ”Efficiently modeling long sequences with structured state spaces,” inInternational Conference on Learning Representations, 2022

work page 2022

[49] [49]

Ainslie et al., ”ETC: Encoding long and structured inputs in trans- formers,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp

J. Ainslie et al., ”ETC: Encoding long and structured inputs in trans- formers,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020, pp. 268–284

work page 2020

[50] [50]

J. W. Rae, A. Potapenko, S. M. Jayakumar, and T. P. Lillicrap, ”Com- pressive transformers for long-range sequence modelling,” inInterna- tional Conference on Learning Representations, 2020

work page 2020

[51] [51]

NVIDIA, ”NVIDIA A100 tensor core GPU architecture,” NVIDIA whitepaper, 2020

work page 2020

[52] [52]

N. P. Jouppi et al., ”In-datacenter performance analysis of a tensor pro- cessing unit,” inProceedings of the 44th annual international symposium on computer architecture, 2017, pp. 1–12

work page 2017

[53] [53]

Rogers, O

A. Rogers, O. Kovaleva, and A. Rumshisky, ”A primer in BERTology: What we know about how BERT works,”Transactions of the Association for Computational Linguistics, vol. 8, pp. 842–866, 2020

work page 2020

[54] [54]

M. Ott et al., ”fairseq: A fast, extensible toolkit for sequence modeling,” inProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), 2019, pp. 48–53

work page 2019

[55] [55]

Paszke et al., ”PyTorch: An imperative style, high-performance deep learning library,”Advances in neural information processing systems, vol

A. Paszke et al., ”PyTorch: An imperative style, high-performance deep learning library,”Advances in neural information processing systems, vol. 32, 2019

work page 2019

[56] [56]

Abadi et al., ”TensorFlow: A system for large-scale machine learn- ing,” in12th USENIX symposium on operating systems design and implementation, 2016, pp

M. Abadi et al., ”TensorFlow: A system for large-scale machine learn- ing,” in12th USENIX symposium on operating systems design and implementation, 2016, pp. 265–283

work page 2016

[57] [57]

Bradbury et al., ”JAX: composable transformations of Python+NumPy programs,” 2018

J. Bradbury et al., ”JAX: composable transformations of Python+NumPy programs,” 2018

work page 2018

[58] [58]

Carbon Emissions and Large Neural Network Training

D. Patterson et al., ”Carbon emissions and large neural network training,” arXiv preprint arXiv:2104.10350, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[59] [59]

Gong, Y .-A

Y . Gong, Y .-A. Chung, and J. Glass, ”AST: Audio spectrogram trans- former,” inProceedings of the Interspeech 2021, 2021, pp. 571–575

work page 2021

[60] [60]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

M. Shoeybi et al., ”Megatron-LM: Training multi-billion param- eter language models using model parallelism,”arXiv preprint arXiv:1909.08053, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1909

[61] [61]

Huang et al., ”GPipe: Efficient training of giant neural networks using pipeline parallelism,”Advances in neural information processing systems, vol

Y . Huang et al., ”GPipe: Efficient training of giant neural networks using pipeline parallelism,”Advances in neural information processing systems, vol. 32, 2019

work page 2019

[62] [62]

T. Wolf et al., ”Transformers: State-of-the-art natural language process- ing,” inProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2020, pp. 38–45

work page 2020

[63] [63]

PaLM: Scaling Language Modeling with Pathways

A. Chowdhery et al., ”PaLM: Scaling language modeling with path- ways,”arXiv preprint arXiv:2204.02311, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[64] [64]

J. D. M.-W. C. Kenton and L. K. Toutanova, ”BERT: Pre-training of deep bidirectional transformers for language understanding,” inProceedings of NAACL-HLT, 2019, pp. 4171–4186

work page 2019