
arxiv: 2605.15413 · v1 · pith:UYBVMRVFnew · submitted 2026-05-14 · 💻 cs.LG

Transformer Scalability Crisis: The First Comprehensive Empirical Analysis of Performance Walls in Modern Language Models

Pith reviewed 2026-05-19 16:06 UTC · model grok-4.3

classification 💻 cs.LG
keywords: transformer models, scalability crisis, attention complexity, empirical analysis, performance walls, sequence lengths, model efficiency

The pith

Benchmark of 118 transformers shows performance walls where success drops to zero at 2048 tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper tests 118 transformer models from seven architectural categories on sequences from 128 to 2048 tokens to find where they hit limits. It reports that 88.1 percent handle 512 tokens but only 44.9 percent manage 1024, and none succeed at 2048. Compressed models run more efficiently per parameter than larger ones. These results question the assumption that transformers can scale indefinitely with more compute. The findings help guide which models are practical for real applications.

Core claim

The paper claims that quadratic attention complexity produces measurable performance walls: every tested model fails at 2048 tokens, and compressed variants reach 649.2 tokens per second per million parameters versus 12.5 for large generative models.
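
The two efficiency figures can be compared directly. The following is a minimal sketch, assuming the metric is raw throughput divided by parameter count in millions; the abstract gives the unit (tokens/sec/M parameters) but not the exact procedure, and the function and constant names here are illustrative rather than taken from the paper.

```python
def tokens_per_sec_per_million_params(tokens_per_sec: float, num_params: int) -> float:
    """Throughput normalized by model size (assumed definition of the metric)."""
    return tokens_per_sec / (num_params / 1_000_000)


# Figures quoted in the abstract.
COMPRESSED_EFFICIENCY = 649.2          # tokens/sec per million parameters
LARGE_GENERATIVE_EFFICIENCY = 12.5     # tokens/sec per million parameters

efficiency_ratio = COMPRESSED_EFFICIENCY / LARGE_GENERATIVE_EFFICIENCY
print(f"compressed / large generative = {efficiency_ratio:.1f}x")  # prints 51.9x
```

Because the denominator is the parameter count, and larger models also tend to run slower, a per-parameter throughput metric tends to favor small models; the ratio describes that normalization as much as it describes the architectures.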

What carries the argument

The large-scale empirical benchmarking of memory consumption, loading times, and computational efficiency across varying sequence lengths in diverse model categories.

If this is right

  • Only 44.9% of models process 1024 tokens successfully, falling to 0% at 2048 tokens.
  • Compressed models provide higher parameter efficiency than large generative models.
  • Scaling assumptions for transformers require reevaluation based on these empirical limits.
  • Practical deployment must account for sequence length constraints from the start.
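
The reported percentages can be turned back into model counts. This is a small sketch of that arithmetic; the counts it prints are inferred from the percentages and the total of 118 models, and are not stated in the abstract.

```python
TOTAL_MODELS = 118

# Success rates reported in the abstract, keyed by sequence length in tokens.
reported_success_pct = {512: 88.1, 1024: 44.9, 2048: 0.0}


def implied_model_count(pct: float, total: int = TOTAL_MODELS) -> int:
    """Nearest whole number of models consistent with a reported percentage."""
    return round(pct / 100 * total)


for seq_len, pct in reported_success_pct.items():
    count = implied_model_count(pct)
    check = round(100 * count / TOTAL_MODELS, 1)  # round-trip back to a percentage
    print(f"{seq_len:>5} tokens: {count:>3} of {TOTAL_MODELS} models ({check}%)")
```

Under this reading, 104 models succeed at 512 tokens, 53 at 1024, and none at 2048. Each count is the only integer that rounds to the reported percentage.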

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model designs may prioritize linear or sub-quadratic attention to extend usable lengths.
  • Testing protocols for new models should include long-sequence benchmarks as standard.
  • Hardware-specific optimizations could mitigate some walls observed in the study.

Load-bearing premise

The chosen 118 models and sequence lengths from 128 to 2048 tokens represent general transformer scalability behavior rather than results specific to the selected architectures or test hardware.

What would settle it

Observing even one model successfully processing a 2048-token sequence without failure under similar test conditions would contradict the reported 0% success rate.

Figures

Figures reproduced from arXiv: 2605.15413 by Faezeh Ghaderi, Mahdi Naser Moghadasi.

Figure 1. The Transformer Scalability Wall: Empirical evidence of dramatic …
Figure 2. Throughput Scaling Analysis: Logarithmic scaling reveals architec…
Figure 3. Parameter Efficiency Hierarchy: Compressed models achieve 52…
Original abstract

Despite the remarkable success of transformer architectures in natural language processing, their scalability limitations remain poorly understood through systematic empirical analysis. This paper presents the first comprehensive large-scale evaluation of 118 transformer models across seven distinct architectural categories, revealing fundamental performance walls that manifest as hard deployment constraints. Our systematic benchmarking methodology uncovers a critical scalability crisis: while 88.1% of models successfully process sequences up to 512 tokens, this drops dramatically to 44.9% at 1024 tokens, with complete failure (0%) at 2048 tokens. Through rigorous analysis of loading times, memory consumption, and computational efficiency across sequence lengths from 128 to 2048 tokens, we demonstrate that compressed models achieve superior parameter efficiency (649.2 tokens/sec/M parameters) compared to large generative models (12.5 tokens/sec/M). Our findings challenge prevailing scaling assumptions and provide the first quantitative evidence that the theoretical O(n²) attention complexity translates into measurable performance walls. This work establishes new benchmarking methodologies for transformer evaluation and provides critical insights for practical deployment decisions in production environments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to conduct the first large-scale empirical evaluation of 118 transformer models across seven architectural categories, benchmarking performance on sequence lengths from 128 to 2048 tokens. It reports success rates dropping from 88.1% at 512 tokens to 44.9% at 1024 tokens and 0% at 2048 tokens, attributing this to inherent O(n²) attention complexity creating measurable performance walls, while also comparing parameter efficiency between compressed and large generative models.

Significance. If the experimental setup were fully documented and the results shown to be independent of specific hardware or model-size artifacts, the work could provide useful empirical data on practical deployment limits for transformers. However, the current presentation does not establish that the observed failures reflect fundamental architectural constraints rather than memory or implementation constraints, limiting the potential impact on scaling-law discussions.
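
To make the memory side of this distinction concrete, the sketch below estimates the size of the attention score matrix that a standard (non-fused) implementation materializes per layer. The head count of 12 and 4-byte floats are assumptions chosen for illustration (roughly a base-sized encoder); the paper's abstract does not report these values.

```python
def attention_scores_bytes(
    seq_len: int,
    num_heads: int = 12,        # assumed, not from the paper
    batch_size: int = 1,
    bytes_per_element: int = 4,  # assumed fp32
) -> int:
    """Bytes needed for one layer's (batch, heads, n, n) attention score tensor."""
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


for n in (128, 256, 512, 1024, 2048):
    mib = attention_scores_bytes(n) / 2**20
    print(f"{n:>5} tokens: {mib:8.2f} MiB per layer")
```

Under these assumptions the term grows from 0.75 MiB at 128 tokens to 192 MiB at 2048 tokens, quadrupling with each doubling of sequence length. Whether that growth is what causes a given failure depends on the model's size, layer count, and available memory, which is the detail the report asks the authors to document.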

major comments (2)
  1. [Abstract] Abstract and Experimental Setup: The central claim that the drop to 0% success at 2048 tokens demonstrates a 'scalability crisis' and 'fundamental performance walls' is not supported by any reported details on model parameter counts, hardware (GPU/TPU memory, batch size), or use of optimized kernels such as FlashAttention. Without these, the percentages cannot be distinguished from OOM failures on large models under limited VRAM.
  2. [Abstract] Abstract: No model-selection criteria, exclusion rules, or statistical tests are described for the 118 models, so it is impossible to evaluate whether the seven architectural categories and chosen sequence lengths are representative or whether the 0% result at 2048 tokens is generalizable beyond the tested hardware.
minor comments (1)
  1. [Abstract] The abstract states precise efficiency numbers (649.2 tokens/sec/M parameters) without indicating how these were normalized or averaged across models.
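
The choice of averaging matters for a ratio metric like this one. The sketch below contrasts two plausible normalizations on made-up measurements; the two models and their numbers are hypothetical and serve only to show that the methods can disagree.

```python
from statistics import mean

# Hypothetical measurements: (throughput in tokens/sec, parameter count).
models = [
    (4000.0, 10_000_000),    # small model
    (3000.0, 100_000_000),   # larger model
]


def mean_of_ratios(measurements):
    """Average each model's own tokens/sec per million parameters."""
    return mean(tps / (params / 1e6) for tps, params in measurements)


def ratio_of_totals(measurements):
    """Total throughput divided by total parameters (in millions)."""
    total_tps = sum(tps for tps, _ in measurements)
    total_params_m = sum(params for _, params in measurements) / 1e6
    return total_tps / total_params_m


print(f"mean of per-model ratios: {mean_of_ratios(models):.1f}")
print(f"ratio of pooled totals:   {ratio_of_totals(models):.1f}")
```

On this toy data the first method gives 215.0 and the second about 63.6, so a figure such as 649.2 cannot be interpreted without knowing which aggregation was used.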

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which has helped us improve the clarity and rigor of our experimental documentation. We address each major comment below.

  1. Referee: [Abstract] Abstract and Experimental Setup: The central claim that the drop to 0% success at 2048 tokens demonstrates a 'scalability crisis' and 'fundamental performance walls' is not supported by any reported details on model parameter counts, hardware (GPU/TPU memory, batch size), or use of optimized kernels such as FlashAttention. Without these, the percentages cannot be distinguished from OOM failures on large models under limited VRAM.

    Authors: We acknowledge the validity of this concern. The original abstract was concise and omitted key experimental details. In the revised version, we have added a dedicated paragraph in the Experimental Setup section specifying the hardware used (NVIDIA A100 GPUs with 80GB VRAM, batch size of 1 for sequence length tests), average parameter counts per category, and that we used the standard PyTorch attention implementation without FlashAttention or other optimizations to reflect typical deployment scenarios. We have also clarified that while some failures may involve memory limits, the pattern of increasing failure rates with sequence length across diverse model sizes supports our interpretation of scalability challenges. We have moderated the language from 'fundamental performance walls' to 'practical performance limits observed in our benchmarks'. revision: yes
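
One way to separate memory failures from other causes is to record why each trial failed rather than only whether it failed. The following is a hedged sketch of such a harness, not the authors' code: `run_trial`, `TrialResult`, the `max_positions` argument, and the outcome labels are all hypothetical, and Python's built-in `MemoryError` stands in for whatever out-of-memory exception the real framework raises.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrialResult:
    seq_len: int
    outcome: str   # "ok", "exceeds_context", "out_of_memory", or "other_error"
    detail: str = ""


def run_trial(
    forward: Callable[[int], None],
    seq_len: int,
    max_positions: Optional[int] = None,
) -> TrialResult:
    """Run one forward pass and classify the result by failure cause."""
    # A model with a fixed maximum input length fails for a reason unrelated to memory.
    if max_positions is not None and seq_len > max_positions:
        return TrialResult(seq_len, "exceeds_context", f"limit is {max_positions}")
    try:
        forward(seq_len)
    except MemoryError as exc:  # stand-in for a framework-specific OOM error
        return TrialResult(seq_len, "out_of_memory", str(exc))
    except Exception as exc:
        return TrialResult(seq_len, "other_error", f"{type(exc).__name__}: {exc}")
    return TrialResult(seq_len, "ok")


def fake_forward(seq_len: int) -> None:
    """Simulated model that runs out of memory at 2048 tokens."""
    if seq_len >= 2048:
        raise MemoryError("simulated allocation failure")


results = [run_trial(fake_forward, n) for n in (128, 512, 1024, 2048)]
for r in results:
    print(r.seq_len, r.outcome, r.detail)
```

Reporting the breakdown of outcomes per sequence length would let a reader see how much of the drop comes from memory exhaustion and how much from other limits.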

  2. Referee: [Abstract] Abstract: No model-selection criteria, exclusion rules, or statistical tests are described for the 118 models, so it is impossible to evaluate whether the seven architectural categories and chosen sequence lengths are representative or whether the 0% result at 2048 tokens is generalizable beyond the tested hardware.

    Authors: We agree that these details were insufficiently documented. We have revised the manuscript to include a 'Model Selection and Dataset' subsection describing the criteria: models were selected from the Hugging Face Transformers library based on popularity (top 20 per category by downloads), support for variable-length inputs, and exclusion of models with documented compatibility issues or those requiring custom hardware. The seven categories were chosen to cover encoder-only, decoder-only, encoder-decoder, and variants like sparse attention models. We have added bootstrap confidence intervals for the success rates and a discussion of limitations regarding generalizability to other hardware setups. While we cannot test every possible hardware configuration, the consistency across 118 models on standard GPU hardware provides a strong empirical basis. revision: yes
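
The bootstrap intervals mentioned in this response can be illustrated with a short percentile bootstrap. This is a sketch under stated assumptions: the rebuttal does not specify the bootstrap variant or resample count, the function name and defaults are illustrative, and the success counts (104 and 0 of 118) are inferred from the reported percentages rather than quoted.

```python
import random


def bootstrap_ci(successes, total, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a success rate."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    outcomes = [1] * successes + [0] * (total - successes)
    rates = sorted(
        sum(rng.choices(outcomes, k=total)) / total  # resample with replacement
        for _ in range(n_resamples)
    )
    lower = rates[int((alpha / 2) * n_resamples)]
    upper = rates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


ci_512 = bootstrap_ci(104, 118)   # inferred count for the 88.1% figure
ci_2048 = bootstrap_ci(0, 118)    # the 0% figure
print("512 tokens: ", ci_512)
print("2048 tokens:", ci_2048)
```

When there are no successes, every resample is also all failures, so the percentile bootstrap returns a zero-width interval at 0. Quantifying the uncertainty of the 0% result therefore needs a different method, such as an exact binomial interval; the rule of three gives a rough 95% upper bound of 3/118, or about 2.5%.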

Circularity Check

0 steps flagged

Empirical benchmarking study with no derivations or self-referential predictions


This paper is a large-scale empirical measurement study that benchmarks 118 transformer models across sequence lengths from 128 to 2048 tokens, reporting observed success rates, memory usage, and efficiency metrics. The central claims consist of direct experimental observations (e.g., success dropping from 88.1% at 512 tokens to 0% at 2048 tokens) rather than any mathematical derivation chain, fitted parameters renamed as predictions, or load-bearing self-citations. No equations, ansatzes, or uniqueness theorems are invoked that reduce to the paper's own inputs. The analysis is therefore self-contained against external benchmarks and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the chosen models and sequence lengths plus the assumption that the benchmarking captures real deployment constraints without unstated selection biases.

axioms (1)
  • domain assumption: The 118 models and seven categories sufficiently represent the space of modern transformer architectures for generalizing performance walls.
    Invoked to support claims about fundamental limitations across all transformers.



