The Recurrent Transformer: Greater Effective Depth and Efficient Decoding

Alexandru Meterez; Costin-Andrei Oncescu; Depen Morwani; Mujin Kwun; Samy Jelassi; Sham Kakade

arxiv: 2604.21215 · v1 · submitted 2026-04-23 · 💻 cs.LG

Pith reviewed 2026-05-09 22:00 UTC · model grok-4.3

keywords: recurrent transformer, transformer architecture, language pretraining, efficient decoding, attention mechanism, sequence modeling, C4 dataset, kv cache

The pith

Recurrent Transformers improve C4 pretraining cross-entropy by adding per-layer recurrence that increases effective depth while allowing fewer layers at fixed parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Recurrent Transformer, in which each layer attends to key-value pairs computed from its own activations rather than from the preceding layer's outputs. This produces layer-wise recurrent memory that raises effective depth without raising the layer count. The design is shown to emulate both a standard Transformer and token-to-token recurrence under mild conditions, and an exact tiling algorithm is supplied to keep training and prefill efficient by lowering HBM traffic from quadratic to near-linear in sequence length. On 150M- and 300M-parameter models pretrained on C4, the recurrent versions reach lower cross-entropy than parameter-matched baselines, and they do so even when the recurrent model uses fewer layers.

Core claim

By recomputing key and value projections from each layer's own hidden states, the Recurrent Transformer injects recurrence across layers while preserving autoregressive decoding cost. This change yields greater effective depth at fixed parameter budgets, producing lower cross-entropy on C4 pretraining than standard Transformers and permitting the same accuracy with shallower stacks. The accompanying tiling procedure reduces memory traffic to Θ(N log N) and raises arithmetic intensity to Θ(N / log N), making the sequential dependencies practical to train.
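The intensity figure follows from dividing work by memory traffic. A short derivation, assuming attention over a length-N sequence costs Θ(N²) FLOPs at fixed head dimension (a standard count, not a number taken from the paper) and using the traffic bounds quoted from the abstract:

```latex
\begin{align*}
\text{FLOPs} &= \Theta(N^2), \\
I_{\text{naive}} &= \frac{\Theta(N^2)\ \text{FLOPs}}{\Theta(N^2)\ \text{bytes}} = \Theta(1), \\
I_{\text{tiled}} &= \frac{\Theta(N^2)\ \text{FLOPs}}{\Theta(N \log N)\ \text{bytes}}
                 = \Theta\!\left(\frac{N}{\log N}\right).
\end{align*}
```

The arithmetic is unchanged between the two rows; only the number of bytes moved to and from HBM differs.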

What carries the argument

Per-layer recurrent attention, where each layer attends to key-value pairs derived from its own activations rather than the prior layer's outputs.
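A minimal NumPy sketch of this idea, under loud assumptions: a single head, no MLP or normalization, and each position attending only to strictly earlier positions. The function names `recurrent_layer` and `standard_layer` are hypothetical, and the paper's actual layer may differ in all of these details. The only point illustrated is where the keys and values come from.

```python
import numpy as np

def _softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def recurrent_layer(x, Wq, Wk, Wv):
    """Keys/values are projected from this layer's OWN outputs (sequential in t)."""
    n, d = x.shape
    y = np.zeros_like(x)
    keys, values = [], []
    for t in range(n):
        q = x[t] @ Wq                          # query from the layer input
        if keys:
            K, V = np.stack(keys), np.stack(values)
            attn = _softmax(K @ q / np.sqrt(d)) @ V
        else:
            attn = np.zeros(d)                 # assumption: position 0 attends to nothing
        y[t] = x[t] + attn                     # residual update (MLP and norms omitted)
        keys.append(y[t] @ Wk)                 # key/value from the layer's own output
        values.append(y[t] @ Wv)
    return y

def standard_layer(x, Wq, Wk, Wv):
    """Baseline for comparison: keys/values are projected from the layer INPUT."""
    n, d = x.shape
    K, V = x @ Wk, x @ Wv
    y = x.copy()
    for t in range(1, n):
        y[t] = x[t] + _softmax(K[:t] @ (x[t] @ Wq) / np.sqrt(d)) @ V[:t]
    return y
```

In this simplified setting the two layers agree on the first two positions and diverge from the third onward, because the key at position 1 is the first one computed from an output that already contains attended information.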

If this is right

Lower cross-entropy loss on C4 pretraining for both 150M and 300M parameter models relative to standard Transformers.
Performance gains remain available when the recurrent model is configured with fewer layers than the baseline at matched parameter count.
Smaller KV cache footprint and lower inference latency because effective depth is obtained with shallower stacks.
Training and prefill arithmetic intensity rises to Θ(N / log N) for sequence length N through exact tiling.
The architecture can replicate either conventional Transformer behavior or token-to-token recurrence as required.
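The KV cache point can be made concrete with back-of-the-envelope arithmetic. The sketch below assumes the common rule of thumb of roughly 12·L·d² weights for L blocks of width d (embeddings ignored) and full multi-head attention that caches one key and one value vector of width d per layer per token. The 24-layer and 12-layer configurations are hypothetical and are not the paper's reported settings.

```python
def transformer_params(n_layers, d_model):
    # Rule of thumb: ~12 * d_model^2 weights per block (attention + 4x MLP).
    return 12 * n_layers * d_model ** 2

def width_for_budget(n_layers, param_budget):
    # Width that spends the same parameter budget on fewer (or more) layers.
    return (param_budget / (12 * n_layers)) ** 0.5

def kv_cache_bytes(n_layers, d_model, seq_len, batch=1, bytes_per_elem=2):
    # One key and one value vector of width d_model per layer per token.
    return 2 * n_layers * d_model * seq_len * batch * bytes_per_elem

# Hypothetical comparison at a fixed budget: halve the depth, widen to compensate.
budget = transformer_params(24, 1024)
shallow_width = width_for_budget(12, budget)
kv_ratio = kv_cache_bytes(12, shallow_width, 2048) / kv_cache_bytes(24, 1024, 2048)
```

Under these assumptions halving the depth multiplies the width by √2, so the cache shrinks by a factor of 1/√2 (about 0.71) rather than by half.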

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The depth-for-width trade-off may let practitioners reach higher effective depth without proportional growth in parameter count or inference memory.
The tiling technique could be reused for other sequence models that introduce intra-layer dependencies during training.
Selective application of recurrence to only some layers might further balance quality against speed on long sequences.
Similar per-layer recurrence might be tested on non-language sequence tasks to check whether the depth gain generalizes.

Load-bearing premise

The per-layer recurrence can be optimized without instability and the tiling algorithm exactly reproduces the sequential computation.
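One way to see why an exact tiling can exist is to check a power-of-two schedule in the style of Oncescu et al. [2025]. The sketch below is an assumption about how such a schedule might carry over to this setting, not the paper's algorithm: at step i the most recent u keys (u being the largest power of two dividing i) are applied to the next u queries. The helper names `tile_schedule` and `check_schedule` are hypothetical, and "traffic" counts key and query rows touched per tile.

```python
def tile_schedule(n):
    """Yield (k_lo, k_hi, q_lo, q_hi), 1-indexed and inclusive, one tile per step i."""
    for i in range(1, n):
        u = i & -i                       # largest power of two dividing i
        yield i - u + 1, i, i + 1, min(i + u, n)

def check_schedule(n):
    """Verify every (key s, query t) pair with s < t is covered exactly once."""
    seen = set()
    traffic = 0
    for k_lo, k_hi, q_lo, q_hi in tile_schedule(n):
        assert k_hi < q_lo               # a tile only uses keys already revealed
        for s in range(k_lo, k_hi + 1):
            for t in range(q_lo, q_hi + 1):
                assert (s, t) not in seen
                seen.add((s, t))
        traffic += (k_hi - k_lo + 1) + (q_hi - q_lo + 1)
    assert len(seen) == n * (n - 1) // 2
    return traffic
```

For N a power of two, the row count returned by `check_schedule(N)` is N·log₂N, whereas re-reading every earlier key at each step would touch N(N−1)/2 rows.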

What would settle it

A side-by-side C4 pretraining run in which a Recurrent Transformer with fewer layers fails to reach lower cross-entropy than its parameter-matched standard Transformer baseline.

Figures

Figures reproduced from arXiv: 2604.21215 by Alexandru Meterez, Costin-Andrei Oncescu, Depen Morwani, Mujin Kwun, Samy Jelassi, Sham Kakade.

**Figure 2.** C4 pretraining: loss curves for 300m parameter

**Figure 3.** We use the tiling of Oncescu et al. [2025] to increase arithmetic intensity during the forward pass since

**Figure 4.** One-layer forward-pass latency as a function of sequence length at batch size

**Figure 5.** Sequence-level accuracy of the Recurrent Transformer and a regular Transformer on MAD synthetic tasks

**Figure 6.** Token level accuracies on synthetic diagnostics (MAD + copy).

**Figure 7.** C4 pretraining: Ablating the use of RMSNorm in Recurrent Transformer for 150M parameter model at

**Figure 8.** C4 pretraining: loss curve for the 150M parameter model at batch size 512.

**Figure 9.** C4 pretraining: loss curve for the 150M parameter model at batch size 256.

Original abstract

Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near $1$ because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from $\Theta(N^2)$ to $\Theta(N\log N)$, increasing effective arithmetic intensity to $\Theta(N/\log N)$ for sequence length $N$. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (fixed parameters), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Recurrent Transformers add intra-layer recurrence to standard attention and pair it with a tiling trick that claims to keep training exact while cutting HBM traffic, with C4 results suggesting shallower models can match or beat deeper baselines.

The main novelty is the layerwise self-recurrent KV setup: each layer attends to keys and values generated from its own activations rather than only the layer below. This is presented as able to emulate both a plain transformer and token-level recurrence under mild conditions, without the usual instability. The tiling algorithm is the practical piece; it turns what would be sequential KV revelation into a blocked computation that drops HBM traffic from quadratic to N log N and raises arithmetic intensity accordingly. If the math holds exactly, that is a concrete win for prefill and training on long sequences.

The reported C4 pretraining at 150M and 300M parameters shows lower cross-entropy than parameter-matched transformer baselines, and the gains appear with fewer layers at fixed parameter count. That directly supports the claim that recurrence can trade depth for width and shrink KV cache at inference. The experiments are the part that still needs the most scrutiny. The abstract gives no numbers, error bars, or ablation tables, so the full paper must show the baselines are matched on total compute and data, not just parameter count, and that no extra tricks were used to stabilize training. The tiling claim is load-bearing: any deviation in masking or accumulation order would mean the trained model is not the one described, and the inference savings would be overstated. The emulation statements also rest on assumptions that should be spelled out clearly.

This is aimed at people building efficient long-context models who want a drop-in architectural change rather than a full redesign. It is worth sending to referees because the core idea is simple to reimplement and the efficiency argument is testable with concrete metrics.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the Recurrent Transformer, a modification to the standard Transformer where each layer attends to key-value pairs computed from its own activations rather than the prior layer. This yields layerwise recurrent memory while preserving autoregressive decoding. The architecture is shown to emulate both conventional Transformers and token-to-token recurrence under mild assumptions without optimization instability. A key contribution is an exact tiling algorithm that reduces HBM traffic from Θ(N²) to Θ(N log N) during prefill/training, raising arithmetic intensity to Θ(N/log N). On C4 pretraining, 150M- and 300M-parameter Recurrent Transformers achieve lower cross-entropy than parameter-matched baselines while using fewer layers, suggesting recurrence can trade depth for width and thereby reduce KV-cache footprint and inference latency.

Significance. If the central claims hold, the work provides a practical route to greater effective depth in Transformers without increasing layer count, enabling wider-shallower models that cut inference memory and latency while improving language-modeling performance. The tiling algorithm directly addresses the bandwidth bottleneck that has historically limited recurrent-style computations on accelerators. Credit is due for the clean architectural equivalence results and the focus on both training-time efficiency and downstream inference benefits.

major comments (2)

[§4] §4 (tiling algorithm): The claim that the exact tiling algorithm preserves the mathematical computation (including causal masking and sequential K/V revelation) while reducing HBM traffic to Θ(N log N) is load-bearing for both the reported pretraining gains and the efficiency assertions. The manuscript should supply either a formal equivalence argument or complete pseudocode that demonstrates identical outputs to the naïve sequential implementation; any discrepancy in accumulation order or masking would invalidate the C4 results as evidence for the intended Recurrent Transformer.
[§5] §5 (experiments): The central empirical claim—that Recurrent Transformers outperform parameter-matched Transformer baselines on C4 with fewer layers—is load-bearing for the depth-for-width trade-off argument. The section must report exact layer counts, hyperparameter-matching protocol, number of independent runs, error bars or confidence intervals, and at least one ablation isolating the recurrence mechanism; without these, the magnitude and reliability of the reported cross-entropy improvement cannot be assessed.
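The equivalence check requested in the first major comment can be prototyped numerically. The sketch below is not the authors' algorithm: it assumes a single head with no MLP or normalization, queries taken from the layer input (so all queries are known up front), attention over strictly earlier positions, a power-of-two tile schedule, and online-softmax accumulators. The names `sequential` and `tiled` are hypothetical. Because the accumulation order differs, agreement is expected only up to floating-point rounding, not bit for bit.

```python
import numpy as np

def sequential(x, Wq, Wk, Wv):
    """Reference: each position re-reads every earlier key/value."""
    n, d = x.shape
    y, K, V = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
    for t in range(n):
        y[t] = x[t]
        if t > 0:
            s = K[:t] @ (x[t] @ Wq) / np.sqrt(d)
            p = np.exp(s - s.max())
            y[t] = x[t] + (p / p.sum()) @ V[:t]
        K[t], V[t] = y[t] @ Wk, y[t] @ Wv      # revealed only after y[t] exists
    return y

def tiled(x, Wq, Wk, Wv):
    """Same computation, applied tile by tile with online-softmax accumulators."""
    n, d = x.shape
    Q = x @ Wq / np.sqrt(d)
    y, K, V = np.zeros_like(x), np.zeros_like(x), np.zeros_like(x)
    m = np.full(n, -np.inf)                    # running max of scores per query
    l = np.zeros(n)                            # running softmax denominator
    acc = np.zeros_like(x)                     # running unnormalized numerator
    for i in range(n):
        y[i] = x[i] + (acc[i] / l[i] if i > 0 else 0.0)
        K[i], V[i] = y[i] @ Wk, y[i] @ Wv
        step = i + 1                           # 1-indexed step number
        u = step & -step                       # largest power of two dividing step
        ks = slice(step - u, step)             # the u most recently revealed keys
        qs = slice(step, min(step + u, n))     # the next u queries
        if qs.start >= qs.stop:
            continue
        S = Q[qs] @ K[ks].T
        m_new = np.maximum(m[qs], S.max(axis=1))
        scale = np.exp(m[qs] - m_new)          # rescale earlier partial sums
        P = np.exp(S - m_new[:, None])
        l[qs] = l[qs] * scale + P.sum(axis=1)
        acc[qs] = acc[qs] * scale[:, None] + P @ V[ks]
        m[qs] = m_new
    return y
```

Comparing the two functions with `np.allclose` on random inputs is a cheap regression test; a formal argument would still need to show that the tiles cover every causal key-query pair exactly once.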

minor comments (2)

[Abstract] Abstract and §3: The phrase 'under mild assumptions' for the emulation properties is repeated but never enumerated; a short explicit list of the assumptions would improve clarity.
[§4] Notation: Sequence length is denoted N in the complexity statements but occasionally appears as other symbols in the tiling description; consistent use throughout would aid readability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their thoughtful and constructive review. The comments highlight important areas for strengthening the presentation of the tiling algorithm and the experimental results. We address each major comment below and outline the revisions we will make.

Referee: [§4] §4 (tiling algorithm): The claim that the exact tiling algorithm preserves the mathematical computation (including causal masking and sequential K/V revelation) while reducing HBM traffic to Θ(N log N) is load-bearing for both the reported pretraining gains and the efficiency assertions. The manuscript should supply either a formal equivalence argument or complete pseudocode that demonstrates identical outputs to the naïve sequential implementation; any discrepancy in accumulation order or masking would invalidate the C4 results as evidence for the intended Recurrent Transformer.

Authors: We agree that a fully rigorous demonstration of equivalence is essential. Section 4 describes the tiling procedure and explains how it preserves sequential KV revelation and applies causal masking at each step to ensure mathematical identity with the naive implementation. To strengthen this, the revised manuscript will include complete pseudocode for the tiled prefill/training algorithm together with a concise equivalence argument showing that the output, accumulation order, and masking behavior are identical to the sequential version. revision: yes
Referee: [§5] §5 (experiments): The central empirical claim—that Recurrent Transformers outperform parameter-matched Transformer baselines on C4 with fewer layers—is load-bearing for the depth-for-width trade-off argument. The section must report exact layer counts, hyperparameter-matching protocol, number of independent runs, error bars or confidence intervals, and at least one ablation isolating the recurrence mechanism; without these, the magnitude and reliability of the reported cross-entropy improvement cannot be assessed.

Authors: We acknowledge that the current experimental section would benefit from greater detail. The revised manuscript will explicitly report the layer counts used for the 150M- and 300M-parameter models, provide a full description of the hyperparameter-matching protocol (total parameters, optimizer settings, learning-rate schedule, and data order), and add an ablation that isolates the recurrence mechanism by comparing against a non-recurrent architecture with otherwise identical structure. Because the original runs were performed singly owing to compute constraints, we will state this limitation clearly and report the observed cross-entropy values as point estimates; additional runs will be pursued if resources permit. revision: partial

standing simulated objections not resolved

Error bars or confidence intervals from multiple independent runs remain unreported, because the original C4 pretraining experiments were conducted as single runs owing to computational cost.

Circularity Check

0 steps flagged

No circularity: architecture, equivalence claims, and efficiency algorithm are self-contained definitions and algorithms; empirical gains are reported from independent pretraining runs.

full rationale

The paper introduces the Recurrent Transformer via explicit architectural modifications (each layer attends to its own activations), states mild assumptions under which it emulates a standard Transformer or token-level recurrence, and presents a tiling algorithm claimed to preserve exact computation while changing memory traffic. These are definitional and algorithmic steps, not derivations that reduce to fitted parameters or prior self-citations. The central performance claims rest on C4 pretraining experiments with parameter-matched baselines, which are external to any internal fitting loop. No load-bearing step matches the enumerated circularity patterns; the derivation chain is independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim rests on the definition of the new recurrent attention rule and the correctness of the tiling algorithm; no explicit free parameters are introduced beyond standard training hyperparameters; axioms are the standard transformer attention equations plus the mild emulation assumptions stated in the abstract.

axioms (2)

standard math: Standard multi-head attention equations
The model is defined by modifying the standard transformer attention computation.
domain assumption: Mild assumptions allow emulation of conventional and recurrent models
Invoked to claim equivalence to both transformer and recurrent behaviors.

invented entity (1)

Recurrent Transformer layer (no independent evidence)
purpose: To provide layerwise recurrent memory while preserving autoregressive decoding cost
New architectural primitive introduced by the paper.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training-Free Looped Transformers
cs.LG · 2026-05 · unverdicted · novelty 7.0

Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.

Reference graph

Works this paper leans on

81 extracted references · 81 canonical work pages · cited by 1 Pith paper · 7 internal anchors

[1] Costin-Andrei Oncescu, Sanket Jayant Purandare, Stratos Idreos, and Sham Kakade. Flash Inference: Near Linear Time Inference for Long Convolution Sequence Models and Beyond.

[2] Attention Is All You Need. NeurIPS.

[6] Zihang Dai, Zhilin Yang, Yiming Yang, et al. Transformer-

[8] Mechanistic Design and Scaling of Hybrid Architectures. Forty-first International Conference on Machine Learning.

[9] Saturated Transformers are Constant-Depth Threshold Circuits. TACL.

[10] Transformers Learn Shortcuts to Automata. ICLR.

[11] Saturated Transformers are Constant-Depth Threshold Circuits. Transactions of the Association for Computational Linguistics, 2022.

[12] TransformerFAM: Feedback attention is working memory. 2024.

[13] Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit. 2023.

[14] Tensor Programs VI: Feature Learning in Infinite-Depth Neural Networks. 2023.

[15] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. 2022.

[16] Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 2009.

[17] Learning Long-Term Dependencies with Gradient Descent is Difficult. 1994.

[18] On the difficulty of training Recurrent Neural Networks. 2013.

[19] Markus N Rabe and Charles Staats. Self-attention does not need

[20] Scaling vision transformers to 22 billion parameters. International conference on machine learning, 2023.

[21] On layer normalization in the transformer architecture. International conference on machine learning, 2020.

[22] Automatic differentiation in PyTorch. NIPS-W.

[24] Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research.

[25] Mamba: Linear-time sequence modeling with selective state spaces. First conference on language modeling.

[26] Transformers are rnns: Fast autoregressive transformers with linear attention. International conference on machine learning, 2020.

[30] Long short-term memory. Neural computation, 1997.

[31] Recurrent memory transformer. Advances in Neural Information Processing Systems.

[32] Compressive transformers for long-range sequence modelling. arXiv preprint, 2019. URL https://arxiv. org/abs

[33] Resurrecting recurrent neural networks for long sequences. International Conference on Machine Learning, 2023.

[34] Deconstructing What Makes a Good Optimizer for Autoregressive Language Models. The Thirteenth International Conference on Learning Representations.

[35] How Does Critical Batch Size Scale in Pre-training? The Thirteenth International Conference on Learning Representations.

[36] Establishing Task Scaling Laws via Compute-Efficient Model Ladders. 2025.

[37] OLMo: Accelerating the Science of Language Models. 2024.

[38] Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Proceedings of the AAAI Conference on Artificial Intelligence.

[39] Rowan Zellers, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL).

[41] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Proceedings of the EMNLP.

[42] Johannes Welbl, Nelson F. Liu, and Matt Gardner. Proceedings of the Workshop on Noisy User-generated Text (WNUT).

[43] Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Proceedings of the AAAI Conference on Artificial Intelligence.

[46] Biao Zhang and Rico Sennrich. Root Mean Square Layer Normalization.

[47] Train Short, Test Long: Attention with Linear Biases Enables Input Length Extrapolation. 2022.

[48] Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 1994. Often cited via the 1994 journal version.

[49] A. Bhagia, J. Liu, A. Wettig, D. Heineman, O. Tafjord, A. H. Jha, L. Soldaini, N. A. Smith, D. Groeneveld, P. W. Koh, J. Dodge, and H. Hajishirzi. Establishing task scaling laws via compute-efficient model ladders, 2025. URL https://arxiv.org/abs/2412.04403

[50] Y. Bisk, R. Zellers, R. L. Bras, J. Gao, and Y. Choi. Piqa: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020.

[51] B. Bordelon, L. Noci, M. B. Li, B. Hanin, and C. Pehlevan. Depthwise hyperparameter transfer in residual networks: Dynamics and scaling limit, 2023.

[52] A. Bulatov, Y. Kuratov, and M. Burtsev. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022.

[53] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. In arXiv preprint arXiv:1803.05457, 2018.

[54] Z. Dai, Z. Yang, Y. Yang, et al. Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860, 2019.

[55] T. Dao, D. Y. Fu, S. Ermon, A. Rudra, and C. Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness, 2022.

[56] M. Dehghani, J. Djolonga, B. Mustafa, P. Padlewski, J. Heek, J. Gilmer, A. P. Steiner, M. Caron, R. Geirhos, I. Alabdulmohsin, et al. Scaling vision transformers to 22 billion parameters. In International conference on machine learning, pages 7480–7512. PMLR, 2023.

[57] A. Fan, T. Lavril, E. Grave, A. Joulin, and S. Sukhbaatar. Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402, 2020.

[58] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, S. Arora, D. Atkinson, R. Authur, K. R. Chandu, A. Cohan, J. Dumas, Y. Elazar, Y. Gu, J. Hessel, T. Khot, W. Merrill, J. Morrison, N. Muennighoff, A. Naik, C. Nam, M. E. Peters, V. Pyatkin, A. Ravichander, D. Schwenk, S. Shah, W. Smith, E. S...

[59] A. Gu and T. Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First conference on language modeling, 2024.

[60] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

[61] D. Hwang, W. Wang, Z. Huo, K. C. Sim, and P. Moreno Mengibar. Transformerfam: Feedback attention is working memory. arXiv preprint arXiv:2404.09173, 2024.

[62] S. Jelassi, D. Brandfonbrener, S. M. Kakade, and E. Malach. Repeat after me: Transformers are better than state space models at copying. arXiv preprint arXiv:2402.01032, 2024.

[63] D. Ju, S. Roller, S. Sukhbaatar, and J. Weston. Staircase attention for recurrent processing of sequences. arXiv preprint arXiv:2106.04279, 2021.

[64] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020.

[65] B. Liu, J. Ash, S. Goel, A. Krishnamurthy, and C. Zhang. Transformers learn shortcuts to automata. In ICLR, 2023. arXiv:2210.10749.

[66] S. McCandlish, J. Kaplan, D. Amodei, and O. Team. An empirical model of large-batch training. arXiv preprint arXiv:1812.06162, 2018.

[67] W. Merrill, A. Sabharwal, and N. A. Smith. Saturated transformers are constant-depth threshold circuits. Transactions of the Association for Computational Linguistics, 10:843–856, 2022. doi:10.1162/tacl_a_00493. URL https://aclanthology.org/2022.tacl-1.49/

[68] T. Mihaylov, P. Clark, T. Khot, and A. Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. In Proceedings of the EMNLP, 2018.

[69] T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, et al. 2 olmo 2 furious. arXiv preprint arXiv:2501.00656, 2024.

[70] C.-A. Oncescu, S. J. Purandare, S. Idreos, and S. Kakade. Flash inference: Near linear time inference for long convolution sequence models and beyond. In Y. Yue, A. Garg, N. Peng, F. Sha, and R. Yu, editors, International Conference on Learning Representations, volume 2025, pages 49732–49757, 2025. URL https://proceedings.iclr.cc/paper_files/paper/2025/f...

[71] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De. Resurrecting recurrent neural networks for long sequences. In International Conference on Machine Learning, pages 26670–26698. PMLR, 2023.

[72] R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks, 2013.

[73] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[74] B. Peng, E. Alcaide, Q. Anthony, A. Albalak, S. Arcadinho, S. Biderman, H. Cao, X. Cheng, M. Chung, M. Grella, et al. Rwkv: Reinventing rnns for the transformer era. arXiv preprint arXiv:2305.13048, 2023.

[75] M. Poli, A. W. Thomas, E. Nguyen, P. Ponnusamy, B. Deiseroth, K. Kersting, T. Suzuki, B. Hie, S. Ermon, C. Re, C. Zhang, and S. Massaroli. Mechanistic design and scaling of hybrid architectures. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=GDp7Gyd9nf

[76] O. Press, N. A. Smith, and M. Lewis. Train short, test long: Attention with linear biases enables input length extrapolation, 2022. URL https://arxiv.org/abs/2108.12409

[77] M. N. Rabe and C. Staats. Self-attention does not need O(n^2) memory. arXiv preprint arXiv:2112.05682, 2021.

[78]

J. W. Rae, A. Potapenko, S. M. Jayakumar, C. Hillier, and T. P. Lillicrap. Compressive transformers for long-range sequence modelling. arxiv preprint, 2019. URL https://arxiv. org/abs, 1911

work page 2019
[79]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

[80]

K. Sakaguchi, R. L. Bras, C. Bhagavatula, and Y. Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, 2020

[81]

C. J. Shallue, J. Lee, J. Antognini, J. Sohl-Dickstein, R. Frostig, and G. E. Dahl. Measuring the effects of data parallelism on neural network training. arXiv preprint arXiv:1811.03600, 2018

[82]

J. T. Smith, A. Warrington, and S. W. Linderman. Simplified state space layers for sequence modeling. arXiv preprint arXiv:2208.04933, 2022

[83]

Y. Sun, L. Dong, S. Huang, S. Ma, Y. Xia, J. Xue, J. Wang, and F. Wei. Retentive network: A successor to transformer for large language models. arXiv preprint arXiv:2307.08621, 2023

[84]

A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. In NeurIPS, 2017

[85]

J. Welbl, N. F. Liu, and M. Gardner. Crowdsourcing multiple choice science questions. In Proceedings of the Workshop on Noisy User-generated Text (WNUT), 2017

[86]

S. Williams, A. Waterman, and D. Patterson. Roofline: An insightful visual performance model for multicore architectures. Communications of the ACM, 52(4):65--76, 2009. doi:10.1145/1498765.1498785

[87]

R. Xiong, Y. Yang, D. He, K. Zheng, S. Zheng, C. Xing, H. Zhang, Y. Lan, L. Wang, and T. Liu. On layer normalization in the transformer architecture. In International conference on machine learning, pages 10524--10533. PMLR, 2020

[88]

G. Yang, D. Yu, C. Zhu, and S. Hayou. Tensor Programs VI: Feature learning in infinite-depth neural networks, 2023

[89]

R. Zellers, Y. Bisk, A. Farhadi, and Y. Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2019

[90]

B. Zhang and R. Sennrich. Root mean square layer normalization. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019. URL https://proceedings.neurips.cc/paper_files/paper/2019/file/1e8a19426224ca89e83cef47f1e7f53b-Paper.pdf

[91]

H. Zhang, D. Morwani, N. Vyas, J. Wu, D. Zou, U. Ghai, D. Foster, and S. M. Kakade. How does critical batch size scale in pre-training? In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=JCiF03qnmi

