The Recurrent Transformer: Greater Effective Depth and Efficient Decoding
Pith reviewed 2026-05-09 22:00 UTC · model grok-4.3
The pith
Recurrent Transformers improve C4 pretraining cross-entropy by adding per-layer recurrence that increases effective depth while allowing fewer layers at fixed parameter count.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing key and value projections from each layer's own hidden states, the Recurrent Transformer gives every layer its own recurrent memory while preserving standard autoregressive decoding cost. This change yields greater effective depth at a fixed parameter budget, producing lower cross-entropy on C4 pretraining than parameter-matched standard Transformers, and the improvement holds with shallower stacks. The accompanying exact tiling procedure reduces HBM traffic from Θ(N²) to Θ(N log N) and raises effective arithmetic intensity to Θ(N / log N), making the sequential dependencies practical to train.
What carries the argument
Per-layer recurrent attention, where each layer attends to key-value pairs derived from its own activations rather than the prior layer's outputs.
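To make the mechanism concrete, here is a minimal NumPy sketch of one single-head layer processed position by position. It is an illustration under stated assumptions, not the paper's implementation: the function name `layer_forward`, the absence of normalization and MLP blocks, the plain residual connection, and the choice to attend only to strictly earlier positions are all simplifications introduced here. The only difference between the two modes is where the keys and values come from.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def layer_forward(x, Wq, Wk, Wv, recurrent=True):
    """Sequential single-head attention layer (hypothetical sketch).

    x: (N, d) inputs to the layer, i.e. the previous layer's outputs.
    recurrent=True  -> keys/values are projected from this layer's own outputs.
    recurrent=False -> keys/values are projected from the layer's inputs.
    Assumption: position t attends only to positions s < t.
    """
    N, d = x.shape
    h = np.zeros_like(x)
    K = np.zeros((N, d))
    V = np.zeros((N, d))
    for t in range(N):
        q = x[t] @ Wq
        if t > 0:
            weights = softmax(K[:t] @ q / np.sqrt(d))
            attn = weights @ V[:t]
        else:
            attn = np.zeros(d)  # nothing to attend to at the first position
        h[t] = x[t] + attn  # residual connection
        source = h[t] if recurrent else x[t]
        K[t] = source @ Wk  # revealed only after h[t] is known
        V[t] = source @ Wv
    return h
```

In the recurrent mode the key and value for position t cannot be formed until the layer's output at t exists, which is the sequential dependency the tiling procedure is meant to handle during training and prefill.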
If this is right
- Lower cross-entropy loss on C4 pretraining for both 150M and 300M parameter models relative to standard Transformers.
- Performance gains remain available when the recurrent model is configured with fewer layers than the baseline at matched parameter count.
- Smaller KV cache footprint and lower inference latency because effective depth is obtained with shallower stacks.
- Training and prefill arithmetic intensity rises to Θ(N / log N) for sequence length N through exact tiling.
- The architecture can replicate either conventional Transformer behavior or token-to-token recurrence as required.
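The arithmetic-intensity figure follows from dividing work by memory traffic. The short derivation below assumes attention performs Θ(N²) floating-point operations per layer at a fixed head dimension (an assumption stated here, consistent with the abstract's "near 1" figure for the naive schedule); the traffic bounds are the ones the paper claims.

```latex
\[
\text{intensity} \;=\; \frac{\text{FLOPs}}{\text{bytes moved to and from HBM}}
\]
\[
\text{naive:}\quad \frac{\Theta(N^2)}{\Theta(N^2)} = \Theta(1),
\qquad
\text{tiled:}\quad \frac{\Theta(N^2)}{\Theta(N \log N)} = \Theta\!\left(\frac{N}{\log N}\right)
\]
```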
Where Pith is reading between the lines
- The depth-for-width trade-off may let practitioners reach higher effective depth without proportional growth in parameter count or inference memory.
- The tiling technique could be reused for other sequence models that introduce intra-layer dependencies during training.
- Selective application of recurrence to only some layers might further balance quality against speed on long sequences.
- Similar per-layer recurrence might be tested on non-language sequence tasks to check whether the depth gain generalizes.
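The depth-for-width point can be checked with back-of-envelope arithmetic. The sketch below rests on two common approximations that are assumptions here, not figures from the paper: roughly 12·L·d² non-embedding parameters for L layers of width d, and a KV cache of 2·L·d elements per token for plain multi-head attention. The layer counts 12 and 6 are hypothetical, since the page does not report the paper's configurations.

```python
import math

def width_for_budget(params, layers):
    # Assumed approximation: params ~= 12 * layers * d_model**2
    return math.sqrt(params / (12 * layers))

def kv_cache_elems_per_token(layers, d_model):
    # Assumed: one key and one value vector of size d_model per layer
    return 2 * layers * d_model

budget = 150e6  # parameter budget; layer counts below are hypothetical
for layers in (12, 6):
    d = width_for_budget(budget, layers)
    kv = kv_cache_elems_per_token(layers, d)
    print(f"layers={layers} d_model~{d:.0f} kv_elems_per_token~{kv:.0f}")
```

Under these approximations the cache per token scales as the square root of the layer count at a fixed budget, so halving depth shrinks it by a factor of about 0.71 even though the model gets wider.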
Load-bearing premise
The per-layer recurrence can be optimized without instability and the tiling algorithm exactly reproduces the sequential computation.
What would settle it
A side-by-side C4 pretraining run in which a Recurrent Transformer with fewer layers fails to reach lower cross-entropy than its parameter-matched standard Transformer baseline.
Original abstract
Transformers process tokens in parallel but are temporally shallow: at position $t$, each layer attends to key-value pairs computed based on the previous layer, yielding a depth capped by the number of layers. Recurrent models offer unbounded temporal depth but suffer from optimization instability and historically underutilize modern accelerators. We introduce the Recurrent Transformer, a simple architectural change where each layer attends to key-value pairs computed off its own activations, yielding layerwise recurrent memory while preserving standard autoregressive decoding cost. We show that the architecture can emulate both (i) a conventional Transformer and (ii) token-to-token recurrent updates under mild assumptions, while avoiding optimization instability. Naively, prefill/training appears bandwidth-bound with effective arithmetic intensity near $1$ because keys and values are revealed sequentially; we give an exact tiling-based algorithm that preserves the mathematical computation while reducing HBM traffic from $\Theta(N^2)$ to $\Theta(N\log N)$, increasing effective arithmetic intensity to $\Theta(N/\log N)$ for sequence length $N$. On 150M and 300M parameter C4 pretraining, Recurrent Transformers improve cross-entropy over a parameter-matched Transformer baseline and achieve the improvement with fewer layers (fixed parameters), suggesting that recurrence can trade depth for width, thus reducing KV cache memory footprint and inference latency.
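The quadratic traffic of the naive schedule can be counted directly. The sketch below assumes that, when keys and values are revealed one position at a time, position t reloads the t pairs produced so far; this is one reading of the abstract, not the paper's cost model. The function `n_log_n` gives only the growth rate of the tiled bound, with the hidden constant unknown.

```python
import math

def naive_kv_reads(n):
    # Position t loads the t key-value pairs revealed so far:
    # 0 + 1 + ... + (n - 1) = n * (n - 1) / 2
    return sum(range(n))

def n_log_n(n):
    # Growth rate of the claimed tiled traffic (constant factor unknown)
    return n * math.log2(n)

for n in (1024, 4096, 16384):
    ratio = naive_kv_reads(n) / n_log_n(n)
    print(f"N={n} naive_reads={naive_kv_reads(n)} ratio_to_NlogN={ratio:.1f}")
```

The ratio grows with N, which is why the gap between the two schedules matters more at long sequence lengths.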
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Recurrent Transformer, a modification to the standard Transformer where each layer attends to key-value pairs computed from its own activations rather than the prior layer. This yields layerwise recurrent memory while preserving autoregressive decoding. The architecture is shown to emulate both conventional Transformers and token-to-token recurrence under mild assumptions without optimization instability. A key contribution is an exact tiling algorithm that reduces HBM traffic from Θ(N²) to Θ(N log N) during prefill/training, raising arithmetic intensity to Θ(N/log N). On C4 pretraining, 150M- and 300M-parameter Recurrent Transformers achieve lower cross-entropy than parameter-matched baselines while using fewer layers, suggesting recurrence can trade depth for width and thereby reduce KV-cache footprint and inference latency.
Significance. If the central claims hold, the work provides a practical route to greater effective depth in Transformers without increasing layer count, enabling wider-shallower models that cut inference memory and latency while improving language-modeling performance. The tiling algorithm directly addresses the bandwidth bottleneck that has historically limited recurrent-style computations on accelerators. Credit is due for the clean architectural equivalence results and the focus on both training-time efficiency and downstream inference benefits.
major comments (2)
- [§4] §4 (tiling algorithm): The claim that the exact tiling algorithm preserves the mathematical computation (including causal masking and sequential K/V revelation) while reducing HBM traffic to Θ(N log N) is load-bearing for both the reported pretraining gains and the efficiency assertions. The manuscript should supply either a formal equivalence argument or complete pseudocode that demonstrates identical outputs to the naïve sequential implementation; any discrepancy in accumulation order or masking would invalidate the C4 results as evidence for the intended Recurrent Transformer.
- [§5] §5 (experiments): The central empirical claim—that Recurrent Transformers outperform parameter-matched Transformer baselines on C4 with fewer layers—is load-bearing for the depth-for-width trade-off argument. The section must report exact layer counts, hyperparameter-matching protocol, number of independent runs, error bars or confidence intervals, and at least one ablation isolating the recurrence mechanism; without these, the magnitude and reliability of the reported cross-entropy improvement cannot be assessed.
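The equivalence check requested in the first major comment can be illustrated with a small harness. The sketch below is not the paper's tiling algorithm: it uses a generic blockwise attention with a running softmax (in the style of FlashAttention) as a stand-in for "a reordered schedule", and compares it against a direct causal-attention reference. The function names and the block size are hypothetical.

```python
import numpy as np

def naive_causal_attention(Q, K, V):
    # Reference: full score matrix, causal mask, row-wise softmax
    N, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.tril(np.ones((N, N), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=1, keepdims=True)
    w = np.exp(scores)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ V

def blockwise_causal_attention(Q, K, V, block=4):
    # Stand-in reordered schedule: visit keys block by block,
    # keeping a running max (m) and running normalizer (l) per query row.
    N, d = Q.shape
    out = np.zeros_like(V, dtype=float)
    m = np.full(N, -np.inf)
    l = np.zeros(N)
    rows = np.arange(N)[:, None]
    for start in range(0, N, block):
        cols = np.arange(start, min(start + block, N))
        s = Q @ K[cols].T / np.sqrt(d)
        s = np.where(cols[None, :] <= rows, s, -np.inf)  # causal mask
        m_new = np.maximum(m, s.max(axis=1))
        scale = np.exp(m - m_new)  # rescale earlier partial sums
        p = np.exp(s - m_new[:, None])
        l = l * scale + p.sum(axis=1)
        out = out * scale[:, None] + p @ V[cols]
        m = m_new
    return out / l[:, None]
```

A check of this kind establishes agreement up to floating-point tolerance, not bit-identical results, because the accumulation order differs between the two schedules. That distinction is the one the referee raises about accumulation order.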
minor comments (2)
- [Abstract] Abstract and §3: The phrase 'under mild assumptions' for the emulation properties is repeated but never enumerated; a short explicit list of the assumptions would improve clarity.
- [§4] Notation: Sequence length is denoted N in the complexity statements but occasionally appears as other symbols in the tiling description; consistent use throughout would aid readability.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive review. The comments highlight important areas for strengthening the presentation of the tiling algorithm and the experimental results. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [§4] §4 (tiling algorithm): The claim that the exact tiling algorithm preserves the mathematical computation (including causal masking and sequential K/V revelation) while reducing HBM traffic to Θ(N log N) is load-bearing for both the reported pretraining gains and the efficiency assertions. The manuscript should supply either a formal equivalence argument or complete pseudocode that demonstrates identical outputs to the naïve sequential implementation; any discrepancy in accumulation order or masking would invalidate the C4 results as evidence for the intended Recurrent Transformer.
  Authors: We agree that a fully rigorous demonstration of equivalence is essential. Section 4 describes the tiling procedure and explains how it preserves sequential KV revelation and applies causal masking at each step to ensure mathematical identity with the naive implementation. To strengthen this, the revised manuscript will include complete pseudocode for the tiled prefill/training algorithm together with a concise equivalence argument showing that the output, accumulation order, and masking behavior are identical to the sequential version.
  Revision: yes
- Referee: [§5] §5 (experiments): The central empirical claim, that Recurrent Transformers outperform parameter-matched Transformer baselines on C4 with fewer layers, is load-bearing for the depth-for-width trade-off argument. The section must report exact layer counts, hyperparameter-matching protocol, number of independent runs, error bars or confidence intervals, and at least one ablation isolating the recurrence mechanism; without these, the magnitude and reliability of the reported cross-entropy improvement cannot be assessed.
  Authors: We acknowledge that the current experimental section would benefit from greater detail. The revised manuscript will explicitly report the layer counts used for the 150M- and 300M-parameter models, provide a full description of the hyperparameter-matching protocol (total parameters, optimizer settings, learning-rate schedule, and data order), and add an ablation that isolates the recurrence mechanism by comparing against a non-recurrent architecture with otherwise identical structure. Because the original runs were performed singly owing to compute constraints, we will state this limitation clearly and report the observed cross-entropy values as point estimates; additional runs will be pursued if resources permit.
  Revision: partial
- Left unresolved by the partial revision: error bars or confidence intervals from multiple independent runs, because the original C4 pretraining experiments were conducted as single runs owing to computational cost.
Circularity Check
No circularity: architecture, equivalence claims, and efficiency algorithm are self-contained definitions and algorithms; empirical gains are reported from independent pretraining runs.
Full rationale
The paper introduces the Recurrent Transformer via explicit architectural modifications (each layer attends to its own activations), states mild assumptions under which it emulates a standard Transformer or token-level recurrence, and presents a tiling algorithm claimed to preserve exact computation while changing memory traffic. These are definitional and algorithmic steps, not derivations that reduce to fitted parameters or prior self-citations. The central performance claims rest on C4 pretraining experiments with parameter-matched baselines, which are external to any internal fitting loop. No load-bearing step matches the enumerated circularity patterns; the derivation chain is independent of its own outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- standard math: Standard multi-head attention equations
- domain assumption: Mild assumptions allow emulation of conventional and recurrent models
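For reference, the standard multi-head attention equations named in the first axiom are usually written as below, with X the layer input, h heads, and d_k the per-head key dimension. The notation is the conventional one, not necessarily the paper's. In the Recurrent Transformer as described on this page, the key and value projections would be applied to the layer's own activations instead of X.

```latex
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
\]
\[
\mathrm{head}_i = \mathrm{Attention}\!\left(X W_i^{Q},\; X W_i^{K},\; X W_i^{V}\right),
\qquad
\mathrm{MultiHead}(X) = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W^{O}
\]
```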
invented entities (1)
- Recurrent Transformer layer: no independent evidence
Forward citations
Cited by 1 Pith paper
- Training-Free Looped Transformers: Training-free looped transformers retrofit recurrence to frozen models via damped ODE sub-steps on mid-stack blocks, yielding gains such as +2.64 pp on MMLU-Pro for Qwen3-4B.