arxiv: 2606.06479 · v1 · pith:AGTUNUJLnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Pretraining Recurrent Networks without Recurrence

Pith reviewed 2026-06-28 02:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords recurrent neural networks · supervised memory training · parallel training · language modeling · pixel sequence modeling · long-range dependencies · transformer encoder

The pith

Supervised Memory Training reduces RNN pretraining to supervised one-step memory transitions using Transformer labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard training of recurrent networks uses backpropagation through time, which is sequential and can suffer from unstable gradients over long sequences. The paper proposes Supervised Memory Training, which instead generates memory labels with a Transformer trained to predict future states, then supervises the RNN to learn each memory update in one step. This makes training parallel across time steps, with gradient paths of fixed length. The result matters if it lets nonlinear RNNs scale to long sequences in tasks such as language modeling, where current methods struggle.

Core claim

By training a Transformer encoder on a predictive state objective to produce memory labels, SMT reduces RNN training to supervised learning on pairs (m_t, x_{t+1}) mapping to m_{t+1}, enabling time-parallel training of nonlinear RNNs with stable O(1) length gradient paths between any tokens without unrolling the network, and outperforming BPTT on language and pixel sequence modeling.
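To make the one-step supervision concrete, here is a minimal sketch of how transition pairs (m_t, x_{t+1}) → m_{t+1} could be assembled and scored. It is an illustration, not the authors' code: the mean-squared-error loss, the toy "running sum and running max" memory labels, and the names `make_transition_pairs`, `one_step_loss`, `exact_updater`, and `lazy_updater` are all assumptions introduced here.

```python
# Toy sketch of one-step memory-transition supervision (illustrative only).
# Assumptions: memory labels are plain float vectors, the loss is mean squared
# error, and `updater` stands in for whatever RNN cell is being trained.

def make_transition_pairs(memories, inputs):
    """Pair (m_t, x_{t+1}) with its target m_{t+1} for every t.

    memories[t] is the teacher label m_t; inputs[t] is the token x_t.
    """
    return [((memories[t], inputs[t + 1]), memories[t + 1])
            for t in range(len(memories) - 1)]


def one_step_loss(updater, pairs):
    """Mean squared error over all pairs.

    Each term depends on a single pair, so the terms could be evaluated
    in any order or in parallel; nothing is unrolled through time.
    """
    total = 0.0
    for (m_t, x_next), m_next in pairs:
        pred = updater(m_t, x_next)
        total += sum((p - q) ** 2 for p, q in zip(pred, m_next))
    return total / len(pairs)


# Hypothetical teacher labels: running sum and running max of the inputs.
inputs = [3.0, 1.0, 4.0, 1.0, 5.0]
memories = []
running_sum, running_max = 0.0, float("-inf")
for x in inputs:
    running_sum += x
    running_max = max(running_max, x)
    memories.append([running_sum, running_max])

pairs = make_transition_pairs(memories, inputs)


def exact_updater(m, x):
    # Reproduces the teacher's transition exactly.
    return [m[0] + x, max(m[1], x)]


def lazy_updater(m, x):
    # Ignores the new input and copies the old memory.
    return list(m)


loss_exact = one_step_loss(exact_updater, pairs)
loss_lazy = one_step_loss(lazy_updater, pairs)
print(len(pairs), loss_exact, loss_lazy)
```

In this toy setting an updater that matches the teacher's transition gets zero loss, while one that ignores the input does not. The point of the sketch is only that every loss term is local to one (m_t, x_{t+1}, m_{t+1}) triple.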

What carries the argument

The predictive state objective that trains the Transformer encoder to retain only past information necessary to predict the future, generating memory labels for supervised RNN training.

If this is right

  • RNN training becomes fully parallelizable in time without sequential unrolling.
  • Gradient paths between tokens have constant length independent of sequence length.
  • Various RNN architectures can be pretrained on language modeling and pixel sequences more effectively than with BPTT.
  • Memory content selection is decoupled from the memory update rule.
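The constant-length gradient claim is easiest to appreciate against the multiplicative chain that BPTT creates. The sketch below uses an assumed scalar linear recurrence h_t = w · h_{t-1} + x_t (not the paper's model) and hypothetical function names; it shows only why a product of per-step Jacobians shrinks or grows with sequence length, whereas a one-step target sees a single factor.

```python
# Toy illustration with an assumed scalar linear recurrence:
#   h_t = w * h_{t-1} + x_t, so d h_T / d h_0 = w ** T under BPTT.

def bptt_path_gradient(w, steps):
    """Multiply the per-step Jacobian `w` along an unrolled chain of `steps` updates."""
    grad = 1.0
    for _ in range(steps):
        grad *= w
    return grad


def one_step_path_gradient(w):
    """With one-step supervision, the target m_{t+1} sees m_t through one update."""
    return w


for steps in (10, 100, 1000):
    print(steps,
          bptt_path_gradient(0.9, steps),   # shrinks toward zero
          bptt_path_gradient(1.1, steps),   # grows without bound
          one_step_path_gradient(0.9))      # unchanged by sequence length
```

This says nothing about how long-range information reaches the memory labels in the first place; per the source, that job falls to the Transformer teacher.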

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach might apply to training other sequential models that currently rely on recurrence.
  • Combining SMT with larger-scale predictive encoders could improve label quality for complex temporal tasks.
  • It opens the possibility of hybrid models where Transformers generate targets for recurrent components at scale.

Load-bearing premise

The memory states generated by the Transformer encoder on the predictive state objective are sufficient for the RNN to learn effective long-range associations when trained via supervised one-step transitions.
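One way to see why this premise is load-bearing: low one-step error on teacher states does not by itself bound the error when the RNN runs on its own states. The sketch below is a toy under stated assumptions (a scalar linear teacher, a student with a slightly wrong decay, hypothetical names throughout). It compares the two regimes with the 1 − R^2 drift measure mentioned in the Figure 12 caption.

```python
# Toy sketch: one-step (off-policy) error vs rollout (on-policy) drift.
# The scalar linear teacher and student are assumptions for illustration.

def r2_drift(predictions, targets):
    """Return 1 - R^2: residual sum of squares over total sum of squares."""
    mean = sum(targets) / len(targets)
    ss_res = sum((p - t) ** 2 for p, t in zip(predictions, targets))
    ss_tot = sum((t - mean) ** 2 for t in targets)
    return ss_res / ss_tot


def teacher_step(m, x):
    return 0.9 * m + x


def student_step(m, x):
    return 0.95 * m + x  # slightly wrong decay, standing in for an imperfect RNN


inputs = [1.0] * 20
teacher = [0.0]
for x in inputs:
    teacher.append(teacher_step(teacher[-1], x))
targets = teacher[1:]

# One-step: the student always starts from the teacher's memory.
one_step = [student_step(m, x) for m, x in zip(teacher[:-1], inputs)]

# Rollout: the student feeds on its own previous memory.
rollout, m_hat = [], teacher[0]
for x in inputs:
    m_hat = student_step(m_hat, x)
    rollout.append(m_hat)

one_step_drift = r2_drift(one_step, targets)
rollout_drift = r2_drift(rollout, targets)
print(one_step_drift, rollout_drift)
```

In this toy case the rollout drift exceeds the one-step drift because each step's error is carried forward and compounded.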

What would settle it

An experiment where an RNN trained via SMT on a long-sequence task requiring dependencies across many steps shows no improvement over BPTT or fails to learn those associations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.06479 by Akarsh Kumar, Phillip Isola.

Figure 1: BPTT vs SMT. Left: BPTT trains an RNN by recurrently unrolling the “updater” network in time, and backpropagating gradients through the entire graph. Right: Supervised Memory Training (SMT) trains an RNN with supervised learning on one-step memory transition labels, which are generated by a Transformer encoder-decoder model pair trained to produce predictive states. SMT is fully time-parallel. In SMT, the …
Figure 2: SMT vs DMT. SMT trains the RNN with behavior cloning on the encoder-generated memory states (off-policy imitation learning). DMT unrolls the RNN with its own memory states and then imitates the encoder trajectory (on-policy imitation learning). Figure design inspired by Jacobs et al. [59]. After SMT, the RNN achieves low one-step error in predicting (m_t, x_{t+1}) → m_{t+1} when m_t comes from the encoder. Howeve…
Figure 3: Synthetic Task Experiments. We evaluate BPTT, SMT, and SMT→DMT using five synthetic tasks with various settings to probe different properties of the algorithms. ∗ signifies that the SMT Encoder is the teacher Transformer (not an RNN) and is used only as a reference. Across all tasks and task settings, SMT→DMT outperforms BPTT, signaling that SMT has better gradient properties, memory utilization, state tra…
Figure 4: Attneave’s MNIST Generation. BPTT fails to effectively capture the long-range dependencies required for pixel sequence modeling, even with a GRU. SMT→DMT captures these dependencies with a non-gated RNN architecture. More samples are in the Appendix.
Figure 5: Attneave’s Sketchy Generation. SMT→DMT captures the stroke structure of human-drawn sketches through only pixel sequence modeling on sparse images. More samples are in the Appendix.
Figure 6: Sequential Compute and Data Efficiency. We sweep training hyperparameters for BPTT, SMT, and SMT→DMT and plot the resulting runs’ performance along sequential compute (SeqFLOPs) used and data processed (Tokens), across different RNN architectures and datasets. Runs are capped at one day on an H200 GPU. ∗ signifies that the SMT Encoder is the teacher Transformer (not an RNN) and is used only as a reference. …
Figure 8: Scaling Model Size. Sweeping the width and depth of the RNN and teacher shows smooth performance improvements in TinyStories. The RNN imitates the teacher performance better at larger scale.
Figure 9: Scaling Laws for Compression. We plot iso-loss contours for SMT-trained encoder models across a range of memory state sizes and training compute budgets. For a fixed target performance, SMT can achieve higher compression (smaller memory size) using additional compute. This result suggests a new property to scale when given more training compute: memory state compression. Neural scaling laws predict the …
Figure 11: Gradient Properties of BPTT and SMT. In the needle retrieval task, the loss is applied at the last timestep. BPTT propagates gradients backward through all timesteps, risking vanishing/exploding gradients for each m_t, depending on the weight initialization. SMT is non-recurrent and has an O(1) credit path length, making its gradients agnostic to initialization and time-horizon.
Figure 12: Impact of DMT across many runs with different SMT λ_dec and λ_dyn hyperparameters. Left: Applying DMT reduces the drift of the RNN rollout (measured with 1 − R^2 of RNN memory prediction m̂_t of encoder ground truth m_t). Middle: DMT significantly improves RNN performance across settings. Right: The one-step drift of the RNN only partially correlates with the rollout drift. Drift and DMT: As described in Secti…
Figure 13: Sequence Length Generalization. An SMT→DMT trained RNN generalizes better than its Transformer teacher when evaluated on sequence lengths longer than training. The task is synthetic state tracking. Benefit of RNNs over Transformers: SMT trains an RNN to mimic a Transformer encoder model, raising the question of why an RNN is needed at all, given the Transformer. RNNs are qualitatively more efficient tha…
Figure 15: Model Architecture for SMT. Left: The encoder reads the input context tokens and a set of learned register tokens, and outputs the memory, m_t, which is a set of memory tokens. The decoder takes in this memory and the future input tokens and predicts the future output tokens, using a causal mask. This setup forces information from the context to be compressed into a memory that is useful for predicting the…
Figure 16: Sweep of λ_dyn and λ_unif. Cell color indicates the RNN test loss for each setting. Top number in each cell is the RNN test loss. Bottom number in each cell shows L_unif. L_unif varies from 0 (collapsed latent space) to −4 (fully uniform latent space). … memory tokens. The encoder is 8 layers deep, while the decoder is 4 layers deep. The RNN is also 8 layers deep, and its readout function is 4 layers dee…
Figure 17: Additional MNIST Samples. Here we give more examples of samples of MNIST images generated by the various methods. SMT→DMT RNN outperforms BPTT, even when BPTT is applied on a GRU architecture, in processing long-horizon information, which is required for pixel modeling.
Figure 18: Additional Sketchy Samples. Here we give more examples of samples of Sketchy images from the dataset and generated by SMT→DMT. Even in this hard sparse domain, SMT→DMT can capture the overall stroke structure, which requires integrating information over hundreds of pixels.
Figure 19: Analysis on Attneave’s Cat. We apply the SMT→DMT-trained RNN on Sketchy and evaluate it on the classic image of Attneave’s cat. The RNN reads the image pixel-by-pixel in raster scan order. Top Left: Input image presented in its original 2D form. Top Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory throughout sequence processing. Top R…
Figure 20: Generations of Attneave’s Cat. We apply the SMT→DMT-trained RNN on Sketchy and apply it to generate part of the image of Attneave’s cat. Given more of the image context, the RNN seems to understand the image better and make somewhat more plausible predictions.
Figure 21: RNN Memory Evolution on MNIST (PCA). We analyze the memory evolution of our SMT→DMT MNIST RNN. Left: Input image presented in its original 2D form. Middle: 3D PCA projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D PCA projection of the memory state trajectory over time.
Figure 22: RNN Memory Evolution on MNIST (t-SNE). We analyze the memory evolution of our SMT→DMT MNIST RNN. Left: Input image presented in its original 2D form. Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D t-SNE projection of the memory state trajectory over time.
Figure 23: RNN Memory Evolution on Sketchy (t-SNE). We analyze the memory evolution of our SMT→DMT Sketchy RNN. Left: Input image presented in its original 2D form. Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D t-SNE projection of the memory state trajectory over time.
Original abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Supervised Memory Training (SMT) to pretrain nonlinear RNNs without recurrence or BPTT. A Transformer encoder is first trained on a predictive-state objective to produce memory labels m_t that retain only information from the past needed to predict the future; the RNN is then trained via supervised one-step transitions (m_t, x_{t+1}) → m_{t+1}. This is claimed to yield time-parallel training, a stable O(1)-length gradient path between any tokens, and better long-range dependency capture than BPTT on language modeling and pixel-sequence tasks.

Significance. If the empirical claims hold, SMT would decouple memory-label generation from recurrent dynamics and remove the need to unroll RNNs, potentially allowing nonlinear RNNs to scale on long sequences where BPTT fails. No machine-checked proofs, reproducible code, or parameter-free derivations are presented, so the significance rests entirely on the (currently undetailed) experimental results.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim that SMT 'outperforms BPTT' on language modeling and pixel sequence modeling is stated without any quantitative results, baselines, ablation studies, or experimental protocol, rendering the claim impossible to evaluate.
  2. [Abstract] Abstract: the load-bearing assumption that Transformer-generated labels m_t produced by the predictive-state objective contain sufficient long-range state for an RNN trained only on one-step supervised transitions to maintain and propagate that information over hundreds of steps receives no supporting analysis, derivation, or ablation.
minor comments (1)
  1. The transition from the predictive-state loss to the supervised memory labels is described at a high level; an explicit equation relating the two objectives would clarify the method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that the abstract requires strengthening to make the empirical claims and underlying assumptions more self-contained and evaluable. We will revise the abstract accordingly while preserving the manuscript's core contributions.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central empirical claim that SMT 'outperforms BPTT' on language modeling and pixel sequence modeling is stated without any quantitative results, baselines, ablation studies, or experimental protocol, rendering the claim impossible to evaluate.

    Authors: We agree that the abstract would benefit from quantitative support. In the revised version we will incorporate specific metrics (e.g., perplexity reductions on language modeling and accuracy gains on long pixel sequences), explicit baselines, and a concise statement of the experimental protocol so that the performance claim can be evaluated directly from the abstract. revision: yes

  2. Referee: [Abstract] Abstract: the load-bearing assumption that Transformer-generated labels m_t produced by the predictive-state objective contain sufficient long-range state for an RNN trained only on one-step supervised transitions to maintain and propagate that information over hundreds of steps receives no supporting analysis, derivation, or ablation.

    Authors: The predictive-state objective is constructed precisely so that each m_t retains only the information required to predict future tokens; the one-step supervised transitions then train the RNN to reproduce this mapping. The main text provides empirical ablations demonstrating improved long-range dependency capture relative to BPTT. Nevertheless, we acknowledge that the abstract itself offers no explicit justification or ablation summary. We will add a short clause in the abstract explaining the objective's design and will expand the discussion of label sufficiency in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with independent label generation step

Full rationale

The paper presents SMT as an empirical training procedure: a Transformer is trained separately on a predictive-state objective to produce memory labels m_t, after which an RNN is trained via supervised one-step transitions on those fixed labels. No equations, derivations, or self-citations are shown that reduce the claimed O(1) gradient path or performance gains to quantities fitted inside the RNN itself or to prior author results. The central performance claims rest on experimental comparisons to BPTT rather than any tautological reduction, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based solely on the abstract; the central assumption is that the predictive state objective yields useful memory labels, treated here as a domain assumption with no independent evidence supplied.

axioms (1)
  • domain assumption: A Transformer encoder trained on a predictive state objective retains only information from the past necessary to predict the future.
    This premise is invoked to justify the quality of the memory labels used for supervised RNN training.
invented entities (1)
  • Supervised Memory Training (SMT): no independent evidence
    purpose: Training procedure that decouples memory content from memory update via supervised labels
    Newly introduced method name and procedure.

pith-pipeline@v0.9.1-grok · 5730 in / 1317 out tokens · 41867 ms · 2026-06-28T02:06:37.902890+00:00 · methodology

Reference graph

Works this paper leans on

141 extracted references · 43 linked inside Pith

  1. [1]

    What learning algorithm is in-context learning? investigations with linear models

    Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022

  2. [2]

    Validity of the single processor approach to achieving large scale computing capabilities

    Gene M Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483–485, 1967

  3. [3]

    An evolutionary algorithm that constructs recurrent neural networks

    Peter J Angeline, Gregory M Saunders, and Jordan B Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1):54–65, 1994

  4. [4]

    Unitary evolution recurrent neural networks

    Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International conference on machine learning, pages 1120–1128. PMLR, 2016

  5. [5]

    Some informational aspects of visual perception

    Fred Attneave. Some informational aspects of visual perception. Psychological review, 61(3):183, 1954

  6. [6]

    Neural machine translation by jointly learning to align and translate

    Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016. URL https://arxiv.org/abs/1409.0473

  7. [7]

    An empirical evaluation of generic convolutional and recurrent networks for sequence modeling

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018

  8. [8]

    Deep equilibrium models

    Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in neural information processing systems, 32, 2019

  9. [9]

    xlstm: Extended long short-term memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024

  10. [10]

    Scheduled sampling for sequence prediction with recurrent neural networks

    Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks, 2015. URL https://arxiv.org/abs/1506.03099

  11. [11]

    Learning long-term dependencies with gradient descent is difficult

    Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

  12. [12]

    A brief history of intelligence: evolution, AI, and the five breakthroughs that made our brains

    Max S Bennett. A brief history of intelligence: evolution, AI, and the five breakthroughs that made our brains. HarperCollins, 2023

  13. [13]

    Prefix sums and their applications

    Guy E Blelloch. Prefix sums and their applications. 1990

  14. [14]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  15. [15]

    Recurrent memory transformer

    Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022

  16. [16]

    Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts

    Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts, 2026. URL https://arxiv.org/abs/2601.22156

  17. [17]

    On the properties of neural machine translation: Encoder–decoder approaches

    Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation, pages 103–111, 2014

  18. [18]

    Empirical evaluation of gated recurrent neural networks on sequence modeling

    Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014

  19. [19]

    Hierarchical multiscale recurrent neural networks

    Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016

  20. [20]

    A taxonomy of problems with fast parallel algorithms

    Stephen A Cook. A taxonomy of problems with fast parallel algorithms. Information and control, 64(1-3):2–22, 1985

  21. [21]

    Transformer-xl: Attentive language models beyond a fixed-length context

    Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978–2988, 2019

  22. [22]

    Deeppcr: Parallelizing sequential operations in neural networks

    Federico Danieli, Miguel Sarabia, Xavier Suau Cuadros, Pau Rodriguez, and Luca Zappella. Deeppcr: Parallelizing sequential operations in neural networks. Advances in Neural Information Processing Systems, 36:47598–47625, 2023

  23. [23]

    Pararnn: Unlocking parallel training of nonlinear rnns for large language models

    Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, and Luca Zappella. Pararnn: Unlocking parallel training of nonlinear rnns for large language models, 2025. URL https://arxiv.org/abs/2510.21450

  24. [24]

    Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

    Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

  25. [25]

    Practical learning of predictive state representations

    Carlton Downey, Ahmed Hefny, and Geoffrey Gordon. Practical learning of predictive state representations. arXiv preprint arXiv:1702.04121, 2017

  26. [26]

    Predictive state recurrent neural networks

    Carlton Downey, Ahmed Hefny, Boyue Li, Byron Boots, and Geoffrey Gordon. Predictive state recurrent neural networks, 2017. URL https://arxiv.org/abs/1705.09353

  27. [27]

    Tinystories: How small can language models be and still speak coherent english?

    Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023

  28. [28]

    Finding structure in time

    Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990

  29. [29]

    Addressing some limitations of transformers with feedback memory

    Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402, 2020

  30. [30]

    What is wrong with perplexity for long-context language modeling?

    Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling?, 2025. URL https://arxiv.org/abs/2410.23771

  31. [31]

    Were rnns all we needed?

    Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadeghi. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024

  32. [32]

    Neural thickets: Diverse task experts are dense around pretrained weights

    Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights. arXiv preprint arXiv:2603.12228, 2026

  33. [33]

    Scaling up test-time compute with latent reasoning: A recurrent depth approach

    Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025

  34. [34]

    Looped transformers as programmable computers

    Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023

  35. [35]

    Towards scalable and stable parallelization of nonlinear rnns

    Xavier Gonzalez, Andrew Warrington, Jimmy T Smith, and Scott W Linderman. Towards scalable and stable parallelization of nonlinear rnns. Advances in Neural Information Processing Systems, 37:5817–5849, 2024

  36. [36]

    Predictability enables parallelization of nonlinear state space models

    Xavier Gonzalez, Leo Kozachkov, David M Zoltowski, Kenneth L Clarkson, and Scott W Linderman. Predictability enables parallelization of nonlinear state space models. arXiv preprint arXiv:2508.16817, 2025

  37. [37]

    Neural turing machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014

  38. [38]

    On the tradeoffs of state space models and transformers

    Albert Gu. On the tradeoffs of state space models and transformers, 2025. URL https://goombalab.github.io/blog/2025/tradeoffs/

  39. [39]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

  40. [40]

    Efficiently modeling long sequences with structured state spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

  41. [41]

    Long timescale credit assignment in neuralnetworks with external memory

    Steven Stenberg Hansen. Long timescale credit assignment in neuralnetworks with external memory, 2017. URL https://arxiv.org/abs/1701.03866

  42. [42]

    Training large language models to reason in a continuous latent space

    Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space, 2025. URL https://arxiv.org/abs/2412.06769

  43. [43]

    Effective distillation to hybrid xlstm architectures

    Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Effective distillation to hybrid xlstm architectures, 2026. URL https://arxiv.org/abs/2603.15590

  44. [44]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  45. [45]

    The organization of behavior: A neuropsychological theory

    Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 1949

  46. [46]

    Recurrent predictive state policy networks

    Ahmed Hefny, Zita Marinho, Wen Sun, Siddhartha Srinivasa, and Geoffrey Gordon. Recurrent predictive state policy networks, 2018. URL https://arxiv.org/abs/1803.01489

  47. [47]

    Orthogonal recurrent neural networks with scaled cayley transform

    Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. In International Conference on Machine Learning, pages 1969–1978. PMLR, 2018

  48. [48]

    Hierarchical recurrent neural networks for long-term dependencies

    Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. Advances in neural information processing systems, 8, 1995

  49. [49]

    Data parallel algorithms

    W Daniel Hillis and Guy L Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986

  50. [50]

    Parallel models of associative memory: updated edition

    Geoffrey E Hinton and James A Anderson. Parallel models of associative memory: updated edition. Psychology press, 2014

  51. [51]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

  52. [52]

    Long short-term memory

    Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

  53. [53]

    Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001

    Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001

  54. [54]

    Training compute-optimal large language models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

  55. [55]

    The hardware lottery

    Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, 2021

  56. [56]

    Neural networks and physical systems with emergent collective computational abilities

    John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

  57. [57]

    Multilayer feedforward networks are universal approximators

    Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989

  58. [58]

    Universal artificial intelligence: Sequential decisions based on algorithmic probability, volume 300

    Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability, volume 300. Springer, 2005

  59. [59]

    Block-recurrent dynamics in vision transformers

    Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, and T Andy Keller. Block-recurrent dynamics in vision transformers. arXiv preprint arXiv:2512.19941, 2025

  60. [60]

    The “echo state” approach to analysing and training recurrent neural networks - with an erratum note

    Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German National Research Center for Information Technology GMD Technical Report, 148(34):13, 2001

  61. [61]

    Less is more: Recursive reasoning with tiny networks, 2025

    Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871

  62. [62]

    Planning and acting in partially observable stochastic domains

    Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998

  63. [63]

    Training recurrent neural networks via forward propagation through time

    Anil Kag and Venkatesh Saligrama. Training recurrent neural networks via forward propagation through time. In International Conference on Machine Learning, pages 5189–5200. PMLR, 2021

  64. [64]

    Principles of neural science, 2000

    Eric R Kandel. Principles of neural science, 2000

  65. [65]

    Scaling laws for neural language models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

  66. [66]

    Finetuning pretrained transformers into rnns, 2021

    Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A. Smith. Finetuning pretrained transformers into rnns, 2021. URL https://arxiv.org/abs/2103.13076

  67. [67]

    Transformers are rnns: Fast autoregressive transformers with linear attention

    Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

  68. [68]

    General-purpose in-context learning by meta-learning transformers

    Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, and Luke Metz. General-purpose in-context learning by meta-learning transformers. arXiv preprint arXiv:2212.04458, 2022

  69. [69]

    Three approaches to the quantitative definition of information

    Andrei Nikolaevic Kolmogorov. Three approaches to the quantitative definition of information. International journal of computer mathematics, 2(1-4):157–168, 1968

  70. [70]

    Professor forcing: A new algorithm for training recurrent networks

    Alex M Lamb, Anirudh Goyal ALIAS PARTH GOYAL, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. Advances in neural information processing systems, 29, 2016

  71. [71]

    The MNIST database of handwritten digits, 1998

    Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/

  72. [72]

    (how) do language models track state?, 2025

    Belinda Z. Li, Zifan Carl Guo, and Jacob Andreas. (how) do language models track state?, 2025. URL https://arxiv.org/abs/2503.02854

  74. [74]

    Noprop: Training neural networks without full back-propagation or full forward-propagation

    Qinyu Li, Yee Whye Teh, and Razvan Pascanu. Noprop: Training neural networks without full back-propagation or full forward-propagation. arXiv preprint arXiv:2503.24322, 2025

  75. [75]

    Parallelizing non-linear sequential models over the sequence length, 2024

    Yi Heng Lim, Qi Zhu, Joshua Selfridge, and Muhammad Firmansyah Kasim. Parallelizing non-linear sequential models over the sequence length, 2024. URL https://arxiv.org/abs/2309.12252

  76. [76]

    Predictive representations of state

    Michael Littman and Richard S Sutton. Predictive representations of state. Advances in neural information processing systems, 14, 2001

  77. [77]

    Transformers learn shortcuts to automata

    Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

  78. [78]

    The serial scaling hypothesis

    Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, and Yutong Bai. The serial scaling hypothesis. arXiv preprint arXiv:2507.12549, 2025

  79. [79]

    Reservoir computing approaches to recurrent neural network training

    Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer science review, 3(3):127–149, 2009

  80. [80]

    Parallelizing linear recurrent neural nets over sequence length,

    Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length,

Showing first 80 references.