Pretraining Recurrent Networks without Recurrence

Akarsh Kumar; Phillip Isola

arxiv: 2606.06479 · v1 · pith:AGTUNUJLnew · submitted 2026-06-04 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 02:06 UTC · model grok-4.3

keywords: recurrent neural networks, supervised memory training, parallel training, language modeling, pixel sequence modeling, long-range dependencies, transformer encoder

The pith

Supervised Memory Training reduces RNN pretraining to supervised one-step memory transitions using Transformer labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard training of recurrent networks uses backpropagation through time, which is sequential and can suffer from unstable gradients over long sequences. The paper proposes Supervised Memory Training to instead generate memory labels with a Transformer trained to predict future states, then supervise the RNN to learn memory updates in one step. This makes training parallel across time steps with fixed gradient length. Sympathetic readers would care if this allows nonlinear RNNs to scale to long sequences in tasks like language modeling where current methods struggle.

Core claim

By training a Transformer encoder on a predictive state objective to produce memory labels, SMT reduces RNN training to supervised learning on pairs (m_t, x_{t+1}) mapping to m_{t+1}, enabling time-parallel training of nonlinear RNNs with stable O(1) length gradient paths between any tokens without unrolling the network, and outperforming BPTT on language and pixel sequence modeling.
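The reduction to one-step supervised learning can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the one-layer `tanh` updater, the squared-error loss, the learning rate, and the synthetic "teacher" labels (generated here by a fixed random recurrence rather than a Transformer encoder) are all assumptions made for the example. The point it shows is that every pair (m_t, x_{t+1}) → m_{t+1} is an independent training example, so the loss is batched over time and the gradient passes through a single application of the updater.

```python
import numpy as np

def make_transition_pairs(memory_labels, inputs):
    """Turn one labelled sequence into independent one-step examples.

    memory_labels: (T + 1, d_m) array, row t is the teacher memory m_t.
    inputs:        (T + 1, d_x) array, row t is the input x_t.
    Returns features [m_t, x_{t+1}] and targets m_{t+1} for t = 0..T-1.
    """
    m_prev = memory_labels[:-1]   # m_0 ... m_{T-1}
    x_next = inputs[1:]           # x_1 ... x_T
    m_next = memory_labels[1:]    # m_1 ... m_T
    return np.concatenate([m_prev, x_next], axis=1), m_next

def updater(params, features):
    """Hypothetical updater: m_{t+1} is approximated by tanh([m_t, x_{t+1}] W + b)."""
    W, b = params
    return np.tanh(features @ W + b)

def smt_step(params, features, targets, lr=0.1):
    """One gradient step on the mean squared one-step error, batched over time."""
    W, b = params
    pred = updater(params, features)
    err = pred - targets
    # Gradient flows through one updater application only: nothing is unrolled.
    grad_pre = 2.0 * err * (1.0 - pred ** 2) / err.size
    new_params = (W - lr * (features.T @ grad_pre), b - lr * grad_pre.sum(axis=0))
    return new_params, float(np.mean(err ** 2))

rng = np.random.default_rng(0)
T, d_m, d_x = 256, 4, 3
inputs = rng.normal(size=(T + 1, d_x))

# Stand-in memory labels (assumption): a fixed random recurrence plays the
# role of the teacher. In SMT these would come from the trained encoder.
W_true = rng.normal(scale=0.5, size=(d_m + d_x, d_m))
labels = np.zeros((T + 1, d_m))
for t in range(T):
    labels[t + 1] = np.tanh(np.concatenate([labels[t], inputs[t + 1]]) @ W_true)

features, targets = make_transition_pairs(labels, inputs)
params = (np.zeros((d_m + d_x, d_m)), np.zeros(d_m))
losses = []
for _ in range(200):
    params, loss = smt_step(params, features, targets)
    losses.append(loss)
```

The only sequential loop above is the one that fabricates the stand-in labels; the training loop itself touches all time steps at once through `features` and `targets`.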

What carries the argument

The predictive state objective that trains the Transformer encoder to retain only past information necessary to predict the future, generating memory labels for supervised RNN training.

If this is right

RNN training becomes fully parallelizable in time without sequential unrolling.
Gradient paths between tokens have constant length independent of sequence length.
Various RNN architectures can be pretrained on language modeling and pixel sequences more effectively than with BPTT.
Memory content selection is decoupled from the memory update rule.
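The contrast between a gradient path that grows with sequence length and one of constant length can be made concrete with a toy calculation. The sketch below is a deliberate simplification: it uses a linear recurrence m_{t+1} = W m_t with a scaled identity matrix, whereas the paper concerns nonlinear RNNs, and the function name and scale values are chosen only for illustration. It shows that the Jacobian of the final state with respect to the initial state is a product of T factors under unrolling, so its norm shrinks or grows geometrically with T, while a one-step objective only ever involves a single factor.

```python
import numpy as np

def unrolled_jacobian_norm(W, steps):
    """Spectral norm of d m_T / d m_0 for the linear recurrence m_{t+1} = W m_t."""
    jac = np.eye(W.shape[0])
    for _ in range(steps):
        jac = W @ jac   # one more factor per unrolled time step
    return np.linalg.norm(jac, 2)

d = 8
W_small = 0.5 * np.eye(d)   # contracting recurrence (illustrative value)
W_large = 1.5 * np.eye(d)   # expanding recurrence (illustrative value)

horizons = [1, 10, 50]
vanishing = [unrolled_jacobian_norm(W_small, T) for T in horizons]
exploding = [unrolled_jacobian_norm(W_large, T) for T in horizons]

# A one-step supervised objective differentiates through a single factor,
# whatever the sequence length.
one_step = unrolled_jacobian_norm(W_small, 1)
```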

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach might apply to training other sequential models that currently rely on recurrence.
Combining SMT with larger-scale predictive encoders could improve label quality for complex temporal tasks.
It opens the possibility of hybrid models where Transformers generate targets for recurrent components at scale.

Load-bearing premise

The memory states generated by the Transformer encoder on the predictive state objective are sufficient for the RNN to learn effective long-range associations when trained via supervised one-step transitions.

What would settle it

An experiment where an RNN trained via SMT on a long-sequence task requiring dependencies across many steps shows no improvement over BPTT or fails to learn those associations would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.06479 by Akarsh Kumar, Phillip Isola.

**Figure 1.** BPTT vs SMT. Left: BPTT trains an RNN by recurrently unrolling the “updater” network in time, and backpropagating gradients through the entire graph. Right: Supervised Memory Training (SMT) trains an RNN with supervised learning on one-step memory transition labels, which are generated by a Transformer encoder-decoder model pair trained to produce predictive states. SMT is fully time-parallel. In SMT, the …

**Figure 2.** SMT vs DMT. SMT trains the RNN with behavior cloning on the encoder-generated memory states (off-policy imitation learning). DMT unrolls the RNN with its own memory states and then imitates the encoder trajectory (on-policy imitation learning). Figure design inspired by Jacobs et al. [59]. After SMT, the RNN achieves low one-step error in predicting (m_t, x_{t+1}) → m_{t+1} when m_t comes from the encoder. Howeve…

**Figure 3.** Synthetic Task Experiments. We evaluate BPTT, SMT, and SMT→DMT using five synthetic tasks with various settings to probe different properties of the algorithms. ∗ signifies that the SMT Encoder is the teacher Transformer (not an RNN) and is used only as a reference. Across all tasks and task settings, SMT→DMT outperforms BPTT, signaling that SMT has better gradient properties, memory utilization, state tra…

**Figure 4.** Attneave’s MNIST Generation. BPTT fails to effectively capture the long-range dependencies required for pixel sequence modeling, even with a GRU. SMT→DMT captures these dependencies with a non-gated RNN architecture. More samples are in Appendix

**Figure 5.** Attneave’s Sketchy Generation. SMT→DMT captures the stroke structure of human-drawn sketches through only pixel sequence modeling on sparse images. More samples are in Appendix

**Figure 6.** Sequential Compute and Data Efficiency. We sweep training hyperparameters for BPTT, SMT, and SMT→DMT and plot the resulting runs’ performance along sequential compute (SeqFLOPs) used and data processed (Tokens), across different RNN architectures and datasets. Runs are capped at one day on an H200 GPU. ∗ signifies that the SMT Encoder is the teacher Transformer (not an RNN) and is used only as a reference.…

**Figure 8.** Scaling Model Size. Sweeping the width and depth of the RNN and teacher shows smooth performance improvements in TinyStories. The RNN imitates the teacher performance better at larger scale.

**Figure 9.** Scaling Laws for Compression. We plot iso-loss contours for SMT-trained encoder models across a range of memory state sizes and training compute budgets. For a fixed target performance, SMT can achieve higher compression (smaller memory size) using additional compute. This result suggests a new property to scale when given more training compute: memory state compression. Neural scaling laws predict the …

**Figure 11.** Gradient Properties of BPTT and SMT. In the needle retrieval task, the loss is applied at the last timestep. BPTT propagates gradients backward through all timesteps, risking vanishing/exploding gradients for each m_t, depending on the weight initialization. SMT is non-recurrent and has an O(1) credit path length, making its gradients agnostic to initialization and time-horizon.

**Figure 12.** Impact of DMT across many runs with different SMT λ_dec and λ_dyn hyperparameters. Left: Applying DMT reduces the drift of the RNN rollout (measured with 1 − R^2 of the RNN memory prediction m̂_t against the encoder ground truth m_t). Middle: DMT significantly improves RNN performance across settings. Right: The one-step drift of the RNN only partially correlates with the rollout drift. Drift and DMT As described in Secti…

**Figure 13.** Sequence Length Generalization. An SMT→DMT trained RNN generalizes better than its Transformer teacher when evaluated on sequence lengths longer than training. The task is synthetic state tracking. Benefit of RNNs over Transformers SMT trains an RNN to mimic a Transformer encoder model, raising the question of why an RNN is needed at all, given the Transformer. RNNs are qualitatively more efficient tha…

**Figure 15.** Model Architecture for SMT. Left: The encoder reads the input context tokens and a set of learned register tokens, and outputs the memory, m_t, which is a set of memory tokens. The decoder takes in this memory and the future input tokens and predicts the future output tokens, using a causal mask. This setup forces information from the context to be compressed into a memory that is useful for predicting the…

**Figure 16.** Sweep of λ_dyn and λ_unif. Cell color indicates the RNN test loss for each setting. Top number in each cell is the RNN test loss. Bottom number in each cell shows L_unif. L_unif varies from 0 (collapsed latent space) to −4 (fully uniform latent space). memory tokens. The encoder is 8 layers deep, while the decoder is 4 layers deep. The RNN is also 8 layers deep, and its readout function is 4 layers dee…

**Figure 17.** Additional MNIST Samples. Here we give more examples of samples of MNIST images generated by the various methods. SMT→DMT RNN outperforms BPTT, even when BPTT is applied on a GRU architecture, in processing long-horizon information, which is required for pixel modeling.

**Figure 18.** Additional Sketchy Samples. Here we give more examples of samples of Sketchy images from the dataset and generated by SMT→DMT. Even in this hard sparse domain, SMT→DMT can capture the overall stroke structure, which requires integrating information over hundreds of pixels.

**Figure 19.** Analysis on Attneave’s Cat. We apply the SMT→DMT-trained RNN on Sketchy and evaluate it on the classic image of Attneave’s cat. The RNN reads the image pixel-by-pixel in raster scan order. Top Left: Input image presented in its original 2D form. Top Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory throughout sequence processing. Top R…

**Figure 20.** Generations of Attneave’s Cat. We apply the SMT→DMT-trained RNN on Sketchy and apply it to generate part of the image of Attneave’s cat. Given more of the image context, the RNN seems to understand the image better and make somewhat more plausible predictions.

**Figure 21.** RNN Memory Evolution on MNIST (PCA). We analyze the memory evolution of our SMT→DMT MNIST RNN. Left: Input image presented in its original 2D form. Middle: 3D PCA projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D PCA projection of the memory state trajectory over time.

**Figure 22.** RNN Memory Evolution on MNIST (t-SNE). We analyze the memory evolution of our SMT→DMT MNIST RNN. Left: Input image presented in its original 2D form. Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D t-SNE projection of the memory state trajectory over time.

**Figure 23.** RNN Memory Evolution on Sketchy (t-SNE). We analyze the memory evolution of our SMT→DMT Sketchy RNN. Left: Input image presented in its original 2D form. Middle: 3D t-SNE projection of the RNN memory state, visualized as RGB values over time, showing the evolution of memory during processing. Right: 2D t-SNE projection of the memory state trajectory over time.
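The captions for Figures 2 and 12 distinguish one-step error (the RNN is fed the encoder's memory at every step) from rollout drift (the RNN is fed its own memory), with drift reported as 1 − R^2 against the encoder's memory. A minimal sketch of that measurement follows. Everything specific in it is an assumption for illustration: the `tanh` step function, the random "teacher" weights standing in for the encoder, the perturbed "student" standing in for an imperfectly trained RNN, and the choice to pool R^2 over all time steps and memory dimensions.

```python
import numpy as np

def one_minus_r2(pred, target):
    """1 - R^2 of pred against target, pooled over time steps and dimensions."""
    ss_res = np.sum((target - pred) ** 2)
    ss_tot = np.sum((target - target.mean(axis=0)) ** 2)
    return ss_res / ss_tot

def step(W, m, x):
    """Hypothetical memory update m_{t+1} = tanh([m_t, x_{t+1}] W)."""
    return np.tanh(np.concatenate([m, x]) @ W)

rng = np.random.default_rng(0)
T, d_m, d_x = 200, 4, 3
xs = rng.normal(size=(T + 1, d_x))

# Stand-ins (assumptions): the teacher plays the encoder, the student is a
# slightly perturbed copy playing an RNN that imitates it imperfectly.
W_teacher = rng.normal(scale=0.5, size=(d_m + d_x, d_m))
W_student = W_teacher + rng.normal(scale=0.05, size=W_teacher.shape)

teacher = np.zeros((T + 1, d_m))
for t in range(T):
    teacher[t + 1] = step(W_teacher, teacher[t], xs[t + 1])

# Off-policy: every step starts from the teacher's memory m_t.
one_step = np.stack([step(W_student, teacher[t], xs[t + 1]) for t in range(T)])

# On-policy: the student consumes its own previous memory.
rollout = np.zeros((T + 1, d_m))
for t in range(T):
    rollout[t + 1] = step(W_student, rollout[t], xs[t + 1])

one_step_drift = one_minus_r2(one_step, teacher[1:])
rollout_drift = one_minus_r2(rollout[1:], teacher[1:])
```

Comparing `one_step_drift` with `rollout_drift` over many student models is the kind of analysis the Figure 12 caption describes; how the two relate in a given run depends on the dynamics and is not asserted here.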

Original abstract

Training recurrent neural networks (RNNs) requires assigning credit across long sequences of computations. Standard backpropagation through time (BPTT) addresses this problem poorly: it is sequential in time, limiting parallelism, and suffers from vanishing or exploding gradients, making long-range associations difficult to learn. We propose Supervised Memory Training (SMT), a method for training nonlinear RNNs that sidesteps recurrent credit propagation entirely by reducing RNN training to supervised learning on one-step memory transition labels $(m_t, x_{t+1}) \rightarrow m_{t+1}$. SMT acquires these memory labels by training a Transformer-based encoder on a predictive state objective--retaining only information from the past necessary to predict the future. By decoupling what to remember from how to update memory, SMT enables time-parallel RNN training with a stable $O(1)$ length gradient path between any two tokens--without ever unrolling the RNN. We find that SMT outperforms BPTT when pretraining various RNN architectures on tasks like language modeling and pixel sequence modeling. SMT enables nonlinear RNNs to better capture long-range dependencies and train in parallel, potentially unlocking the scaling of models that build temporal abstractions of past experience.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SMT reframes RNN pretraining as supervised one-step memory transitions using upstream Transformer labels, but the abstract gives no results or checks on whether those labels actually carry long-range state.

The main thing to know is that this paper reduces RNN training to a supervised problem: a Transformer is first trained on a predictive-state objective to produce memory labels m_t, then the RNN is trained to map (m_t, x_{t+1}) to m_{t+1} without ever unrolling or backpropagating through time. This is presented as new and lets training run in parallel with O(1) gradient paths.

It does a clean job stating the standard problems with BPTT (sequential nature and unstable gradients) and separating the question of what to remember from how the memory updates. The framing is straightforward and avoids some of the usual circularity in RNN credit assignment.

The soft spots are straightforward. The abstract claims SMT outperforms BPTT on language modeling and pixel sequences, yet supplies no numbers, baselines, ablations, or details on the predictive-state loss. There is also no check on what information actually ends up in the memory labels or whether the RNN can chain the one-step predictions stably over long horizons. The stress-test concern lands: if the labels mostly capture local patterns, the claimed long-range gains would not follow.

This is for readers working on alternatives to BPTT or ways to scale temporal models. Someone looking for a worked-out method with evidence will not find it here; someone interested in the high-level decoupling might pick up the idea. The work deserves a serious referee to see whether the full experiments close the gap between the framing and the results.

Referee Report

2 major / 1 minor

Summary. The paper proposes Supervised Memory Training (SMT) to pretrain nonlinear RNNs without recurrence or BPTT. A Transformer encoder is first trained on a predictive-state objective to produce memory labels m_t that retain only information from the past needed to predict the future; the RNN is then trained via supervised one-step transitions (m_t, x_{t+1}) → m_{t+1}. This is claimed to yield time-parallel training, a stable O(1)-length gradient path between any tokens, and better long-range dependency capture than BPTT on language modeling and pixel-sequence tasks.

Significance. If the empirical claims hold, SMT would decouple memory-label generation from recurrent dynamics and remove the need to unroll RNNs, potentially allowing nonlinear RNNs to scale on long sequences where BPTT fails. No machine-checked proofs, reproducible code, or parameter-free derivations are presented, so the significance rests entirely on the (currently undetailed) experimental results.

major comments (2)

[Abstract] The central empirical claim that SMT 'outperforms BPTT' on language modeling and pixel sequence modeling is stated without any quantitative results, baselines, ablation studies, or experimental protocol, rendering the claim impossible to evaluate.
[Abstract] The load-bearing assumption that Transformer-generated labels m_t produced by the predictive-state objective contain sufficient long-range state for an RNN trained only on one-step supervised transitions to maintain and propagate that information over hundreds of steps receives no supporting analysis, derivation, or ablation.

minor comments (1)

The transition from the predictive-state loss to the supervised memory labels is described at a high level; an explicit equation relating the two objectives would clarify the method.
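One way such an equation pair could look is sketched below. This is a reading of the abstract and the Figure 15 caption, not the paper's own formulation: the symbols E_θ (encoder), p_φ (decoder), f_ψ (RNN updater), the prediction horizon K, and the squared-error form of the transition loss are all assumptions. The figure captions also mention weights λ_dec, λ_dyn, and λ_unif, which are not reconstructed here.

```latex
% Sketch only: notation beyond m_t and x_t is assumed, not taken from the paper.
\begin{align}
  m_t &= E_\theta(x_{\le t}) \\
  \mathcal{L}_{\text{pred}}(\theta, \phi)
      &= -\,\mathbb{E}_t \sum_{k=1}^{K}
         \log p_\phi\!\left(x_{t+k} \mid m_t,\; x_{t+1:t+k-1}\right) \\
  \mathcal{L}_{\text{SMT}}(\psi)
      &= \mathbb{E}_t \left\lVert f_\psi(m_t, x_{t+1}) - m_{t+1} \right\rVert_2^2
\end{align}
```

Under this reading, the first loss decides what m_t contains, and the second treats the resulting m_t and m_{t+1} as fixed labels, so each term depends on a single application of f_ψ.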

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that the abstract requires strengthening to make the empirical claims and underlying assumptions more self-contained and evaluable. We will revise the abstract accordingly while preserving the manuscript's core contributions.

Referee: [Abstract] The central empirical claim that SMT 'outperforms BPTT' on language modeling and pixel sequence modeling is stated without any quantitative results, baselines, ablation studies, or experimental protocol, rendering the claim impossible to evaluate.

Authors: We agree that the abstract would benefit from quantitative support. In the revised version we will incorporate specific metrics (e.g., perplexity reductions on language modeling and accuracy gains on long pixel sequences), explicit baselines, and a concise statement of the experimental protocol so that the performance claim can be evaluated directly from the abstract. revision: yes
Referee: [Abstract] The load-bearing assumption that Transformer-generated labels m_t produced by the predictive-state objective contain sufficient long-range state for an RNN trained only on one-step supervised transitions to maintain and propagate that information over hundreds of steps receives no supporting analysis, derivation, or ablation.

Authors: The predictive-state objective is constructed precisely so that each m_t retains only the information required to predict future tokens; the one-step supervised transitions then train the RNN to reproduce this mapping. The main text provides empirical ablations demonstrating improved long-range dependency capture relative to BPTT. Nevertheless, we acknowledge that the abstract itself offers no explicit justification or ablation summary. We will add a short clause in the abstract explaining the objective's design and will expand the discussion of label sufficiency in the revision. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method with independent label generation step

The paper presents SMT as an empirical training procedure: a Transformer is trained separately on a predictive-state objective to produce memory labels m_t, after which an RNN is trained via supervised one-step transitions on those fixed labels. No equations, derivations, or self-citations are shown that reduce the claimed O(1) gradient path or performance gains to quantities fitted inside the RNN itself or to prior author results. The central performance claims rest on experimental comparisons to BPTT rather than any tautological reduction, making the derivation chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based solely on the abstract; the central assumption is that the predictive state objective yields useful memory labels, treated here as a domain assumption with no independent evidence supplied.

axioms (1)

domain assumption: A Transformer encoder trained on a predictive state objective retains only information from the past necessary to predict the future.
This premise is invoked to justify the quality of the memory labels used for supervised RNN training.

invented entities (1)

Supervised Memory Training (SMT): no independent evidence
purpose: Training procedure that decouples memory content from memory update via supervised labels
Newly introduced method name and procedure.

pith-pipeline@v0.9.1-grok · 5730 in / 1317 out tokens · 41867 ms · 2026-06-28T02:06:37.902890+00:00 · methodology

Reference graph

Works this paper leans on

141 extracted references · 43 linked inside Pith

[1]

What learning algorithm is in-context learning? investigations with linear models

Ekin Akyürek, Dale Schuurmans, Jacob Andreas, Tengyu Ma, and Denny Zhou. What learning algorithm is in-context learning? investigations with linear models. arXiv preprint arXiv:2211.15661, 2022

Pith/arXiv arXiv 2022
[2]

Validity of the single processor approach to achieving large scale computing capabilities

Gene M Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In Proceedings of the April 18-20, 1967, spring joint computer conference, pages 483–485, 1967

1967
[3]

An evolutionary algorithm that constructs recurrent neural networks

Peter J Angeline, Gregory M Saunders, and Jordan B Pollack. An evolutionary algorithm that constructs recurrent neural networks. IEEE transactions on Neural Networks, 5(1):54–65, 1994

1994
[4]

Unitary evolution recurrent neural networks

Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks. In International conference on machine learning, pages 1120–1128. PMLR, 2016

2016
[5]

Some informational aspects of visual perception

Fred Attneave. Some informational aspects of visual perception. Psychological review, 61(3): 183, 1954

1954
[6]

Neural machine translation by jointly learning to align and translate

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate, 2016. URL https://arxiv.org/abs/1409.0473

Pith/arXiv arXiv 2016
[7]

An empirical evaluation of generic convolutional and recurrent networks for sequence modeling

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271, 2018

Pith/arXiv arXiv 2018
[8]

Deep equilibrium models

Shaojie Bai, J Zico Kolter, and Vladlen Koltun. Deep equilibrium models. Advances in neural information processing systems, 32, 2019

2019
[9]

xlstm: Extended long short-term memory

Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xlstm: Extended long short-term memory. Advances in Neural Information Processing Systems, 37:107547–107603, 2024

2024
[10]

Scheduled sampling for sequence prediction with recurrent neural networks

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks, 2015. URL https://arxiv.org/abs/1506.03099

Pith/arXiv arXiv 2015
[11]

Learning long-term dependencies with gradient descent is difficult

Yoshua Bengio, Patrice Simard, and Paolo Frasconi. Learning long-term dependencies with gradient descent is difficult. IEEE transactions on neural networks, 5(2):157–166, 1994

1994
[12]

A brief history of intelligence: evolution, AI, and the five breakthroughs that made our brains

Max S Bennett. A brief history of intelligence: evolution, AI, and the five breakthroughs that made our brains. HarperCollins, 2023

2023
[13]

Prefix sums and their applications

Guy E Blelloch. Prefix sums and their applications. 1990

1990
[14]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

2020
[15]

Recurrent memory transformer

Aydar Bulatov, Yury Kuratov, and Mikhail Burtsev. Recurrent memory transformer. Advances in Neural Information Processing Systems, 35:11079–11091, 2022

2022
[16]

Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts

Yingfa Chen, Zhen Leng Thai, Zihan Zhou, Zhu Zhang, Xingyu Shen, Shuo Wang, Chaojun Xiao, Xu Han, and Zhiyuan Liu. Hybrid linear attention done right: Efficient distillation and effective architectures for extremely long contexts, 2026. URL https://arxiv.org/abs/2601.22156

arXiv 2026
[17]

On the properties of neural machine translation: Encoder–decoder approaches

Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, eighth workshop on syntax, semantics and structure in statistical translation, pages 103–111, 2014

2014
[18]

Empirical evaluation of gated recurrent neural networks on sequence modeling

Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014

Pith/arXiv arXiv 2014
[19]

Hierarchical multiscale recurrent neural networks

Junyoung Chung, Sungjin Ahn, and Yoshua Bengio. Hierarchical multiscale recurrent neural networks. arXiv preprint arXiv:1609.01704, 2016

Pith/arXiv arXiv 2016
[20]

A taxonomy of problems with fast parallel algorithms

Stephen A Cook. A taxonomy of problems with fast parallel algorithms. Information and control, 64(1-3):2–22, 1985

1985
[21]

Transformer-xl: Attentive language models beyond a fixed-length context

Zihang Dai, Zhilin Yang, Yiming Yang, Jaime G Carbonell, Quoc Le, and Ruslan Salakhutdinov. Transformer-xl: Attentive language models beyond a fixed-length context. In Proceedings of the 57th annual meeting of the association for computational linguistics, pages 2978–2988, 2019

2019
[22]

Deeppcr: Parallelizing sequential operations in neural networks

Federico Danieli, Miguel Sarabia, Xavier Suau Cuadros, Pau Rodriguez, and Luca Zappella. Deeppcr: Parallelizing sequential operations in neural networks. Advances in Neural Information Processing Systems, 36:47598–47625, 2023

2023
[23]

Pararnn: Unlocking parallel training of nonlinear rnns for large language models

Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, and Luca Zappella. Pararnn: Unlocking parallel training of nonlinear rnns for large language models, 2025. URL https://arxiv.org/abs/2510.21450

arXiv 2025
[24]

Transformers are ssms: Generalized models and efficient algorithms through structured state space duality

Tri Dao and Albert Gu. Transformers are ssms: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024

Pith/arXiv arXiv 2024
[25]

Practical learning of predictive state representations

Carlton Downey, Ahmed Hefny, and Geoffrey Gordon. Practical learning of predictive state representations. arXiv preprint arXiv:1702.04121, 2017

Pith/arXiv arXiv 2017
[26]

Predictive state recurrent neural networks

Carlton Downey, Ahmed Hefny, Boyue Li, Byron Boots, and Geoffrey Gordon. Predictive state recurrent neural networks, 2017. URL https://arxiv.org/abs/1705.09353

Pith/arXiv arXiv 2017
[27]

Tinystories: How small can language models be and still speak coherent english?

Ronen Eldan and Yuanzhi Li. Tinystories: How small can language models be and still speak coherent english? arXiv preprint arXiv:2305.07759, 2023

Pith/arXiv arXiv 2023
[28]

Finding structure in time

Jeffrey L Elman. Finding structure in time. Cognitive science, 14(2):179–211, 1990

1990
[29]

Addressing some limitations of transformers with feedback memory

Angela Fan, Thibaut Lavril, Edouard Grave, Armand Joulin, and Sainbayar Sukhbaatar. Addressing some limitations of transformers with feedback memory. arXiv preprint arXiv:2002.09402, 2020

arXiv 2020
[30]

What is wrong with perplexity for long-context language modeling?

Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenheng Zhang, Stefanie Jegelka, Jinyang Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling?, 2025. URL https://arxiv.org/abs/2410.23771

arXiv 2025
[31]

Were rnns all we needed?

Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadeghi. Were rnns all we needed? arXiv preprint arXiv:2410.01201, 2024

arXiv 2024
[32]

Neural thickets: Diverse task experts are dense around pretrained weights

Yulu Gan and Phillip Isola. Neural thickets: Diverse task experts are dense around pretrained weights. arXiv preprint arXiv:2603.12228, 2026

arXiv 2026
[33]

Scaling up test-time compute with latent reasoning: A recurrent depth approach

Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025

Pith/arXiv arXiv 2025
[34]

Looped transformers as programmable computers

Angeliki Giannou, Shashank Rajput, Jy-yong Sohn, Kangwook Lee, Jason D Lee, and Dimitris Papailiopoulos. Looped transformers as programmable computers. In International Conference on Machine Learning, pages 11398–11442. PMLR, 2023

2023
[35]

Towards scalable and stable parallelization of nonlinear rnns

Xavier Gonzalez, Andrew Warrington, Jimmy T Smith, and Scott W Linderman. Towards scalable and stable parallelization of nonlinear rnns. Advances in Neural Information Processing Systems, 37:5817–5849, 2024

2024
[36]

Predictability enables parallelization of nonlinear state space models

Xavier Gonzalez, Leo Kozachkov, David M Zoltowski, Kenneth L Clarkson, and Scott W Linderman. Predictability enables parallelization of nonlinear state space models. arXiv preprint arXiv:2508.16817, 2025

arXiv 2025
[37]

Neural turing machines

Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014

Pith/arXiv arXiv 2014
[38]

On the tradeoffs of state space models and transformers

Albert Gu. On the tradeoffs of state space models and transformers, 2025. URL https://goombalab.github.io/blog/2025/tradeoffs/

2025
[39]

Mamba: Linear-time sequence modeling with selective state spaces

Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023

Pith/arXiv arXiv 2023
[40]

Efficiently modeling long sequences with structured state spaces

Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396, 2021

Pith/arXiv arXiv 2021
[41]

Long timescale credit assignment in neuralnetworks with external memory

Steven Stenberg Hansen. Long timescale credit assignment in neuralnetworks with external memory, 2017. URL https://arxiv.org/abs/1701.03866

Pith/arXiv arXiv 2017
[42]

Training large language models to reason in a continuous latent space

Shibo Hao, Sainbayar Sukhbaatar, DiJia Su, Xian Li, Zhiting Hu, Jason Weston, and Yuandong Tian. Training large language models to reason in a continuous latent space, 2025. URL https://arxiv.org/abs/2412.06769

Pith/arXiv arXiv 2025
[43]

Effective distillation to hybrid xlstm architectures

Lukas Hauzenberger, Niklas Schmidinger, Thomas Schmied, Anamaria-Roberta Hartl, David Stap, Pieter-Jan Hoedt, Maximilian Beck, Sebastian Böck, Günter Klambauer, and Sepp Hochreiter. Effective distillation to hybrid xlstm architectures, 2026. URL https://arxiv.org/abs/2603.15590

arXiv 2026
[44]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[45]

The organization of behavior: A neuropsychological theory

Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology press, 1949

1949
[46]

Recurrent predictive state policy networks

Ahmed Hefny, Zita Marinho, Wen Sun, Siddhartha Srinivasa, and Geoffrey Gordon. Recurrent predictive state policy networks, 2018. URL https://arxiv.org/abs/1803.01489

Pith/arXiv arXiv 2018
[47]

Orthogonal recurrent neural networks with scaled cayley transform

Kyle Helfrich, Devin Willmott, and Qiang Ye. Orthogonal recurrent neural networks with scaled cayley transform. In International Conference on Machine Learning, pages 1969–1978. PMLR, 2018

2018
[48]

Hierarchical recurrent neural networks for long-term dependencies

Salah Hihi and Yoshua Bengio. Hierarchical recurrent neural networks for long-term dependencies. Advances in neural information processing systems, 8, 1995

1995
[49]

Data parallel algorithms

W Daniel Hillis and Guy L Steele Jr. Data parallel algorithms. Communications of the ACM, 29(12):1170–1183, 1986

1986
[50]

Parallel models of associative memory: updated edition

Geoffrey E Hinton and James A Anderson. Parallel models of associative memory: updated edition. Psychology press, 2014

2014
[51]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in neural information processing systems, 33:6840–6851, 2020

2020
[52]

Long short-term memory

Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997

1997
[53]

Gradient flow in recurrent nets: the difficulty of learning long-term dependencies

Sepp Hochreiter, Yoshua Bengio, Paolo Frasconi, Jürgen Schmidhuber, et al. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies, 2001

2001
[54]

Training compute-optimal large language models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, DDL Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 10, 2022

Pith/arXiv arXiv 2022
[55]

The hardware lottery

Sara Hooker. The hardware lottery. Communications of the ACM, 64(12):58–65, 2021

2021
[56]

Neural networks and physical systems with emergent collective computational abilities

John J Hopfield. Neural networks and physical systems with emergent collective computational abilities. Proceedings of the national academy of sciences, 79(8):2554–2558, 1982

1982
[57]

Multilayer feedforward networks are universal approximators

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural networks, 2(5):359–366, 1989

1989
[58]

Universal artificial intelligence: Sequential decisions based on algorithmic probability, volume 300

Marcus Hutter. Universal artificial intelligence: Sequential decisions based on algorithmic probability, volume 300. Springer, 2005

2005
[59]

Block-recurrent dynamics in vision transformers

Mozes Jacobs, Thomas Fel, Richard Hakim, Alessandra Brondetta, Demba Ba, and T Andy Keller. Block-recurrent dynamics in vision transformers. arXiv preprint arXiv:2512.19941, 2025

arXiv 2025
[60]

The “echo state” approach to analysing and training recurrent neural networks - with an erratum note

Herbert Jaeger. The “echo state” approach to analysing and training recurrent neural networks - with an erratum note. Bonn, Germany: German national research center for information technology gmd technical report, 148(34):13, 2001

2001
[61]

Less is more: Recursive reasoning with tiny networks

Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks, 2025. URL https://arxiv.org/abs/2510.04871

Pith/arXiv arXiv 2025
[62]

Planning and acting in partially observable stochastic domains

Leslie Pack Kaelbling, Michael L Littman, and Anthony R Cassandra. Planning and acting in partially observable stochastic domains. Artificial intelligence, 101(1-2):99–134, 1998

1998
[63]

Training recurrent neural networks via forward propagation through time

Anil Kag and Venkatesh Saligrama. Training recurrent neural networks via forward propagation through time. In International Conference on Machine Learning, pages 5189–5200. PMLR, 2021

2021
[64]

Principles of neural science

Eric R Kandel. Principles of neural science, 2000

2000
[65]

Scaling laws for neural language models

Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020

Pith/arXiv arXiv 2020
[66]

Finetuning pretrained transformers into rnns

Jungo Kasai, Hao Peng, Yizhe Zhang, Dani Yogatama, Gabriel Ilharco, Nikolaos Pappas, Yi Mao, Weizhu Chen, and Noah A. Smith. Finetuning pretrained transformers into rnns, 2021. URL https://arxiv.org/abs/2103.13076

arXiv 2021
[67]

Transformers are rnns: Fast autoregressive transformers with linear attention

Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and François Fleuret. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR, 2020

2020
[68]

General-purpose in-context learning by meta-learning transformers

Louis Kirsch, James Harrison, Jascha Sohl-Dickstein, and Luke Metz. General-purpose in-context learning by meta-learning transformers. arXiv preprint arXiv:2212.04458, 2022

arXiv 2022
[69]

Three approaches to the quantitative definition of information

Andrei Nikolaevic Kolmogorov. Three approaches to the quantitative definition of information. International journal of computer mathematics, 2(1-4):157–168, 1968

1968
[70]

Professor forcing: A new algorithm for training recurrent networks

Alex M Lamb, Anirudh Goyal, Ying Zhang, Saizheng Zhang, Aaron C Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. Advances in neural information processing systems, 29, 2016

2016
[71]

The MNIST database of handwritten digits

Yann LeCun and Corinna Cortes. The MNIST database of handwritten digits, 1998. URL http://yann.lecun.com/exdb/mnist/

1998
[72]

(how) do language models track state?

Belinda Z. Li, Zifan Carl Guo, and Jacob Andreas. (how) do language models track state? URL https://arxiv.org/abs/2503.02854

arXiv
[74]

Noprop: Training neural networks without full back-propagation or full forward-propagation

Qinyu Li, Yee Whye Teh, and Razvan Pascanu. Noprop: Training neural networks without full back-propagation or full forward-propagation. arXiv preprint arXiv:2503.24322, 2025

arXiv 2025
[75]

Parallelizing non-linear sequential models over the sequence length

Yi Heng Lim, Qi Zhu, Joshua Selfridge, and Muhammad Firmansyah Kasim. Parallelizing non-linear sequential models over the sequence length, 2024. URL https://arxiv.org/abs/2309.12252

arXiv 2024
[76]

Predictive representations of state

Michael Littman and Richard S Sutton. Predictive representations of state. Advances in neural information processing systems, 14, 2001

2001
[77]

Transformers learn shortcuts to automata

Bingbin Liu, Jordan T Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. Transformers learn shortcuts to automata. arXiv preprint arXiv:2210.10749, 2022

arXiv 2022
[78]

The serial scaling hypothesis

Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, and Yutong Bai. The serial scaling hypothesis. arXiv preprint arXiv:2507.12549, 2025

Pith/arXiv arXiv 2025
[79]

Reservoir computing approaches to recurrent neural network training

Mantas Lukoševičius and Herbert Jaeger. Reservoir computing approaches to recurrent neural network training. Computer science review, 3(3):127–149, 2009

2009
[80]

Parallelizing linear recurrent neural nets over sequence length

Eric Martin and Chris Cundy. Parallelizing linear recurrent neural nets over sequence length,

Showing first 80 references.
