Parallelizable memory recurrent units

Damien Ernst; Florent De Geeter; Gaspard Lambrechts; Guillaume Drion

arxiv: 2601.09495 · v3 · pith:5NLHB6OMnew · submitted 2026-01-14 · 💻 cs.LG

Pith reviewed 2026-05-21 16:00 UTC · model grok-4.3

classification 💻 cs.LG

keywords: recurrent neural networks · state-space models · persistent memory · multistability · parallel scan · long-term dependencies · sequence modeling

The pith

Memory recurrent units add persistent memory to parallelizable sequence models by using multistability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces memory recurrent units as a new RNN family that achieves persistent memory through engineered multistability while supporting parallel computations by removing transient dynamics. This solves a core limitation of state-space models, which train efficiently in parallel but cannot retain information indefinitely because they are monostable. The bistable memory recurrent unit is presented as a concrete implementation that works with parallel scan algorithms and performs well on long-term dependency tasks. These units can also be combined with state-space models to build hybrid networks that keep both transient dynamics and persistent memory.

Core claim

By engineering multistability into recurrent units, the approach creates multiple stable equilibria that hold information for arbitrary durations while eliminating transient dynamics that would prevent efficient parallelization, allowing the parallel scan algorithm to be used for training without losing the representation power that nonlinear RNNs provide over monostable SSMs.
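To make the claim concrete, here is a minimal Python sketch of how a hysteresis-style memory can be written as a gated affine recurrence. It is an illustration under stated assumptions, not the paper's BMRU equations: the only detail taken from this page is that the memory is set when |ĥ_t| > β_t (Figure 7 caption). The written value ±β and the names `hysteresis_step` and `as_affine` are assumptions made for the example.

```python
def hysteresis_step(h_prev, candidate, beta):
    """One step of an assumed hysteresis update (illustrative, not the paper's equation)."""
    if abs(candidate) > beta:
        # Strong candidate: overwrite the memory with +beta or -beta (assumed written value).
        return beta if candidate > 0 else -beta
    # Weak candidate: keep the stored value unchanged.
    return h_prev


def as_affine(candidate, beta):
    """Same step as h_t = a_t * h_prev + b_t, with (a_t, b_t) independent of h_prev."""
    write = abs(candidate) > beta
    a = 0.0 if write else 1.0
    b = (beta if candidate > 0 else -beta) if write else 0.0
    return a, b


candidates = [2.0, 0.3, -0.1, -1.7, 0.0, 0.4]  # hypothetical candidate values
beta = 1.0                                      # hypothetical threshold
h_direct = h_affine = 0.0
direct_states, affine_states = [], []
for c in candidates:
    h_direct = hysteresis_step(h_direct, c, beta)
    a, b = as_affine(c, beta)
    h_affine = a * h_affine + b
    direct_states.append(h_direct)
    affine_states.append(h_affine)

print(direct_states)
print(affine_states)
```

Because the coefficients `(a, b)` depend only on the current input and not on the previous state, a recurrence of this form is the kind that a scan over an associative operator can evaluate. Whether the actual BMRU update reduces to exactly this form is something to verify against the paper.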

What carries the argument

Multistability in recurrent units that creates persistent memory equilibria while removing transient dynamics to enable parallel scan compatibility.

Load-bearing premise

That multistability can be engineered to provide persistent memory while fully eliminating transient dynamics so that parallel scan algorithms remain efficient and stable.

What would settle it

Showing that the BMRU hidden state drifts or loses stored information after a long sequence of zero inputs, or that the parallel scan version becomes numerically unstable for extended sequence lengths.
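A toy version of this check can be sketched in a few lines of Python. The example compares a scalar leaky linear unit (a stand-in for fading memory, not the LRU itself) with an assumed hysteresis unit on an input that starts with +1 and continues with zeros. The decay of 0.99, the threshold of 0.5, and both function names are arbitrary choices for illustration; the sketch says nothing about a trained BMRU and only shows the kind of signature such a test would look for.

```python
def leaky_final_state(inputs, decay=0.99):
    """Monostable linear unit: h_t = decay * h_{t-1} + x_t (assumed stand-in for fading memory)."""
    h = 0.0
    for x in inputs:
        h = decay * h + x
    return h


def hysteresis_final_state(inputs, beta=0.5):
    """Assumed hysteresis unit: write +/-beta when |x_t| > beta, otherwise keep the state."""
    h = 0.0
    for x in inputs:
        if abs(x) > beta:
            h = beta if x > 0 else -beta
    return h


for length in (100, 1_000, 10_000):
    inputs = [1.0] + [0.0] * (length - 1)
    print(length, leaky_final_state(inputs), hysteresis_final_state(inputs))
```

In this toy setting the leaky state shrinks geometrically with the number of zero inputs, while the hysteresis state stays at the value written on the first step. A real test would run the same protocol on the trained model and on its parallel-scan implementation.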

Figures

Figures reproduced from arXiv: 2601.09495 by Damien Ernst, Florent De Geeter, Gaspard Lambrechts, Guillaume Drion.

**Figure 2.** Convergence properties of the RNN unit described by equation (4b) for different values of input $x_t$ and either $\beta = -1.5$ (A) or $\beta = 1.5$ (B). (left) Internal state trajectories corresponding to different initial conditions $\tilde{h}_t[0] = h_{t-1}$ and 4 different input values $x_t$. (right) Solutions of the steady-state equation (5) for $\beta = -1.5$ (A) and $\beta = 1.5$ (B). The red arrows show convergence trajectories from…

**Figure 3.** Comparison between the implicit function and its approximation. This figure compares the solutions $(h_t, x_t)$ of the implicit function defined by equation (5) (in blue) with the approximation defined by equation (8) (in red) where $\alpha = 1$. Solid lines correspond to stable points, and dashed lines to unstable points. [Figure image omitted: axes $x_t$ and $h_t$, tick labels $-\alpha$, $0$, $\alpha$; annotations "Bistable region", "Memory is set", and $\pm\beta$.]

**Figure 5.** Surrogate gradient used in BMRU. (left) Comparison between the Heaviside function and the function defined by equation (13) used to approximate the gradient for different values of $\alpha_{\text{surr}}$. (right) Impact of $\alpha_{\text{surr}}$ on the surrogate gradient defined by equation (12). Adjacent text in the paper: "…in the layer. We use a classical fully connected layer to compute the vectors $\hat{h}_t$ and $\beta_t$ from the input vector $x_t$, adding the positivity constra…"

**Figure 6.** Persistent memory of BMRU compared to the fading memory of LRU. Small models with one LRU or BMRU layer of one unit have been trained on a simple benchmark whose inputs start with ±1 followed by 0's. The goal of the models is to output the first input at the last timestep. (left, center) Evolution of the states of LRU and BMRU with respect to the timesteps for the two possible inputs. As the LRU state is c…

**Figure 7.** Consistency of the gradient of BMRU with respect to time. Evolution of the gradient of the last state $h_T$ with respect to the first candidate $\hat{h}_1$ or $\beta_1$. Two cases are considered: either only the first timestep sets the memory, i.e. $|\hat{h}_1| > \beta_1$ and $|\hat{h}_t| < \beta_t$ $\forall t > 1$, or another timestep also sets the memory. Adjacent text in the paper: "…(figure 6, left) or fading memory (figure 6, center). However, BMRU encoding the information in st…"

**Figure 8.** Results on the copy-first-input benchmark. Test MSEs obtained by BMRU and LRU models on the copy-first-input benchmark for two sequence lengths: 100 and 300. [Figure image omitted: two panels plotting MSE (0 to 1) against sequence length ($10^2$ to $10^5$) for BMRU and LRU.]

**Figure 10.** Results on the permuted sequential MNIST. Accuracies obtained by BMRU, LRU and BMRU/LRU models on the permuted sequential MNIST, with or without black pixels added at the end of the sequences. Models with 2 and 3 recurrent blocks have been tested. Adjacent text in the paper: "…sequences. To do that, we created small test sets whose sequence lengths are a power of 10, starting from $10^2$ to $10^5$. Each test set is composed of 6000 samples…"

**Figure 11.** Results on the pathfinder benchmark. Each plot shows the accuracies obtained by BMRU models on the pathfinder benchmark, for different state dimensions (x-axes) and network depths (1, 2 and 3 recurrent blocks from left to right, respectively). Adjacent text in the paper (Section 5.4, Pathfinder): "The pathfinder benchmark is part of the long-range arena group of benchmarks [23]. It consists of 32x32 black and white images where lines are drawn …"

**Figure 12.** Hysteresis bifurcation in the bistable recurrent cell. This figure shows the solutions to the implicit function defined by equation (15) as well as their stability for three values of $b_a$. Adjacent text in the paper (Appendix A, Hysteresis bifurcation in the bistable recurrent cell): "The bistable recurrent cell [8] is described by the set of equations $c_t = \sigma(U_c x_t + w_c \odot h_{t-1} + b_c)$ (14a), $a_t = 1 + \tanh(U_a x_t + w_a \odot h_{t-1} + b_a)$ (14b), $h_t = c_t \odot h_{t-1} +$…"

Original abstract

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the uprising of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper introduces MRUs that use multistability for persistent memory in a form meant to stay compatible with parallel scan, but the key step of removing transient dynamics while preserving exact associativity needs close checking.

The main takeaway is that this work tries to give sequence models both long-term memory and fast parallel training by building multistable recurrent units instead of the usual monostable state-space models. They define a broader family called MRUs and then give one concrete bistable version, the BMRU, that they say works with the parallel scan algorithm and can be mixed with existing SSMs for hybrid networks that handle both transient and persistent behavior. They report decent results on long-term dependency tasks.

That framing is new enough to be worth noticing; most SSM papers stay inside linear or near-linear dynamics and accept the memory limit that comes with it. The hybrid suggestion is also practical for people who want to keep some of the speed of current models while adding memory that lasts across very long sequences.

The soft spot sits in the central construction. The authors claim they get rid of transient dynamics so the recurrence stays compatible with parallel scan. In a multistable discrete system, however, any switch between attractors is governed by the nonlinear map, and that map has to remain associative for the scan to stay exact and stable over hundreds or thousands of steps. If the paper only achieves this through a special parameterization or by making the inter-attractor map effectively linear in some hidden way, that detail matters a lot; without seeing the equations it is hard to tell whether the parallelization is exact or carries hidden approximation error that grows with sequence length. The abstract gives no error bars, no baseline numbers, and no derivation sketch, so the experimental support is still thin.

This paper is aimed at researchers who build or compare efficient long-sequence architectures and who care about the memory-efficiency tradeoff. A reader already working on SSM variants or parallel RNNs would get the most out of it. The work is coherent enough on its own terms to deserve a serious referee who can verify the scan compatibility and the actual performance numbers.
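The associativity requirement discussed in this letter can be made concrete for the standard case of a first-order linear recurrence. The derivation below is the textbook construction in notation assumed here ($a_t$, $b_t$ and the operator symbol are illustrative); it is not copied from the paper's equations and does not by itself show that the BMRU update has this form.

$$
h_t = a_t \odot h_{t-1} + b_t, \qquad (a_i, b_i) \circledast (a_j, b_j) = (a_j \odot a_i,\; a_j \odot b_i + b_j).
$$

$$
\begin{aligned}
\big((a_1,b_1)\circledast(a_2,b_2)\big)\circledast(a_3,b_3)
  &= (a_2 \odot a_1,\; a_2 \odot b_1 + b_2)\circledast(a_3,b_3) \\
  &= (a_3 \odot a_2 \odot a_1,\; a_3 \odot a_2 \odot b_1 + a_3 \odot b_2 + b_3), \\
(a_1,b_1)\circledast\big((a_2,b_2)\circledast(a_3,b_3)\big)
  &= (a_1,b_1)\circledast(a_3 \odot a_2,\; a_3 \odot b_2 + b_3) \\
  &= (a_3 \odot a_2 \odot a_1,\; a_3 \odot a_2 \odot b_1 + a_3 \odot b_2 + b_3).
\end{aligned}
$$

Both groupings agree, so the operator is associative in exact arithmetic whenever $a_t$ and $b_t$ do not depend on $h_{t-1}$. In floating point, regrouping can change rounding in general. If every $a_t$ is exactly 0 or 1 and $b_t = 0$ whenever $a_t = 1$ (an assumed gating structure), each combination returns one of its finite inputs unchanged, so no rounding error is introduced in that special case.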

Referee Report

2 major / 2 minor

Summary. The paper proposes Memory Recurrent Units (MRUs), a new family of RNNs that use multistability to achieve persistent memory while eliminating transient dynamics to remain compatible with parallel scan algorithms used in state-space models (SSMs). A concrete instantiation, the Bistable Memory Recurrent Unit (BMRU), is derived as a proof-of-concept; the authors claim it supports efficient parallel training, performs well on long-term dependency tasks, and can be hybridized with SSMs to combine transient and persistent memory.

Significance. If the core construction is correct, the result would be significant: it directly targets the monostability limitation of current SSMs while preserving their parallel-training advantage, potentially enabling more expressive yet efficient sequence models for tasks that require both short-term dynamics and infinite-horizon memory.

major comments (2)

[§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.
[§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.

minor comments (2)

[§2] Notation for the bistable fixed points and the linearization around them should be introduced earlier and used consistently when discussing the elimination of transients.
[Abstract] The abstract states that BMRU 'achieves good results'; the main text should replace this qualitative phrase with quantitative metrics and statistical significance tests.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify key areas where additional rigor and experimental detail will strengthen the manuscript. We address each major comment below and describe the planned revisions.

Referee: [§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.

Authors: We agree that a rigorous demonstration of exact compatibility is essential. The BMRU update is constructed so that the multistable (bistable) component operates on a separate memory state whose evolution can be expressed via an associative operator that is independent of the transient nonlinearities. We will add an explicit derivation of the closed-form parallel scan in a new subsection of §3, including the definition of the associative binary operator, verification of its associativity, and a proof that the parallel and sequential executions produce identical results for any sequence length, with zero accumulation of approximation error. This will be supported by both algebraic derivation and numerical verification on long sequences. revision: yes
Referee: [§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.

Authors: We acknowledge that the current experimental section would benefit from these additions to allow clearer attribution of gains. In the revised version we will report mean performance with standard deviation error bars over at least five independent runs. We will include sequence-length scaling plots for the evaluated tasks. We will also add direct comparisons against representative SSM baselines (e.g., S4, Mamba) and standard nonlinear RNNs (LSTM, GRU) using identical training protocols and the same long-term dependency benchmarks. These results will be presented in an expanded §4 with a new table and accompanying discussion. revision: yes

Circularity Check

0 steps flagged

No significant circularity; new architectural proposal is self-contained

The paper proposes a new RNN family (MRUs/BMRU) that combines multistability for persistent memory with parallel-scan compatibility by eliminating transient dynamics. This is presented as an architectural construction rather than a derivation that reduces to fitted parameters, self-citations, or prior results by construction. No equations or claims in the abstract reduce the central performance or compatibility assertions to inputs; the multistability mechanism is introduced as a design choice, not derived from or equivalent to the parallelization property. The work remains independent of any load-bearing self-citation chain and does not rename known results or smuggle ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the assumption that multistability can be isolated from transient dynamics to enable both persistent memory and parallel computation; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)

domain assumption: Multistability provides a source of persistent memory that can be decoupled from transient dynamics.
Invoked in the abstract to justify the design of MRUs.

invented entities (2)

Memory recurrent unit (MRU) · no independent evidence
purpose: New RNN family combining persistent memory and parallelizability.
Introduced as the core contribution.

Bistable memory recurrent unit (BMRU) · no independent evidence
purpose: Concrete proof-of-concept implementation compatible with parallel scan.
Presented as a specific realization of the MRU idea.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: "BMRU update equations can be rewritten using an associative operator, therefore allowing the use of the parallel scan"

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
cs.LG 2026-05 unverdicted novelty 7.0

Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs...
A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology
eess.SP 2026-05 unverdicted novelty 7.0

A fully tunable ultra-low-power current-mode bistable memory cell using nine standard CMOS transistors enables spike-based logic gates and noise-immune recurrent neural units.
Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations
cs.AR 2026-05 unverdicted novelty 6.0

BMRUs enable a direct one-to-one mapping from learned parameters to current-mode analog circuit elements, with discrete hysteretic outputs suppressing noise by at least 20x and supporting sub-microwatt RNN inference i...
A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology
eess.SP 2026-05 unverdicted novelty 6.0

A nine-transistor current-mode bistable memory cell in 180 nm CMOS is presented with independent tuning of threshold, hysteresis, and gain, shown via schematic simulations for spike-based logic gates and recurrent neu...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 3 Pith papers

[1] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
[2] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1724–1734.
[3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
[4] A. Gu, K. Goel, and C. Re, “Efficiently Modeling Long Sequences with Structured State Spaces,” ArXiv, Oct. 2021.
[5] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” in First Conference on Language Modeling, Aug. 2024.
[6] S. Boyd and L. Chua, “Fading memory and the problem of approximating nonlinear operators with Volterra series,” IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, Nov. 1985.
[7] W. Merrill, J. Petty, and A. Sabharwal, “The illusion of state in state-space models,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24, vol. 235. Vienna, Austria: JMLR.org, Jul. 2024, pp. 35 492–35 506.
[8] N. Vecoven, D. Ernst, and G. Drion, “A bio-inspired bistable recurrent cell allows for long-lasting memory,” PLOS ONE, vol. 16, no. 6, p. e0252676, Jun. 2021.
[9] G. Lambrechts, F. De Geeter, N. Vecoven, D. Ernst, and G. Drion, “Warming up recurrent neural networks to maximise reachable multistability greatly improves learning,” Neural Networks, vol. 166, pp. 645–669, Sep. 2023.
[10] J. T. H. Smith, A. Warrington, and S. Linderman, “Simplified State Space Layers for Sequence Modeling,” in The Eleventh International Conference on Learning Representations, Sep. 2022.
[11] E. Martin and C. Cundy, “Parallelizing Linear Recurrent Neural Nets Over Sequence Length,” in International Conference on Learning Representations, Feb. 2018.
[12] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadeghi, “Were RNNs All We Needed?” 2024.
[13] Z. Qin, S. Yang, and Y. Zhong, “Hierarchically Gated Recurrent Neural Network for Sequence Modeling,” Advances in Neural Information Processing Systems, vol. 36, pp. 33 202–33 221, Dec. 2023.
[14] Y. H. Lim, Q. Zhu, J. Selfridge, and M. F. Kasim, “Parallelizing non-linear sequential models over the sequence length,” in The Twelfth International Conference on Learning Representations, Oct. 2023.
[15] X. Gonzalez, A. Warrington, J. T. Smith, and S. W. Linderman, “Towards Scalable and Stable Parallelization of Nonlinear RNNs,” Advances in Neural Information Processing Systems, vol. 37, pp. 5817–5849, Dec. 2024.
[16] F. Danieli, P. Rodriguez, M. Sarabia, X. Suau, and L. Zappella, “ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models,” 2025.
[17] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, Nov. 2019.
[18] J. K. Eshraghian, M. Ward, E. O. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, and W. D. Lu, “Training Spiking Neural Networks Using Lessons From Deep Learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, Sep. 2023.
[19] Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” Aug. 2013.
[20] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, “Resurrecting Recurrent Neural Networks for Long Sequences,” in Proceedings of the 40th International Conference on Machine Learning. PMLR, Jul. 2023, pp. 26 670–26 698.
[21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
[22] Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” ArXiv, Apr. 2015.
[23] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, “Long Range Arena: A Benchmark for Efficient Transformers,” ArXiv, Nov. 2020.
[24] C.-Y. Cheng, K.-H. Lin, and C.-W. Shih, “Multistability in Recurrent Neural Networks,” SIAM Journal on Applied Mathematics, vol. 66, no. 4, pp. 1301–1320, Jan. 2006.
[25] K. Krishnamurthy, T. Can, and D. J. Schwab, “Theory of Gating in Recurrent Neural Networks,” Physical Review X, vol. 12, no. 1, p. 011011, Jan. 2022.
[26] R. Edwards, “Analysis of continuous-time switching networks,” Physica D: Nonlinear Phenomena, vol. 146, no. 1-4, pp. 165–199, Nov. 2000.
[27] M. Tournoy and B. Doiron, “A Step Towards Uncovering The Structure of Multistable Neural Networks,” 2022.
[28] A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Re, “Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers,” Neural Information Processing Systems, 2021.
[29] M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “xLSTM: Extended Long Short-Term Memory,” Advances in Neural Information Processing Systems, vol. 37, pp. 107 547–107 603, Dec. 2024.
[30] H. Jiang, F. Qin, J. Cao, Y. Peng, and Y. Shao, “Recurrent neural network from adder’s perspective: Carry-lookahead RNN,” Neural Networks, vol. 144, pp. 297–306, Dec. 2021.
[31] J. E. Zini, Y. Rizk, and M. Awad, “An Optimized Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks,” Journal of Artificial Intelligence and Soft Computing Research, vol. 11, no. 1, pp. 33–50, Jan. 2021.
[32] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training Deep Spiking Neural Networks Using Backpropagation,” Frontiers in Neuroscience, vol. 10, 2016.
[33] N. P. Nieves and D. F. M. Goodman, “Sparse Spiking Gradient Descent,” in Neural Information Processing Systems, May 2021.
[34] Z. Zeng, R. M. Goodman, and P. Smyth, “Learning Finite State Machines With Self-Clustering Recurrent Networks,” Neural Computation, vol. 5, no. 6, pp. 976–990, Nov. 1993.
[35] R. Goodman and Z. Zeng, “A learning algorithm for multi-layer perceptrons with hard-limiting threshold units,” in Proceedings of IEEE Workshop on Neural Networks for Signal Processing, Sep. 1994, pp. 219–228.
[36] S. Bai, J. Z. Kolter, and V. Koltun, “Deep Equilibrium Models,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019.
[37] G. E. Blelloch, “Prefix sums and their applications,” School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep., 1990.
[38] (appendix text, not a bibliographic entry) The operator $\circledast$ is associative,

[39] (appendix text, not a bibliographic entry) Performing the scan with the operator $\circledast$ on the array $[c_0, \dots, c_T]$ creates the array $[s_0, \dots, s_T]$ where $s_t = [y_t, h_t]$ and $y_t$ is defined as:

$$
y_t = \begin{cases} a_0 & \text{if } t = 0, \\ a_t \odot y_{t-1} & \text{if } 0 < t < T. \end{cases}
$$

It results that the parallel scan can be used to solve this first-order linear recurrence as $\circledast$ is associative (point 1.), and the solutions $h_t$ will be the second values of the...
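The scan described in the excerpt above can be sketched in plain Python. This is a minimal illustration under stated assumptions rather than the paper's implementation: elements are pairs `(a_t, b_t)` of a scalar recurrence h_t = a_t * h_{t-1} + b_t with initial state 0, the scan is a Hillis-Steele scheme whose rounds are simulated sequentially here, and the names `combine`, `inclusive_scan` and `sequential_states` are hypothetical.

```python
def combine(left, right):
    """Compose two affine maps h -> a*h + b, with `left` applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def inclusive_scan(elements, op):
    """Hillis-Steele inclusive scan: about log2(n) rounds, each independent across positions."""
    result = list(elements)
    offset = 1
    while offset < len(result):
        previous = result
        # Every position in a round reads only the previous round, so a round could run in parallel.
        result = [
            previous[i] if i < offset else op(previous[i - offset], previous[i])
            for i in range(len(previous))
        ]
        offset *= 2
    return result


def sequential_states(elements, h0=0.0):
    """Reference implementation: step through the recurrence one element at a time."""
    states, h = [], h0
    for a, b in elements:
        h = a * h + b
        states.append(h)
    return states


# Hypothetical gated elements: (0, v) writes the value v, (1, 0) keeps the previous state.
elements = [(0.0, 1.0), (1.0, 0.0), (1.0, 0.0), (0.0, -1.0), (1.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
scanned = inclusive_scan(elements, combine)
# With h0 = 0, the state at step t is the second component of the t-th prefix.
scan_states = [b for _, b in scanned]
print(scan_states)
print(sequential_states(elements))
```

The operand order in `combine` matters because composition of affine maps is associative but not commutative. On a parallel device each round would be executed across all positions at once; the list comprehension here only mimics that data flow.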
