Parallelizable memory recurrent units
Pith reviewed 2026-05-21 16:00 UTC · model grok-4.3
The pith
Memory recurrent units add persistent memory to parallelizable sequence models by using multistability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Engineering multistability into recurrent units creates multiple stable equilibria that hold information for arbitrary durations. Because the design also eliminates the transient dynamics that would otherwise prevent efficient parallelization, training can use the parallel scan algorithm without giving up the representational power that nonlinear RNNs have over monostable SSMs.
What carries the argument
Multistability in recurrent units that creates persistent memory equilibria while removing transient dynamics to enable parallel scan compatibility.
Load-bearing premise
That multistability can be engineered to provide persistent memory while fully eliminating transient dynamics so that parallel scan algorithms remain efficient and stable.
What would settle it
Showing that the BMRU hidden state drifts or loses stored information after a long sequence of zero inputs, or that the parallel scan version becomes numerically unstable for extended sequence lengths.
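The zero-input test described above can be sketched on toy units. This is a minimal illustration, not the paper's BMRU: `hysteretic_step`, its threshold `theta`, and the decay factor `a` are hypothetical choices made only to contrast a monostable unit with a bistable one.

```python
def leaky_linear_step(h, x, a=0.9):
    """Monostable linear unit: with zero input the state decays toward 0."""
    return a * h + x


def hysteretic_step(h, x, theta=0.5):
    """Toy bistable unit (hypothetical, NOT the paper's BMRU equations).

    Inputs beyond +/- theta switch the state; anything in between holds it.
    """
    if x > theta:
        return 1.0
    if x < -theta:
        return -1.0
    return h


def run(step, inputs, h0=0.0):
    """Apply `step` sequentially over `inputs` and return the final state."""
    h = h0
    for x in inputs:
        h = step(h, x)
    return h


# One write pulse followed by a long run of zero inputs.
inputs = [1.0] + [0.0] * 10_000
h_linear = run(leaky_linear_step, inputs)
h_bistable = run(hysteretic_step, inputs)
print(h_linear, h_bistable)
```

In this toy setting the linear state fades to a negligible value while the hysteretic state stays at the equilibrium it was written to. The same harness could be pointed at an actual BMRU implementation to look for drift.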
Original abstract
With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the uprising of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Memory Recurrent Units (MRUs), a new family of RNNs that use multistability to achieve persistent memory while eliminating transient dynamics to remain compatible with parallel scan algorithms used in state-space models (SSMs). A concrete instantiation, the Bistable Memory Recurrent Unit (BMRU), is derived as a proof-of-concept; the authors claim it supports efficient parallel training, performs well on long-term dependency tasks, and can be hybridized with SSMs to combine transient and persistent memory.
Significance. If the core construction is correct, the result would be significant: it directly targets the monostability limitation of current SSMs while preserving their parallel-training advantage, potentially enabling more expressive yet efficient sequence models for tasks that require both short-term dynamics and infinite-horizon memory.
major comments (2)
- [§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.
- [§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.
minor comments (2)
- [§2] Notation for the bistable fixed points and the linearization around them should be introduced earlier and used consistently when discussing the elimination of transients.
- [Abstract] The abstract states that BMRU 'achieves good results'; the main text should replace this qualitative phrase with quantitative metrics and statistical significance tests.
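The associativity requirement raised in the §3 comment can be made concrete with a short worked derivation. It is a sketch under the assumption that each step is an elementwise affine map h_t = a_t ⊙ h_{t-1} + b_t; the symbols a_t, b_t, c_t and this particular form of ⊛ are illustrative and are not taken from the manuscript.

```latex
% Assumed step: f_t(h) = a_t \odot h + b_t, encoded as the pair c_t = (a_t, b_t).
% Composition "apply c_i first, then c_j":
c_i \circledast c_j = \bigl(a_j \odot a_i,\; a_j \odot b_i + b_j\bigr)

% Associativity check on three consecutive steps:
(c_1 \circledast c_2) \circledast c_3
  = \bigl(a_3 \odot a_2 \odot a_1,\; a_3 \odot (a_2 \odot b_1 + b_2) + b_3\bigr)

c_1 \circledast (c_2 \circledast c_3)
  = \bigl(a_3 \odot a_2 \odot a_1,\; (a_3 \odot a_2) \odot b_1 + (a_3 \odot b_2 + b_3)\bigr)
```

Both expansions agree after distributing a_3, so a prefix scan over the pairs reproduces the sequential recurrence in exact arithmetic. A step such as h_t = tanh(a_t ⊙ h_{t-1} + b_t) does not compose into another map of the same two-parameter form, which is the gap the referee asks the authors to close.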
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed review. The comments identify key areas where additional rigor and experimental detail will strengthen the manuscript. We address each major comment below and describe the planned revisions.
- Referee: [§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.
  Authors: We agree that a rigorous demonstration of exact compatibility is essential. The BMRU update is constructed so that the multistable (bistable) component operates on a separate memory state whose evolution can be expressed via an associative operator that is independent of the transient nonlinearities. We will add an explicit derivation of the closed-form parallel scan in a new subsection of §3, including the definition of the associative binary operator, verification of its associativity, and a proof that the parallel and sequential executions produce identical results for any sequence length, with zero accumulation of approximation error. This will be supported by both algebraic derivation and numerical verification on long sequences.
  Revision: yes
- Referee: [§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.
  Authors: We acknowledge that the current experimental section would benefit from these additions to allow clearer attribution of gains. In the revised version we will report mean performance with standard deviation error bars over at least five independent runs. We will include sequence-length scaling plots for the evaluated tasks. We will also add direct comparisons against representative SSM baselines (e.g., S4, Mamba) and standard nonlinear RNNs (LSTM, GRU) using identical training protocols and the same long-term dependency benchmarks. These results will be presented in an expanded §4 with a new table and accompanying discussion.
  Revision: yes
Circularity Check
No significant circularity; new architectural proposal is self-contained
The paper proposes a new RNN family (MRUs/BMRU) that combines multistability for persistent memory with parallel-scan compatibility by eliminating transient dynamics. This is presented as an architectural construction rather than a derivation that reduces to fitted parameters, self-citations, or prior results by construction. No equations or claims in the abstract reduce the central performance or compatibility assertions to inputs; the multistability mechanism is introduced as a design choice, not derived from or equivalent to the parallelization property. The work remains independent of any load-bearing self-citation chain and does not rename known results or smuggle ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Multistability provides a source of persistent memory that can be decoupled from transient dynamics.
invented entities (2)
- Memory recurrent unit (MRU): no independent evidence
- Bistable memory recurrent unit (BMRU): no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations"
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
  Relation between the paper passage and the cited Recognition theorem.
  Passage: "BMRU update equations can be rewritten using an associative operator, therefore allowing the use of the parallel scan"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 4 Pith papers
- On the Importance of Multistability for Horizon Generalization in Reinforcement Learning
  Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs...
- A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology
  A fully tunable ultra-low-power current-mode bistable memory cell using nine standard CMOS transistors enables spike-based logic gates and noise-immune recurrent neural units.
- Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations
  BMRUs enable a direct one-to-one mapping from learned parameters to current-mode analog circuit elements, with discrete hysteretic outputs suppressing noise by at least 20x and supporting sub-microwatt RNN inference i...
- A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology
  A nine-transistor current-mode bistable memory cell in 180 nm CMOS is presented with independent tuning of threshold, hysteresis, and gain, shown via schematic simulations for spike-based logic gates and recurrent neu...
Reference graph
Works this paper leans on
- [1] S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997.
- [2] K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1724–1734.
- [3] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017.
- [4] A. Gu, K. Goel, and C. Re, “Efficiently Modeling Long Sequences with Structured State Spaces,” ArXiv, Oct. 2021.
- [5] A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” in First Conference on Language Modeling, Aug. 2024.
- [6] S. Boyd and L. Chua, “Fading memory and the problem of approximating nonlinear operators with Volterra series,” IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, Nov. 1985.
- [7] W. Merrill, J. Petty, and A. Sabharwal, “The illusion of state in state-space models,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24, vol. 235. Vienna, Austria: JMLR.org, Jul. 2024, pp. 35492–35506.
- [8] N. Vecoven, D. Ernst, and G. Drion, “A bio-inspired bistable recurrent cell allows for long-lasting memory,” PLOS ONE, vol. 16, no. 6, p. e0252676, Jun. 2021.
- [9] G. Lambrechts, F. De Geeter, N. Vecoven, D. Ernst, and G. Drion, “Warming up recurrent neural networks to maximise reachable multistability greatly improves learning,” Neural Networks, vol. 166, pp. 645–669, Sep. 2023.
- [10] J. T. H. Smith, A. Warrington, and S. Linderman, “Simplified State Space Layers for Sequence Modeling,” in The Eleventh International Conference on Learning Representations, Sep. 2022.
- [11] E. Martin and C. Cundy, “Parallelizing Linear Recurrent Neural Nets Over Sequence Length,” in International Conference on Learning Representations, Feb. 2018.
- [12] L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadeghi, “Were RNNs All We Needed?” 2024.
- [13] Z. Qin, S. Yang, and Y. Zhong, “Hierarchically Gated Recurrent Neural Network for Sequence Modeling,” Advances in Neural Information Processing Systems, vol. 36, pp. 33202–33221, Dec. 2023.
- [14] Y. H. Lim, Q. Zhu, J. Selfridge, and M. F. Kasim, “Parallelizing non-linear sequential models over the sequence length,” in The Twelfth International Conference on Learning Representations, Oct. 2023.
- [15] X. Gonzalez, A. Warrington, J. T. Smith, and S. W. Linderman, “Towards Scalable and Stable Parallelization of Nonlinear RNNs,” Advances in Neural Information Processing Systems, vol. 37, pp. 5817–5849, Dec. 2024.
- [16] F. Danieli, P. Rodriguez, M. Sarabia, X. Suau, and L. Zappella, “ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models,” 2025.
- [17] E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, Nov. 2019.
- [18] J. K. Eshraghian, M. Ward, E. O. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, and W. D. Lu, “Training Spiking Neural Networks Using Lessons From Deep Learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, Sep. 2023.
- [19] Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” Aug. 2013.
- [20] A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, “Resurrecting Recurrent Neural Networks for Long Sequences,” in Proceedings of the 40th International Conference on Machine Learning. PMLR, Jul. 2023, pp. 26670–26698.
- [21] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998.
- [22] Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” ArXiv, Apr. 2015.
- [23] Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, “Long Range Arena: A Benchmark for Efficient Transformers,” ArXiv, Nov. 2020.
- [24] C.-Y. Cheng, K.-H. Lin, and C.-W. Shih, “Multistability in Recurrent Neural Networks,” SIAM Journal on Applied Mathematics, vol. 66, no. 4, pp. 1301–1320, Jan. 2006.
- [25] K. Krishnamurthy, T. Can, and D. J. Schwab, “Theory of Gating in Recurrent Neural Networks,” Physical Review X, vol. 12, no. 1, p. 011011, Jan. 2022.
- [26] R. Edwards, “Analysis of continuous-time switching networks,” Physica D: Nonlinear Phenomena, vol. 146, no. 1-4, pp. 165–199, Nov. 2000.
- [27] M. Tournoy and B. Doiron, “A Step Towards Uncovering The Structure of Multistable Neural Networks,” 2022.
- [28] A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Re, “Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers,” Neural Information Processing Systems, 2021.
- [29] M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “xLSTM: Extended Long Short-Term Memory,” Advances in Neural Information Processing Systems, vol. 37, pp. 107547–107603, Dec. 2024.
- [30] H. Jiang, F. Qin, J. Cao, Y. Peng, and Y. Shao, “Recurrent neural network from adder’s perspective: Carry-lookahead RNN,” Neural Networks, vol. 144, pp. 297–306, Dec. 2021.
- [31] J. E. Zini, Y. Rizk, and M. Awad, “An Optimized Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks,” Journal of Artificial Intelligence and Soft Computing Research, vol. 11, no. 1, pp. 33–50, Jan. 2021.
- [32] J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training Deep Spiking Neural Networks Using Backpropagation,” Frontiers in Neuroscience, vol. 10, 2016.
- [33] N. P. Nieves and D. F. M. Goodman, “Sparse Spiking Gradient Descent,” in Neural Information Processing Systems, May 2021.
- [34] Z. Zeng, R. M. Goodman, and P. Smyth, “Learning Finite State Machines With Self-Clustering Recurrent Networks,” Neural Computation, vol. 5, no. 6, pp. 976–990, Nov. 1993.
- [35] R. Goodman and Z. Zeng, “A learning algorithm for multi-layer perceptrons with hard-limiting threshold units,” in Proceedings of IEEE Workshop on Neural Networks for Signal Processing, Sep. 1994, pp. 219–228.
- [36] S. Bai, J. Z. Kolter, and V. Koltun, “Deep Equilibrium Models,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019.
- [37] G. E. Blelloch, “Prefix sums and their applications,” School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA, Tech. Rep., 1990.
[Figure 12: Hysteresis bifurcation in the bistable recurrent cell. Three panels (ba = −1, ba = 0, ba = 1) plot ht against xt, with stable and unstable branches marked. This figure shows the solutions to the impli...]
- [38] The operator $\circledast$ is associative,
- [39] Performing the scan with the operator $\circledast$ on the array $[c_0, \dots, c_T]$ creates the array $[s_0, \dots, s_T]$, where $s_t = [y_t, h_t]$ and $y_t$ is defined as
  $$y_t = \begin{cases} a_0 & \text{if } t = 0, \\ a_t \odot y_{t-1} & \text{if } 0 < t < T. \end{cases}$$
  It results that the parallel scan can be used to solve this first-order linear recurrence as $\circledast$ is associative (point 1.), and the solutions $h_t$ will be the second values of the...
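The scan over a first-order linear recurrence described in this excerpt can be sketched in a few lines. The code assumes the recurrence has the form h_t = a_t * h_{t-1} + b_t and that the operator composes affine steps; the excerpt is truncated, so the function names (`combine`, `sequential`, `scan_recurrence`), the scalar states, and the Hillis-Steele sweep pattern are illustrative choices and may differ from the paper's construction.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b; `left` is applied first."""
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)


def sequential(a, b, h_init=0):
    """Reference loop for h_t = a_t * h_{t-1} + b_t."""
    out, h = [], h_init
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out


def scan_recurrence(a, b, h_init=0):
    """Same recurrence via an inclusive Hillis-Steele scan (about log2(T) sweeps)."""
    elems = list(zip(a, b))
    n, shift = len(elems), 1
    while shift < n:
        # Within one sweep every position only reads the previous sweep's
        # values, so the positions could be processed in parallel.
        elems = [
            combine(elems[i - shift], elems[i]) if i >= shift else elems[i]
            for i in range(n)
        ]
        shift *= 2
    # Each element now holds the cumulative product of a (first value) and
    # the accumulated offset (second value) for its prefix.
    return [a_acc * h_init + b_acc for a_acc, b_acc in elems]


a = [2, 3, 1, 0, 5]
b = [1, -1, 4, 2, 3]
print(sequential(a, b, h_init=1))
print(scan_recurrence(a, b, h_init=1))
```

With integer inputs the two functions agree exactly. With floating-point inputs the scan reorders the arithmetic, so agreement is only up to rounding, which is one reason a numerical check on long sequences is worth running alongside the algebraic proof.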