
arxiv: 2601.09495 · v3 · pith:5NLHB6OMnew · submitted 2026-01-14 · 💻 cs.LG

Parallelizable memory recurrent units

Pith reviewed 2026-05-21 16:00 UTC · model grok-4.3

classification 💻 cs.LG
keywords recurrent neural networks · state-space models · persistent memory · multistability · parallel scan · long-term dependencies · sequence modeling

The pith

Memory recurrent units add persistent memory to parallelizable sequence models by using multistability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces memory recurrent units as a new RNN family that achieves persistent memory through engineered multistability while supporting parallel computations by removing transient dynamics. This solves a core limitation of state-space models, which train efficiently in parallel but cannot retain information indefinitely because they are monostable. The bistable memory recurrent unit is presented as a concrete implementation that works with parallel scan algorithms and performs well on long-term dependency tasks. These units can also be combined with state-space models to build hybrid networks that keep both transient dynamics and persistent memory.

Core claim

By engineering multistability into recurrent units, the approach creates multiple stable equilibria that can hold information for arbitrary durations. At the same time, it eliminates the transient dynamics that would prevent efficient parallelization, so the parallel scan algorithm can be used for training without giving up the representational power that nonlinear RNNs have over monostable SSMs.
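As a rough illustration of this mechanism, here is a minimal sketch of a bistable unit with hysteresis. The update rule is an assumption pieced together from the figure captions on this page (a state at ±α, a threshold β, and a memory that is set only when the candidate exceeds β in magnitude). The names `bistable_step`, `rollout`, `h_cand`, `alpha` and `beta` are hypothetical, and this is not the authors' implementation.

```python
def bistable_step(h_prev, h_cand, beta, alpha=1.0):
    """One update of a hypothetical bistable unit with hysteresis.

    Assumed rule: if the candidate magnitude exceeds the threshold beta,
    the state jumps directly to +alpha or -alpha (no transient); otherwise
    the previous state is held unchanged.
    """
    if abs(h_cand) > beta:
        return alpha if h_cand > 0 else -alpha
    return h_prev


def rollout(candidates, betas, h0=0.0, alpha=1.0):
    """Apply bistable_step sequentially and return every visited state."""
    states = []
    h = h0
    for h_cand, beta in zip(candidates, betas):
        h = bistable_step(h, h_cand, beta, alpha)
        states.append(h)
    return states


# A strong positive candidate sets the memory, weak candidates leave it
# untouched, and a strong negative candidate flips it.
candidates = [2.0, 0.1, -0.3, 0.0, -1.7, 0.2]
betas = [1.0] * len(candidates)
states = rollout(candidates, betas)
print(states)
```

In this toy version the state only ever takes the values ±α (or the initial value), which is what makes it a memory rather than a leaky integrator.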

What carries the argument

Multistability in recurrent units that creates persistent memory equilibria while removing transient dynamics to enable parallel scan compatibility.

Load-bearing premise

That multistability can be engineered to provide persistent memory while fully eliminating transient dynamics so that parallel scan algorithms remain efficient and stable.

What would settle it

Showing that the BMRU hidden state drifts or loses stored information after a long sequence of zero inputs, or that the parallel scan version becomes numerically unstable for extended sequence lengths.
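The first half of this test can be sketched in a few lines. The code below is an assumption-laden toy, not the paper's experiment: it feeds one nonzero input followed by zeros to a leaky linear unit (a stand-in for a monostable SSM unit, with a hypothetical decay `lam`) and to a hypothetical bistable unit that treats the raw input as its candidate, then compares the final states.

```python
def linear_unit(inputs, lam=0.99):
    """Leaky linear recurrence h_t = lam * h_{t-1} + x_t (monostable for |lam| < 1)."""
    h = 0.0
    for x in inputs:
        h = lam * h + x
    return h


def bistable_unit(inputs, beta=0.5, alpha=1.0):
    """Hypothetical bistable unit: set the state to +/-alpha when |x| > beta, else hold."""
    h = 0.0
    for x in inputs:
        if abs(x) > beta:
            h = alpha if x > 0 else -alpha
    return h


T = 10_000
inputs = [1.0] + [0.0] * (T - 1)  # one informative input, then silence

h_lin = linear_unit(inputs)   # decays geometrically as lam ** (T - 1)
h_bi = bistable_unit(inputs)  # held at +alpha for the whole sequence

print(f"linear unit after {T} steps:   {h_lin:.3e}")
print(f"bistable unit after {T} steps: {h_bi}")
```

A drift in the bistable state under zero inputs, in a faithful implementation, would be the kind of evidence this section describes. The toy above cannot drift by construction, so it only shows what the claim means, not that it holds for the trained model.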

Figures

Figures reproduced from arXiv: 2601.09495 by Damien Ernst, Florent De Geeter, Gaspard Lambrechts, Guillaume Drion.

Figure 1
Figure 1: Monostability vs bistability in an RNN with internal clock. The figure shows internal state trajectories of the RNN unit described by equation (3) for different initial conditions h̃_t[0] = h_{t−1}. (left) Evolution of the system when β = −1.5. (right) Evolution of the system when β = 1.5. For N → ∞, we have the specific set of “convergent” RNNs, i.e. RNNs that can converge towards their steady-state values be…
Figure 2
Figure 2: Convergence properties of the RNN unit described by equation (4b) for different values of input x_t and either β = −1.5 (A) or β = 1.5 (B). (left) Internal state trajectories corresponding to different initial conditions h̃_t[0] = h_{t−1} and 4 different input values x_t. (right) Solutions of the steady-state equation (5) for β = −1.5 (A) and β = 1.5 (B). The red arrows show convergence trajectories from…
Figure 3
Figure 3: Comparison between the implicit function and its approximation. This figure compares the solutions (h_t, x_t) of the implicit function defined by equation (5) (in blue) with the approximation defined by equation (8) (in red), where α = 1. Solid lines correspond to stable points, and dashed lines to unstable points. [Figure residue: axes x_t and h_t, ticks at −α, 0, α, labels “Bistable region”, “Memory is set”, “±β”.]
Figure 5
Figure 5: Surrogate gradient used in BMRU. (left) Comparison between the Heaviside function and the function defined by equation (13) used to approximate the gradient for different values of α_surr. (right) Impact of α_surr on the surrogate gradient defined by equation (12). in the layer. We use a classical fully connected layer to compute the vectors ĥ_t and β_t from the input vector x_t, adding the positivity constra…
Figure 6
Figure 6: Persistent memory of BMRU compared to the fading memory of LRU. Small models with one LRU or BMRU layer of one unit have been trained on a simple benchmark whose inputs start with ±1 followed by 0’s. The goal of the models is to output the first input at the last timestep. (left, center) Evolution of the states of LRU and BMRU with respect to the timesteps for the two possible inputs. As the LRU state is c…
Figure 7
Figure 7: Consistency of the gradient of BMRU with respect to time. Evolution of the gradient of the last state h_T with respect to the first candidate ĥ_1 or β_1. Two cases are considered: either only the first timestep sets the memory, i.e. |ĥ_1| > β_1 and |ĥ_t| < β_t ∀t > 1, or another timestep also sets the memory. (figure 6, left) or fading memory (figure 6, center). However, BMRU encoding the information in st…
Figure 8
Figure 8: Results on the copy-first-input benchmark. Test MSEs obtained by BMRU and LRU models on the copy-first-input benchmark for two sequence lengths: 100 and 300. [Plot residue: two panels of MSE (0 to 1) versus sequence length (10^2 to 10^5) for BMRU and LRU.]
Figure 10
Figure 10: Results on the permuted sequential MNIST. Accuracies obtained by BMRU, LRU and BMRU/LRU models on the permuted sequential MNIST, with or without black pixels added at the end of the sequences. Models with 2 and 3 recurrent blocks have been tested. sequences. To do that, we created small test sets whose sequence lengths are a power of 10, from 10^2 to 10^5. Each test set is composed of 6000 samples…
Figure 11
Figure 11: Results on the pathfinder benchmark. Each plot shows the accuracies obtained by BMRU models on the pathfinder benchmark, for different state dimensions (x-axes) and network depths (1, 2 and 3 recurrent blocks from left to right, respectively). 5.4 Pathfinder. The pathfinder benchmark is part of the long-range arena group of benchmarks [23]. It consists of 32x32 black and white images where lines are drawn …
Figure 12
Figure 12: Hysteresis bifurcation in the bistable recurrent cell. This figure shows the solutions to the implicit function defined by equation (15) as well as their stability for three values of b_a. Appendix A, Hysteresis bifurcation in the bistable recurrent cell: The bistable recurrent cell [8] is described by the set of equations c_t = σ(U_c x_t + w_c ⊙ h_{t−1} + b_c) (14a), a_t = 1 + tanh(U_a x_t + w_a ⊙ h_{t−1} + b_a) (14b), h_t = c_t ⊙ h_{t−1} +…
read the original abstract

With the emergence of massively parallel processing units, parallelization has become a desirable property for new sequence models. The ability to parallelize the processing of sequences with respect to the sequence length during training is one of the main factors behind the rise of the Transformer architecture. However, Transformers lack efficiency at sequence generation, as they need to reprocess all past timesteps at every generation step. Recently, state-space models (SSMs) emerged as a more efficient alternative. These new kinds of recurrent neural networks (RNNs) keep the efficient update of the RNNs while gaining parallelization by getting rid of nonlinear dynamics (or recurrence). SSMs can reach state-of-the-art performance through the efficient training of potentially very large networks, but still suffer from limited representation capabilities. In particular, SSMs cannot exhibit persistent memory, or the capacity of retaining information for an infinite duration, because of their monostability. In this paper, we introduce a new family of RNNs, the memory recurrent units (MRUs), that combine the persistent memory capabilities of nonlinear RNNs with the parallelizable computations of SSMs. These units leverage multistability as a source of persistent memory, while getting rid of transient dynamics for efficient computations. We then derive a specific implementation as proof-of-concept: the bistable memory recurrent unit (BMRU). This new RNN is compatible with the parallel scan algorithm. We show that BMRU achieves good results in tasks with long-term dependencies, and can be combined with state-space models to create hybrid networks that are parallelizable and have transient dynamics as well as persistent memory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Memory Recurrent Units (MRUs), a new family of RNNs that use multistability to achieve persistent memory while eliminating transient dynamics to remain compatible with parallel scan algorithms used in state-space models (SSMs). A concrete instantiation, the Bistable Memory Recurrent Unit (BMRU), is derived as a proof-of-concept; the authors claim it supports efficient parallel training, performs well on long-term dependency tasks, and can be hybridized with SSMs to combine transient and persistent memory.

Significance. If the core construction is correct, the result would be significant: it directly targets the monostability limitation of current SSMs while preserving their parallel-training advantage, potentially enabling more expressive yet efficient sequence models for tasks that require both short-term dynamics and infinite-horizon memory.

major comments (2)
  1. [§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.
  2. [§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.
minor comments (2)
  1. [§2] Notation for the bistable fixed points and the linearization around them should be introduced earlier and used consistently when discussing the elimination of transients.
  2. [Abstract] The abstract states that BMRU 'achieves good results'; the main text should replace this qualitative phrase with quantitative metrics and statistical significance tests.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments identify key areas where additional rigor and experimental detail will strengthen the manuscript. We address each major comment below and describe the planned revisions.

read point-by-point responses
  1. Referee: [§3] §3 (BMRU construction): The central claim that multistability can be realized while 'getting rid of transient dynamics' so that the recurrence remains exactly compatible with the parallel scan algorithm is load-bearing. In a discrete-time multistable map, finite-time transitions between attractors are governed by the nonlinear update; unless the map is strictly affine between steps, the associative property required for stable parallel prefix computation does not hold exactly. The manuscript must exhibit the closed-form parallelization and prove that no approximation error accumulates over long sequences.

    Authors: We agree that a rigorous demonstration of exact compatibility is essential. The BMRU update is constructed so that the multistable (bistable) component operates on a separate memory state whose evolution can be expressed via an associative operator that is independent of the transient nonlinearities. We will add an explicit derivation of the closed-form parallel scan in a new subsection of §3, including the definition of the associative binary operator, verification of its associativity, and a proof that the parallel and sequential executions produce identical results for any sequence length, with zero accumulation of approximation error. This will be supported by both algebraic derivation and numerical verification on long sequences. revision: yes

  2. Referee: [§4] §4 (experimental validation): The reported results on long-term dependency tasks lack error bars, sequence-length scaling curves, and direct comparisons against both pure SSM baselines and standard nonlinear RNNs. Without these, it is impossible to assess whether the claimed performance gain is attributable to persistent memory or to other implementation details.

    Authors: We acknowledge that the current experimental section would benefit from these additions to allow clearer attribution of gains. In the revised version we will report mean performance with standard deviation error bars over at least five independent runs. We will include sequence-length scaling plots for the evaluated tasks. We will also add direct comparisons against representative SSM baselines (e.g., S4, Mamba) and standard nonlinear RNNs (LSTM, GRU) using identical training protocols and the same long-term dependency benchmarks. These results will be presented in an expanded §4 with a new table and accompanying discussion. revision: yes
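The exchange on the first major comment turns on whether a bistable update can be written as an associative composition. The sketch below shows one way this can work, under an assumption that is not confirmed by this page: the switching gate depends only on the current input, so each step is an affine map h -> a*h + b of the previous state. All names (`encode`, `combine`, `sequential_scan`, `doubling_scan`) are hypothetical, and the doubling scan is a plain-Python stand-in for a real parallel scan.

```python
def encode(h_cand, beta, alpha=1.0):
    """Map one timestep to an affine pair (a, b) with h_t = a * h_{t-1} + b.

    Assumed rule: when |h_cand| > beta the state is overwritten (a = 0,
    b = +/-alpha); otherwise it is held (a = 1, b = 0).
    """
    if abs(h_cand) > beta:
        return (0.0, alpha if h_cand > 0 else -alpha)
    return (1.0, 0.0)


def combine(left, right):
    """Compose two affine maps; `left` is applied first, then `right`."""
    a_l, b_l = left
    a_r, b_r = right
    return (a_r * a_l, a_r * b_l + b_r)


def sequential_scan(elems):
    """Reference inclusive scan, one element at a time."""
    out = [elems[0]]
    for e in elems[1:]:
        out.append(combine(out[-1], e))
    return out


def doubling_scan(elems):
    """Inclusive scan by recursive doubling.

    Each pass combines every element with the one `offset` positions
    earlier. All combinations within a pass are independent, so a pass
    could run in parallel; here it is an ordinary list comprehension.
    """
    out = list(elems)
    offset = 1
    while offset < len(out):
        out = [
            out[i] if i < offset else combine(out[i - offset], out[i])
            for i in range(len(out))
        ]
        offset *= 2
    return out


candidates = [0.2, 1.5, -0.4, 0.0, -2.0, 0.3, 0.9, 1.1]
elems = [encode(c, beta=1.0) for c in candidates]

# With h_0 = 0 as the initial state, the state at step t is the b component.
h_seq = [b for _, b in sequential_scan(elems)]
h_par = [b for _, b in doubling_scan(elems)]
print(h_seq)
print(h_par)
```

Because every coefficient in this toy is 0, 1 or ±α, the two scans agree exactly here. Whether that carries over to the paper's actual unit, and to floating-point execution on long sequences, is what the referee asks the authors to prove.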

Circularity Check

0 steps flagged

No significant circularity; new architectural proposal is self-contained

full rationale

The paper proposes a new RNN family (MRUs/BMRU) that combines multistability for persistent memory with parallel-scan compatibility by eliminating transient dynamics. This is presented as an architectural construction rather than a derivation that reduces to fitted parameters, self-citations, or prior results by construction. No equations or claims in the abstract reduce the central performance or compatibility assertions to inputs; the multistability mechanism is introduced as a design choice, not derived from or equivalent to the parallelization property. The work remains independent of any load-bearing self-citation chain and does not rename known results or smuggle ansatzes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the assumption that multistability can be isolated from transient dynamics to enable both persistent memory and parallel computation; no free parameters or invented entities are explicitly quantified in the abstract.

axioms (1)
  • domain assumption Multistability provides a source of persistent memory that can be decoupled from transient dynamics.
    Invoked in the abstract to justify the design of MRUs.
invented entities (2)
  • Memory recurrent unit (MRU) no independent evidence
    purpose: New RNN family combining persistent memory and parallelizability.
    Introduced as the core contribution.
  • Bistable memory recurrent unit (BMRU) no independent evidence
    purpose: Concrete proof-of-concept implementation compatible with parallel scan.
    Presented as a specific realization of the MRU idea.

pith-pipeline@v0.9.0 · 5820 in / 1335 out tokens · 35224 ms · 2026-05-21T16:00:10.838882+00:00 · methodology



Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. On the Importance of Multistability for Horizon Generalization in Reinforcement Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    Multistability is necessary for temporal horizon generalization in POMDPs, sufficient in simple tasks along with transient dynamics in complex ones, while monostable parallelizable RNNs like SSMs and gated linear RNNs...

  2. A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology

    eess.SP 2026-05 unverdicted novelty 7.0

    A fully tunable ultra-low-power current-mode bistable memory cell using nine standard CMOS transistors enables spike-based logic gates and noise-immune recurrent neural units.

  3. Hardware-Software Co-Design of Scalable, Energy-Efficient Analog Recurrent Computations

    cs.AR 2026-05 unverdicted novelty 6.0

    BMRUs enable a direct one-to-one mapping from learned parameters to current-mode analog circuit elements, with discrete hysteretic outputs suppressing noise by at least 20x and supporting sub-microwatt RNN inference i...

  4. A Fully Tunable Ultra-Low Power Current-Mode Memory Cell in Standard CMOS Technology

    eess.SP 2026-05 unverdicted novelty 6.0

    A nine-transistor current-mode bistable memory cell in 180 nm CMOS is presented with independent tuning of threshold, hysteresis, and gain, shown via schematic simulations for spike-based logic gates and recurrent neu...

Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · cited by 3 Pith papers

  1. [1]

    Long Short-Term Memory,

    S. Hochreiter and J. Schmidhuber, “Long Short-Term Memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, Nov. 1997

  2. [2]

    Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,

    K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, 2014, pp. 1724–1734

  3. [3]

    Attention is All you Need,

    A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is All you Need,” in Advances in Neural Information Processing Systems, vol. 30. Curran Associates, Inc., 2017

  4. [4]

    Efficiently Modeling Long Sequences with Structured State Spaces,

    A. Gu, K. Goel, and C. Re, “Efficiently Modeling Long Sequences with Structured State Spaces,” ArXiv, Oct. 2021

  5. [5]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces,

    A. Gu and T. Dao, “Mamba: Linear-Time Sequence Modeling with Selective State Spaces,” in First Conference on Language Modeling, Aug. 2024

  6. [6]

    Fading memory and the problem of approximating nonlinear operators with Volterra series,

    S. Boyd and L. Chua, “Fading memory and the problem of approximating nonlinear operators with Volterra series,” IEEE Transactions on Circuits and Systems, vol. 32, no. 11, pp. 1150–1161, Nov. 1985

  7. [7]

    The illusion of state in state-space models,

    W. Merrill, J. Petty, and A. Sabharwal, “The illusion of state in state-space models,” in Proceedings of the 41st International Conference on Machine Learning, ser. ICML’24, vol. 235. Vienna, Austria: JMLR.org, Jul. 2024, pp. 35 492–35 506

  8. [8]

    A bio-inspired bistable recurrent cell allows for long-lasting memory,

    N. Vecoven, D. Ernst, and G. Drion, “A bio-inspired bistable recurrent cell allows for long-lasting memory,” PLOS ONE, vol. 16, no. 6, p. e0252676, Jun. 2021

  9. [9]

    Warming up recurrent neural networks to maximise reachable multistability greatly improves learning,

    G. Lambrechts, F. De Geeter, N. Vecoven, D. Ernst, and G. Drion, “Warming up recurrent neural networks to maximise reachable multistability greatly improves learning,” Neural Networks, vol. 166, pp. 645–669, Sep. 2023

  10. [10]

    Simplified State Space Layers for Sequence Modeling,

    J. T. H. Smith, A. Warrington, and S. Linderman, “Simplified State Space Layers for Sequence Modeling,” in The Eleventh International Conference on Learning Representations, Sep. 2022

  11. [11]

    Parallelizing Linear Recurrent Neural Nets Over Sequence Length,

    E. Martin and C. Cundy, “Parallelizing Linear Recurrent Neural Nets Over Sequence Length,” in International Conference on Learning Representations, Feb. 2018

  12. [12]

    Were RNNs All We Needed?

    L. Feng, F. Tung, M. O. Ahmed, Y. Bengio, and H. Hajimirsadeghi, “Were RNNs All We Needed?” 2024

  13. [13]

    Hierarchically Gated Recurrent Neural Network for Sequence Modeling,

    Z. Qin, S. Yang, and Y. Zhong, “Hierarchically Gated Recurrent Neural Network for Sequence Modeling,” Advances in Neural Information Processing Systems, vol. 36, pp. 33 202–33 221, Dec. 2023

  14. [14]

    Parallelizing non-linear sequential models over the sequence length,

    Y. H. Lim, Q. Zhu, J. Selfridge, and M. F. Kasim, “Parallelizing non-linear sequential models over the sequence length,” in The Twelfth International Conference on Learning Representations, Oct. 2023

  15. [15]

    Towards Scalable and Stable Parallelization of Nonlinear RNNs,

    X. Gonzalez, A. Warrington, J. T. Smith, and S. W. Linderman, “Towards Scalable and Stable Parallelization of Nonlinear RNNs,” Advances in Neural Information Processing Systems, vol. 37, pp. 5817–5849, Dec. 2024

  16. [16]

    ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models,

    F. Danieli, P. Rodriguez, M. Sarabia, X. Suau, and L. Zappella, “ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models,” 2025

  17. [17]

    Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks,

    E. O. Neftci, H. Mostafa, and F. Zenke, “Surrogate Gradient Learning in Spiking Neural Networks: Bringing the Power of Gradient-Based Optimization to Spiking Neural Networks,” IEEE Signal Processing Magazine, vol. 36, no. 6, pp. 51–63, Nov. 2019

  18. [18]

    Training Spiking Neural Networks Using Lessons From Deep Learning,

    J. K. Eshraghian, M. Ward, E. O. Neftci, X. Wang, G. Lenz, G. Dwivedi, M. Bennamoun, D. S. Jeong, and W. D. Lu, “Training Spiking Neural Networks Using Lessons From Deep Learning,” Proceedings of the IEEE, vol. 111, no. 9, pp. 1016–1054, Sep. 2023

  19. [19]

    Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,

    Y. Bengio, N. Léonard, and A. Courville, “Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation,” Aug. 2013

  20. [20]

    Resurrecting Recurrent Neural Networks for Long Sequences,

    A. Orvieto, S. L. Smith, A. Gu, A. Fernando, C. Gulcehre, R. Pascanu, and S. De, “Resurrecting Recurrent Neural Networks for Long Sequences,” in Proceedings of the 40th International Conference on Machine Learning. PMLR, Jul. 2023, pp. 26 670–26 698

  21. [21]

    Gradient-based learning applied to document recognition,

    Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov. 1998

  22. [22]

    A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,

    Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks of Rectified Linear Units,” ArXiv, Apr. 2015

  23. [23]

    Long Range Arena: A Benchmark for Efficient Transformers,

    Y. Tay, M. Dehghani, S. Abnar, Y. Shen, D. Bahri, P. Pham, J. Rao, L. Yang, S. Ruder, and D. Metzler, “Long Range Arena: A Benchmark for Efficient Transformers,” ArXiv, Nov. 2020

  24. [24]

    Multistability in Recurrent Neural Networks,

    C.-Y. Cheng, K.-H. Lin, and C.-W. Shih, “Multistability in Recurrent Neural Networks,” SIAM Journal on Applied Mathematics, vol. 66, no. 4, pp. 1301–1320, Jan. 2006

  25. [25]

    Theory of Gating in Recurrent Neural Networks,

    K. Krishnamurthy, T. Can, and D. J. Schwab, “Theory of Gating in Recurrent Neural Networks,” Physical Review X, vol. 12, no. 1, p. 011011, Jan. 2022

  26. [26]

    Analysis of continuous-time switching networks,

    R. Edwards, “Analysis of continuous-time switching networks,” Physica D: Nonlinear Phenomena, vol. 146, no. 1-4, pp. 165–199, Nov. 2000

  27. [27]

    A Step Towards Uncovering The Structure of Multistable Neural Networks,

    M. Tournoy and B. Doiron, “A Step Towards Uncovering The Structure of Multistable Neural Networks,” 2022

  28. [28]

    Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers,

    A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Re, “Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers,” Neural Information Processing Systems, 2021

  29. [29]

    xLSTM: Extended Long Short-Term Memory,

    M. Beck, K. Pöppel, M. Spanring, A. Auer, O. Prudnikova, M. Kopp, G. Klambauer, J. Brandstetter, and S. Hochreiter, “xLSTM: Extended Long Short-Term Memory,” Advances in Neural Information Processing Systems, vol. 37, pp. 107 547–107 603, Dec. 2024

  30. [30]

    Recurrent neural network from adder’s perspective: Carry-lookahead RNN,

    H. Jiang, F. Qin, J. Cao, Y. Peng, and Y. Shao, “Recurrent neural network from adder’s perspective: Carry-lookahead RNN,” Neural Networks, vol. 144, pp. 297–306, Dec. 2021

  31. [31]

    An Optimized Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks,

    J. E. Zini, Y. Rizk, and M. Awad, “An Optimized Parallel Implementation of Non-Iteratively Trained Recurrent Neural Networks,” Journal of Artificial Intelligence and Soft Computing Research, vol. 11, no. 1, pp. 33–50, Jan. 2021

  32. [32]

    Training Deep Spiking Neural Networks Using Backpropagation,

    J. H. Lee, T. Delbruck, and M. Pfeiffer, “Training Deep Spiking Neural Networks Using Backpropagation,” Frontiers in Neuroscience, vol. 10, 2016

  33. [33]

    Sparse Spiking Gradient Descent,

    N. P. Nieves and D. F. M. Goodman, “Sparse Spiking Gradient Descent,” in Neural Information Processing Systems, May 2021

  34. [34]

    Learning Finite State Machines With Self-Clustering Recurrent Networks,

    Z. Zeng, R. M. Goodman, and P. Smyth, “Learning Finite State Machines With Self-Clustering Recurrent Networks,” Neural Computation, vol. 5, no. 6, pp. 976–990, Nov. 1993

  35. [35]

    A learning algorithm for multi-layer perceptrons with hard-limiting threshold units,

    R. Goodman and Z. Zeng, “A learning algorithm for multi-layer perceptrons with hard-limiting threshold units,” in Proceedings of IEEE Workshop on Neural Networks for Signal Processing, Sep. 1994, pp. 219–228

  36. [36]

    Deep Equilibrium Models,

    S. Bai, J. Z. Kolter, and V. Koltun, “Deep Equilibrium Models,” in Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc., 2019

  37. [37]

    Prefix sums and their applications,

    G. E. Blelloch, “Prefix sums and their applications,” School of Computer Science, Carnegie Mellon University Pittsburgh, PA, USA, Tech. Rep., 1990

  38. [38]

    The operator ⊛ is associative,

  39. [39]

    Performing the scan with the operator ⊛ on the array [c_0, …, c_T] creates the array [s_0, …, s_T], where s_t = [y_t, h_t] and y_t is defined as: y_t = a_0 if t = 0, and y_t = a_t ⊙ y_{t−1} if 0 < t < T. It results that the parallel scan can be used to solve this first-order linear recurrence as ⊛ is associative (point 1.), and the solutions h_t will be the second values of the...
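The associativity claim can be checked by hand. The block below is a sketch under an assumption: it uses the standard operator for first-order linear recurrences from the prefix-sum literature [37], with c_t = [a_t, b_t], which is consistent with the y_t and h_t components described above but is not shown in this excerpt. The symbol \circledast needs the amssymb package.

```latex
% Assumed definition (not shown in the excerpt), with c_t = [a_t, b_t]:
\[
  c_i \circledast c_j
  = \bigl[\, a_j \odot a_i,\;\; a_j \odot b_i + b_j \,\bigr]
\]
% Grouping to the left:
\[
  (c_i \circledast c_j) \circledast c_k
  = \bigl[\, a_k \odot a_j \odot a_i,\;\;
             a_k \odot (a_j \odot b_i + b_j) + b_k \,\bigr]
\]
% Grouping to the right:
\[
  c_i \circledast (c_j \circledast c_k)
  = \bigl[\, (a_k \odot a_j) \odot a_i,\;\;
             (a_k \odot a_j) \odot b_i + (a_k \odot b_j + b_k) \,\bigr]
\]
% Both second components expand to
% a_k \odot a_j \odot b_i + a_k \odot b_j + b_k,
% so the two groupings agree and the operator is associative.
```

Under this definition the first component of each scanned element is the running product of the a_t, matching the recurrence given for y_t, and the second component solves h_t = a_t ⊙ h_{t−1} + b_t from a zero initial state.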