Time-Scale Coupling Between States and Parameters in Recurrent Neural Networks

Lorenzo Livi

arXiv: 2508.12121 · v5 · submitted 2025-08-16 · cs.LG · math.DS


Pith reviewed 2026-05-18 22:52 UTC · model grok-4.3


keywords: recurrent neural networks, gating mechanisms, effective learning rates, time-scale coupling, Jacobian analysis, gradient anisotropy, credit assignment, adaptive optimization


The pith

Gating in RNNs induces lag-dependent effective learning rates by coupling state time-scales to gradient updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that gating mechanisms in recurrent neural networks create effective learning rates that vary with time lag and update direction, even under a fixed global step size during training. This effect stems from an inherent coupling between the time-scales of state evolution, controlled by the gates, and the dynamics of parameter updates via gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and performing a first-order expansion, the analysis shows how gates reshape gradient flow, modulate step sizes, and introduce directional anisotropy in updates. A reader would care because this provides a dynamical-systems account of why gated architectures train more reliably than plain RNNs, while linking gates to known adaptive optimizers such as momentum and Adam. The work positions gates as both information filters and data-driven preconditioners of the optimization landscape.

Core claim

Gating mechanisms induce lag-dependent and direction-dependent effective learning rates, arising from a coupling between state-space time-scales parametrized by the gates and parameter-space dynamics during gradient descent. Exact Jacobians for leaky-integrator and gated RNNs, combined with a first-order expansion, make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy. Gates thereby act as data-driven preconditioners with formal connections to learning-rate schedules, momentum, and methods such as Adam. Empirical simulations confirm that gates produce the predicted lag-dependent rates and low-rank

What carries the argument

Exact Jacobians of leaky-integrator and gated RNNs under first-order expansion, which reveal how gates modulate gradient propagation and effective step sizes.
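To make the mechanism concrete, here is a worked sketch for the constant-gate (leaky-integrator) case. The state equation and the symbols $W$, $U$, $\phi$, $D_t$ are assumptions chosen so that the leading term reproduces a $\mu\,\alpha^{t-k}$ scaling; the paper's own parametrization may differ, and the expansion is only meaningful when $(1-\alpha)\,\lVert D_j W\rVert$ is small.

```latex
% Assumed leaky-integrator update (hypothetical notation, not taken from the paper)
h_t = \alpha\, h_{t-1} + (1-\alpha)\,\phi\!\left(W h_{t-1} + U x_t\right), \qquad 0 < \alpha < 1 .

% Exact one-step Jacobian, with D_t = \operatorname{diag}\,\phi'\!\left(W h_{t-1} + U x_t\right)
J_t = \frac{\partial h_t}{\partial h_{t-1}} = \alpha I + (1-\alpha)\, D_t W .

% Exact multi-step Jacobian: an ordered product of t-k factors
\frac{\partial h_t}{\partial h_k} = J_t\, J_{t-1} \cdots J_{k+1} .

% First-order expansion, treating (1-\alpha) D_j W as the perturbation
\frac{\partial h_t}{\partial h_k}
  = \alpha^{t-k} I
  + \alpha^{t-k-1}(1-\alpha) \sum_{j=k+1}^{t} D_j W
  + \text{(terms with two or more factors of } (1-\alpha) D_j W \text{)} .
```

Under these assumptions, a gradient step with global step size $\mu$ weights the contribution from lag $t-k$ by roughly $\mu\,\alpha^{t-k}$ at leading order, which is the sense in which a fixed step size becomes lag-dependent.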

If this is right

Gates function simultaneously as information filters and as preconditioners that align state-space transport with loss-relevant directions.
Gating and optimizer-driven adaptivity address complementary parts of credit assignment.
The induced anisotropy matches or exceeds the structure produced by Adam across several tasks.
The coupling supplies a unified view of why gated RNNs remain trainable on long sequences.
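The notion of updates concentrating in a few directions can be illustrated with a simple diagnostic. The sketch below is a hypothetical metric, not the paper's definition of its anisotropy measures: it reports the fraction of gradient variance carried by the leading direction of the gradient covariance, so higher values mean more concentrated updates. The function name `update_anisotropy` and the input layout (one flattened gradient per row) are assumptions.

```python
import numpy as np

def update_anisotropy(grads):
    """Fraction of gradient variance along the leading covariance direction.

    grads: array of shape (num_samples, num_params), one flattened gradient per row.
    Returns a value in [1/num_params, 1]; 1 means all variance lies on one direction.
    NOTE: hypothetical diagnostic for illustration, not the paper's metric.
    """
    G = np.asarray(grads, dtype=float)
    G = G - G.mean(axis=0, keepdims=True)      # center across samples
    # Squared singular values of the centered matrix are proportional to the
    # eigenvalues of the sample covariance, sorted in descending order.
    s = np.linalg.svd(G, compute_uv=False)
    var = s ** 2
    return float(var[0] / var.sum())

# Usage: compare isotropic gradients with gradients confined to one direction.
rng = np.random.default_rng(0)
isotropic = rng.standard_normal((2000, 8))
rank_one = np.outer(rng.standard_normal(2000), rng.standard_normal(8))
score_isotropic = update_anisotropy(isotropic)
score_rank_one = update_anisotropy(rank_one)
```

Computing such a score for gradients collected from gated and non-gated models under the same optimizer would be one way to compare their update structure, under the caveat that the choice of metric affects the picture.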

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Design of new gate functions could target specific anisotropy patterns to accelerate convergence on particular sequence lengths.
Similar state-parameter couplings may appear in attention-based models and could be diagnosed with the same Jacobian approach.
The framework suggests experiments that vary gate dimensionality while holding optimizer fixed to isolate the contribution of each to final performance.

Load-bearing premise

The first-order expansion applied to the exact Jacobians sufficiently captures how gates reshape gradient propagation and introduce anisotropy.

What would settle it

Measuring whether the effective learning rates observed in gradient updates of gated RNNs on sequence tasks vary systematically with lag and direction, matching the anisotropy predicted by the Jacobian analysis but absent in non-gated models.
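A minimal numerical probe of the lag dependence, for the constant-gate case only, might look like the following. It assumes a leaky-integrator update of the form h_t = α h_{t-1} + (1 - α) tanh(W h_{t-1} + x_t), with the input added directly to the pre-activation; this form, the function name, and all parameter values are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def leaky_rnn_lag_sensitivity(alpha, W, inputs, h0):
    """Spectral norm of d h_T / d h_k for each lag T - k, using exact Jacobians.

    Assumed model (illustrative): h_t = alpha*h_{t-1} + (1-alpha)*tanh(W h_{t-1} + x_t).
    Returns an array of length T + 1, indexed by lag (lag 0 is the identity).
    """
    h = np.asarray(h0, dtype=float)
    n = h.shape[0]
    jacobians = []
    for x in inputs:
        pre = W @ h + x
        D = np.diag(1.0 - np.tanh(pre) ** 2)            # derivative of tanh at pre
        jacobians.append(alpha * np.eye(n) + (1.0 - alpha) * D @ W)
        h = alpha * h + (1.0 - alpha) * np.tanh(pre)
    # Build J_T, then J_T J_{T-1}, ... by walking backwards through time.
    prod = np.eye(n)
    norms = [1.0]
    for J in reversed(jacobians):
        prod = prod @ J
        norms.append(np.linalg.norm(prod, 2))
    return np.array(norms)

# Usage: a fixed global step size mu, rescaled by the lag-dependent sensitivity.
rng = np.random.default_rng(0)
alpha, mu, T, n = 0.9, 0.01, 20, 4
W = 0.05 * rng.standard_normal((n, n))                  # small recurrent weights
norms = leaky_rnn_lag_sensitivity(alpha, W, rng.standard_normal((T, n)), np.zeros(n))
effective_lr = mu * norms                               # compare with mu * alpha**lag
```

With W set to zero the profile equals α raised to the lag exactly; with small nonzero W the deviation from that curve gives a rough sense of the size of the correction terms.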

Figures

Figures reproduced from arXiv: 2508.12121 by Lorenzo Livi.

**Figure 2.** Second-order remainder C_2(ε) for the scalar gate case. b) Multi-gate case: Figures 5–8 show the corresponding diagnostics for the multi-gate configuration. The truncation error in

**Figure 3.** Per-step norms ∥A_j∥_2 (dominant part), ∥B_j∥_2 (gate correction), and their ratio over time for the scalar gate case

**Figure 4.** Distribution of per-step ratios ∥B_j∥_2/∥A_j∥_2 for the scalar gate case

**Figure 5.** First-order truncation error vs. ε for the multi-gate case

**Figure 6.** Second-order remainder C_2(ε) for the multi-gate case

**Figure 7.** Per-step norms ∥A_j∥_2 (dominant part), ∥B_j∥_2 (gate correction), and their ratio over time for the multi-gate case

**Figure 8.** Distribution of per-step ratios ∥B_j∥_2/∥A_j∥_2 for the multi-gate case

**Figure 9.** Leaky RNN (constant α): normalized effective LR profile at final checkpoint (left), slope s(ℓ) across iterations (middle), and full sensitivity heatmap S_{t,k} (right)

**Figure 10.** Scalar-gated RNN: normalized effective LR profile at final checkpoint (left), slope

**Figure 11.** Multi-gated RNN: normalized effective LR profile at final checkpoint (left), slope

**Figure 12.** Adding task. Left/middle: propagation anisotropy (AI, CE) vs. lag. Bottom: update anisotropy from gradient covariance (higher is more concentrated).

**Figure 13.** AR(2). Propagation is highly anisotropic for all models; updates concentrate much more with gates.

**Figure 14.** Delay-sum. Update anisotropy is extreme for scalar/multi gates; Adam remains much flatter.

**Figure 15.** Moving-average. Multi-gate shows the strongest update concentration; scalar is a close second.

**Figure 16.** NARMA10. The gap between gated and Adam models is largest in update anisotropy.

Original abstract

We show that gating mechanisms in recurrent neural networks (RNNs) induce lag-dependent and direction-dependent effective learning rates, even when training uses a fixed, global step size. This behavior arises from a coupling between state-space time-scales (parametrized by the gates) and parameter-space dynamics during gradient descent. By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates. These findings reveal that gates act not only as filters of information flow, but also as data-driven preconditioners of optimization, with formal connections to learning-rate schedules, momentum, and adaptive methods such as Adam. Empirical simulations corroborate these predictions: across several sequence tasks, gates produce lag-dependent effective learning rates and concentrate gradient flow into low-dimensional subspaces, matching or exceeding the anisotropic structure induced by Adam. Notably, gating and optimizer-driven adaptivity shape complementary aspects of credit assignment: gates align state-space transport with loss-relevant directions, while optimizers rescale parameter-space updates. Overall, this work provides a unified dynamical systems perspective on how gating couples state evolution with parameter updates, clarifying why gated architectures achieve robust trainability in practice.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

The paper frames RNN gates as coupling state time-scales to parameter updates to produce lag- and direction-dependent effective learning rates, with a first-order Jacobian expansion and some supporting simulations.


The main point is that this work derives how gating mechanisms in RNNs create effective learning rates that vary with lag and direction through a coupling of state-space time-scales and gradient descent dynamics, even with a fixed step size. It positions gates as data-driven preconditioners that introduce anisotropy in updates and connects this to adaptive methods like Adam. The simulations on sequence tasks show gates concentrating gradients into low-dimensional subspaces and producing lag-dependent rates that complement optimizer effects. This gives a dynamical-systems account of why gated RNNs often train more stably than vanilla ones. The explicit Jacobian derivations for leaky-integrator and gated cases, plus the expansion showing how constant, scalar, and multi-dimensional gates reshape propagation, are the concrete pieces that stand out. The empirical checks add some corroboration without obvious circularity.

The stress-test concern about higher-order terms in multi-step Jacobian products is worth checking in the full text, since RNN gradients involve long chains and a single-step linear approximation could accumulate error when spectral radii are near one or gates vary. If the paper only validates on short sequences or lacks bounds on the neglected terms, that would weaken the anisotropy claim for realistic unrollings. Citation patterns would also clarify how much this overlaps with existing RNN gradient analyses.

This is for people who study architecture-optimization interactions in recurrent models. A reader wanting a formal angle on gated trainability would get something useful from the derivations and the complementary view of gates versus optimizers. It has enough structure and simulation backing to go to a serious referee, though the expansion's accuracy over many steps should be the main focus of review.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that gating mechanisms in RNNs induce lag-dependent and direction-dependent effective learning rates during gradient descent with a fixed global step size. This arises from coupling between state-space time-scales (parametrized by gates) and parameter-space dynamics. The authors derive exact Jacobians for leaky-integrator and gated RNNs, apply a first-order expansion to show how gates reshape gradient propagation and introduce anisotropy, draw formal connections to learning-rate schedules/momentum/Adam, and corroborate via empirical simulations on sequence tasks where gates concentrate gradient flow into low-dimensional subspaces.

Significance. If the central derivations hold without uncontrolled approximation error, the work supplies a dynamical-systems account of why gated RNNs train robustly, framing gates as data-driven preconditioners that align state transport with loss-relevant directions while optimizers handle parameter rescaling. The explicit links to adaptive methods and the empirical match to Adam's anisotropy are potentially useful for understanding credit assignment in recurrent models.

major comments (1)

[Jacobian derivation and first-order expansion] The first-order expansion of the Jacobians (described in the abstract and the derivation sections) is load-bearing for the lag- and direction-dependent effective learning-rate claims. In recurrent unrolling the total gradient is a product of Jacobians over many steps; when the spectral radius is near unity or gate values vary, higher-order terms in the expansion can accumulate and may not be negligible relative to the retained linear term. The manuscript should either bound the remainder or demonstrate that the anisotropy and preconditioning conclusions survive a multi-step analysis.

minor comments (1)

[Empirical simulations] The empirical section should report the precise sequence tasks, the method used to extract effective learning rates from simulations, and any controls for post-hoc gate-value selection.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive review and for recognizing the potential value of the dynamical-systems framing of gating as a preconditioner. We address the single major comment below and will incorporate revisions to strengthen the multi-step analysis.


Referee: The first-order expansion of the Jacobians (described in the abstract and the derivation sections) is load-bearing for the lag- and direction-dependent effective learning-rate claims. In recurrent unrolling the total gradient is a product of Jacobians over many steps; when the spectral radius is near unity or gate values vary, higher-order terms in the expansion can accumulate and may not be negligible relative to the retained linear term. The manuscript should either bound the remainder or demonstrate that the anisotropy and preconditioning conclusions survive a multi-step analysis.

Authors: We agree that the accumulation of higher-order terms in the product of Jacobians over many steps is a valid concern when the spectral radius approaches unity. The first-order expansion is applied locally to each Jacobian to derive an interpretable expression for the leading effect of gates on per-step gradient scaling and anisotropy; the exact (non-expanded) Jacobians are used for all empirical gradient computations. To address the referee's point directly, we will add a new subsection that (1) provides an explicit bound on the remainder for the multi-step product under the assumption of slowly varying gates (a regime observed in trained models) and (2) reports numerical comparisons of the first-order prediction versus the full unrolled gradient for sequence lengths up to several hundred steps. These additions will show that the reported lag- and direction-dependent effects remain dominant. The core claims are therefore unchanged, but the manuscript will be revised to include this supporting analysis.

Revision: yes
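The kind of numerical comparison discussed in this exchange can be sketched on generic matrix products. The code below compares an exact product of perturbed factors with its first-order expansion in ε and is only an illustration: the left-to-right ordering of factors, the function names, and the random test matrices are assumptions, and no claim is made that this matches the simulations in the paper.

```python
import numpy as np

def exact_product(A, B, eps):
    """(A_1 + eps*B_1)(A_2 + eps*B_2)...(A_n + eps*B_n), multiplied left to right."""
    F = np.eye(A[0].shape[0])
    for Aj, Bj in zip(A, B):
        F = F @ (Aj + eps * Bj)
    return F

def first_order_product(A, B, eps):
    """A_1...A_n + eps * sum_m (A_1...A_{m-1}) B_m (A_{m+1}...A_n)."""
    n, d = len(A), A[0].shape[0]
    prefix = [np.eye(d)]                 # prefix[m] = product of the first m factors
    for Aj in A:
        prefix.append(prefix[-1] @ Aj)
    suffix = [np.eye(d)]                 # after reversal: suffix[m] = A[m] ... A[n-1]
    for Aj in reversed(A):
        suffix.append(Aj @ suffix[-1])
    suffix.reverse()
    correction = sum(prefix[m] @ B[m] @ suffix[m + 1] for m in range(n))
    return prefix[n] + eps * correction

# Usage: the truncation error should shrink roughly like eps**2.
rng = np.random.default_rng(1)
A = [0.9 * np.eye(4) + 0.05 * rng.standard_normal((4, 4)) for _ in range(10)]
B = [rng.standard_normal((4, 4)) for _ in range(10)]
errors = {
    eps: np.linalg.norm(exact_product(A, B, eps) - first_order_product(A, B, eps))
    for eps in (1e-2, 1e-3, 1e-4)
}
```

Repeating the comparison with more factors, or with factors whose norms sit close to one, is a direct way to see how quickly the neglected terms grow with sequence length.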

Circularity Check

0 steps flagged

Derivation of Jacobians and first-order expansion from RNN equations is self-contained with no circular reduction


The paper derives exact Jacobians for leaky-integrator and gated RNNs directly from their defining differential or discrete equations and applies a first-order expansion to analyze lag-dependent and direction-dependent effective learning rates. This is a forward mathematical analysis of gradient propagation under the model dynamics, without any parameter fitting to subsets of data followed by prediction of related quantities, without self-definitional loops, and without load-bearing reliance on self-citations for uniqueness theorems or ansatzes. The abstract and reader's summary confirm the steps start from the RNN state equations themselves, rendering the claimed coupling between state time-scales and parameter updates a direct consequence rather than a reconstruction of inputs. No evidence of the enumerated circularity patterns appears in the provided derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Based solely on the abstract, the work rests on standard RNN dynamical assumptions and a first-order approximation; no explicit free parameters, new entities, or ad-hoc axioms are introduced.

axioms (1)

Domain assumption: Leaky-integrator and gated RNNs follow their standard state-update equations as commonly defined in the literature. Derivations of exact Jacobians presuppose these model forms.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: "By deriving exact Jacobians for leaky-integrator and gated RNNs and applying a first-order expansion, we make explicit how constant, scalar, and multi-dimensional gates reshape gradient propagation, modulate effective step sizes, and introduce anisotropy in parameter updates."

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · relation: unclear
Paper passage: "the effective learning rate reads: μ*_{t,k} = μ α^{t-k} + (perturbative correction terms)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 7 internal anchors

[1] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, Atlanta, Georgia, USA, 2013, pp. 1310–1318.

[2] N. Zucchet and A. Orvieto, “Recurrent neural networks: vanishing and exploding gradients are not the end of the story,” Advances in Neural Information Processing Systems, vol. 37, pp. 139402–139443, 2024.

[3] A. Ceni, “Random orthogonal additive filters: A solution to the vanishing/exploding gradient of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 6, pp. 10794–10807, 2025.

[4] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.

[5] A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Ré, “Combining recurrent, convolutional, and continuous-time models with linear state space layers,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[6] N. Muca Cirone, A. Orvieto, B. Walker, C. Salvi, and T. Lyons, “Theoretical foundations of deep selective state-space models,” Advances in Neural Information Processing Systems, vol. 37, pp. 127226–127272, 2024.

[7] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[8] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.

[9] K. Helfrich, D. Willmott, and Q. Ye, “Orthogonal recurrent neural networks with scaled Cayley transform,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1969–1978.

[10] Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey, “Efficient orthogonal parametrisation of recurrent neural networks using householder reflections,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2401–2409.

[11] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in International Conference on Machine Learning, New York, USA, June 2016, pp. 1120–1128.

[12] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas, “Full-capacity unitary recurrent neural networks,” in Advances in Neural Information Processing Systems, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Barcelona, Spain: Curran Associates, Inc., Dec. 2016, pp. 4880–4888.

[13] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal, “On orthogonality and learning recurrent networks with long term dependencies,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3570–3578.

[14] N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney, “Lipschitz recurrent neural networks,” arXiv preprint arXiv:2006.12070, 2021.

[15] G. Kerg, K. Goyette, M. P. Touzel, G. Gidel, E. Vorontsov, Y. Bengio, and G. Lajoie, “Non-normal recurrent neural network (nnrnn): learning long time dependencies while improving expressivity with transient dynamics,” arXiv preprint arXiv:1905.12080, 2019.

[16] A. Kag, Z. Zhang, and V. Saligrama, “RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients?” in International Conference on Learning Representations, 2020.

[17] B. Chang, M. Chen, E. Haber, and E. H. Chi, “AntisymmetricRNN: A dynamical system view on recurrent neural networks,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ryxepo0cFX

[18] T. K. Rusch and S. Mishra, “Coupled oscillatory recurrent neural network (cornn): An accurate and (gradient) stable architecture for learning long time dependencies,” ICLR, 2021.

[19] T. K. Rusch, S. Mishra, N. B. Erichson, and M. W. Mahoney, “Long expressive memory for sequence modeling,” arXiv preprint arXiv:2110.04744, 2021.

[20] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, “A clockwork RNN,” in International Conference on Machine Learning, vol. 32, no. 2, 2014, pp. 1863–1871.

[21] M. Chen, J. Pennington, and S. Schoenholz, “Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR...

[22] D. Gilboa, B. Chang, M. Chen, G. Yang, S. S. Schoenholz, E. H. Chi, and J. Pennington, “Dynamical isometry and a mean field theory of lstms and grus,” arXiv preprint arXiv:1901.08987, 2019.

[23] J. Pennington, S. Schoenholz, and S. Ganguli, “Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice,” in Advances in Neural Information Processing Systems, 2017, pp. 4785–4795.

[24] M. O. Turkoglu, S. D’Aronco, J. D. Wegner, and K. Schindler, “Gating revisited: Deep multi-layer rnns that can be trained,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4081–4092, 2022.

[25] J. Van Der Westhuizen and J. Lasenby, “The unreasonable effectiveness of the forget gate,” arXiv preprint arXiv:1804.04849, 2018.

[26] A. Krishnamurthy, C. Gehring, D. K. Misra, and C. Zhang, “Theory of gating in recurrent neural networks,” Journal of Machine Learning Research, vol. 23, no. 157, pp. 1–39, 2022.

[27] O. Can, K. Kapanova, and A. Søgaard, “Gates create slow modes in recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2020.

[28] R. Quax, D. Kandhai, and P. M. A. Sloot, “Adaptive time scales in recurrent neural networks,” Scientific Reports, vol. 10, no. 1, p. 7442, 2020.

[29] C. Tallec and Y. Ollivier, “Can recurrent neural networks warp time?” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=SJcKhk-Ab

[30] H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert, “Optimization and applications of echo state networks with leaky-integrator neurons,” Neural Networks, vol. 20, no. 3, pp. 335–352, 2007.

[31] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.

[32] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[33] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[34] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.

[36] N. J. Higham, Functions of Matrices: Theory and Computation. SIAM, 2008.

[37] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Applications. Boston, MA: Birkhäuser, 2003.
[38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

Supplementary Material

A. Matrix product expansion via the Fréchet derivative formulation

We derive here the first-order expansion of a product of matrices with structured perturbations, starting from the product rule for the Fréchet derivative (Th...

Fréchet differentiability and the first-order expansion: Let $\mathbb{C}^{n \times n}$ denote the finite-dimensional vector space of complex $n \times n$ matrices equipped with a matrix norm (specifically, the operator 2-norm, unless otherwise stated). Since the space is finite-dimensional, all norms are equivalent. Definition VIII.1 (Fréchet differentiability [37], [36]). Let $f$ : ...

Matrix products with structured perturbations: We now consider a product of $n$ factors, each with a perturbation proportional to a scalar parameter $\varepsilon$:

$$F(\varepsilon) = \prod_{j=1}^{n} \left(A_j + \varepsilon B_j\right), \tag{59}$$

where:
• $A_j \in \mathbb{C}^{d \times d}$ is the unperturbed factor at position $j$,
• $B_j \in \mathbb{C}^{d \times d}$ is the perturbation at position $j$,
• $\varepsilon \in \mathbb{R}$ controls the magnitude of all perturbations.

The direction of perturbation $E$ in (57) is now the tuple $E \equiv (B_1, B_2,$ ...

Recursive application of the product rule: For any $k \le n$, $\varepsilon > 0$, define the product $F_k(\varepsilon) := \prod_{j=1}^{k} \left(A_j + \varepsilon B_j\right)$, so that $F_n(\varepsilon) = F_{n-1}(\varepsilon)\left(A_n + \varepsilon B_n\right)$. We now apply the product rule (58) to $F_n$ by setting $g(\varepsilon) = F_{n-1}(\varepsilon)$, $h(\varepsilon) = A_n + \varepsilon B_n$. At $\varepsilon = 0$ we have: $g(0) = F_{n-1}(0) = A_1 A_2 \cdots A_{n-1}$, $h(0) = A_n$, $L_h(0, E) = B_n$. The product rule gives: $L_{F_n}(0, E) = L_g(0, E)$ ...

First-order expansion: Applying the first-order Taylor expansion (57) to (59) gives

$$F(\varepsilon) = F(0) + \varepsilon\, L_F(0, E) + O(\varepsilon^2), \tag{65}$$

where $F(0) = \prod_{j=1}^{n} A_j$ and $L_F(0, E)$ is given by (61). Substituting, we obtain the explicit first-order expansion:

$$F(\varepsilon) = \underbrace{\prod_{j=1}^{n} A_j}_{\text{dominant dynamics}} + \varepsilon \underbrace{\sum_{m=1}^{n} \left(\prod_{j=m+1}^{n} A_j\right) B_m \left(\prod_{j=1}^{m-1} A_j\right)}_{\text{pert...}} \;\ldots$$

Simulations supporting the validity of the first-order approximation: The first-order expansion (66) is accurate when all $\lVert B_j \rVert$ are small compared to $\lVert A_j \rVert$ (in operator norm), so that the accumulated $O(\varepsilon^2)$ terms remain negligible. In the main text, $B_j$ represents gate-induced corrections, which are typically low-norm compared to the dominant dynamics in $A_j$. ...

[1] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proceedings of the 30th International Conference on Machine Learning, vol. 28, Atlanta, Georgia, USA, 2013, pp. 1310–1318.

[2] N. Zucchet and A. Orvieto, “Recurrent neural networks: vanishing and exploding gradients are not the end of the story,” Advances in Neural Information Processing Systems, vol. 37, pp. 139 402–139 443, 2024.

[3] A. Ceni, “Random orthogonal additive filters: A solution to the vanishing/exploding gradient of deep neural networks,” IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 6, pp. 10 794–10 807, 2025.

[4] A. Gu, K. Goel, and C. Ré, “Efficiently modeling long sequences with structured state spaces,” arXiv preprint arXiv:2111.00396, 2021.

[5] A. Gu, I. Johnson, K. Goel, K. K. Saab, T. Dao, A. Rudra, and C. Ré, “Combining recurrent, convolutional, and continuous-time models with linear state space layers,” in Thirty-Fifth Conference on Neural Information Processing Systems, 2021.

[6] N. Muca Cirone, A. Orvieto, B. Walker, C. Salvi, and T. Lyons, “Theoretical foundations of deep selective state-space models,” Advances in Neural Information Processing Systems, vol. 37, pp. 127 226–127 272, 2024.

[7] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” Advances in Neural Information Processing Systems, vol. 32, 2019.

[8] A. M. Saxe, J. L. McClelland, and S. Ganguli, “Exact solutions to the nonlinear dynamics of learning in deep linear neural networks,” arXiv preprint arXiv:1312.6120, 2013.

[9] K. Helfrich, D. Willmott, and Q. Ye, “Orthogonal recurrent neural networks with scaled Cayley transform,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. PMLR, 10–15 Jul 2018, pp. 1969–1978.

[10] Z. Mhammedi, A. Hellicar, A. Rahman, and J. Bailey, “Efficient orthogonal parametrisation of recurrent neural networks using Householder reflections,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 2401–2409.

[11] M. Arjovsky, A. Shah, and Y. Bengio, “Unitary evolution recurrent neural networks,” in International Conference on Machine Learning, New York, USA, June 2016, pp. 1120–1128.

[12] S. Wisdom, T. Powers, J. Hershey, J. Le Roux, and L. Atlas, “Full-capacity unitary recurrent neural networks,” in Advances in Neural Information Processing Systems, D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, Eds. Barcelona, Spain: Curran Associates, Inc., Dec. 2016, pp. 4880–4888.

[13] E. Vorontsov, C. Trabelsi, S. Kadoury, and C. Pal, “On orthogonality and learning recurrent networks with long term dependencies,” in Proceedings of the 34th International Conference on Machine Learning, 2017, pp. 3570–3578.

[14] N. B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. W. Mahoney, “Lipschitz recurrent neural networks,” arXiv preprint arXiv:2006.12070, 2021.

[15] G. Kerg, K. Goyette, M. P. Touzel, G. Gidel, E. Vorontsov, Y. Bengio, and G. Lajoie, “Non-normal recurrent neural network (nnrnn): learning long time dependencies while improving expressivity with transient dynamics,” arXiv preprint arXiv:1905.12080, 2019.

[16] A. Kag, Z. Zhang, and V. Saligrama, “RNNs incrementally evolving on an equilibrium manifold: A panacea for vanishing and exploding gradients?” in International Conference on Learning Representations, 2020.

[17] B. Chang, M. Chen, E. Haber, and E. H. Chi, “AntisymmetricRNN: A dynamical system view on recurrent neural networks,” in International Conference on Learning Representations, 2019. [Online]. Available: https://openreview.net/forum?id=ryxepo0cFX

[18] T. K. Rusch and S. Mishra, “Coupled oscillatory recurrent neural network (cornn): An accurate and (gradient) stable architecture for learning long time dependencies,” ICLR, 2021.

[19] T. K. Rusch, S. Mishra, N. B. Erichson, and M. W. Mahoney, “Long expressive memory for sequence modeling,” arXiv preprint arXiv:2110.04744, 2021.

[20] J. Koutnik, K. Greff, F. Gomez, and J. Schmidhuber, “A clockwork RNN,” in International Conference on Machine Learning, vol. 32, no. 2, 2014, pp. 1863–1871.

[21] M. Chen, J. Pennington, and S. Schoenholz, “Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, J. Dy and A. Krause, Eds., vol. 80. Stockholmsmässan, Stockholm Sweden: PMLR...

[22] D. Gilboa, B. Chang, M. Chen, G. Yang, S. S. Schoenholz, E. H. Chi, and J. Pennington, “Dynamical isometry and a mean field theory of LSTMs and GRUs,” arXiv preprint arXiv:1901.08987, 2019.

[23] J. Pennington, S. Schoenholz, and S. Ganguli, “Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice,” in Advances in Neural Information Processing Systems, 2017, pp. 4785–4795.

[24] M. O. Turkoglu, S. D’Aronco, J. D. Wegner, and K. Schindler, “Gating revisited: Deep multi-layer rnns that can be trained,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 8, pp. 4081–4092, 2022.

[25] J. Van Der Westhuizen and J. Lasenby, “The unreasonable effectiveness of the forget gate,” arXiv preprint arXiv:1804.04849, 2018.

[26] A. Krishnamurthy, C. Gehring, D. K. Misra, and C. Zhang, “Theory of gating in recurrent neural networks,” Journal of Machine Learning Research, vol. 23, no. 157, pp. 1–39, 2022.

[27] O. Can, K. Kapanova, and A. Søgaard, “Gates create slow modes in recurrent neural networks,” in International Conference on Learning Representations (ICLR), 2020.

[28] R. Quax, D. Kandhai, and P. M. A. Sloot, “Adaptive time scales in recurrent neural networks,” Scientific Reports, vol. 10, no. 1, p. 7442, 2020.

[29] C. Tallec and Y. Ollivier, “Can recurrent neural networks warp time?” in International Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=SJcKhk-Ab

[30] H. Jaeger, M. Lukoševičius, D. Popovici, and U. Siewert, “Optimization and applications of echo state networks with leaky-integrator neurons,” Neural Networks, vol. 20, no. 3, pp. 335–352, 2007.

[31] P. J. Werbos, “Backpropagation through time: what it does and how to do it,” Proceedings of the IEEE, vol. 78, no. 10, pp. 1550–1560, 1990.

[32] S. Ruder, “An overview of gradient descent optimization algorithms,” arXiv preprint arXiv:1609.04747, 2016.

[33] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

[34] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder-decoder for statistical machine translation,” arXiv preprint arXiv:1406.1078, 2014.

[35] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, 2017.

[36] N. J. Higham, Functions of Matrices: Theory and Computation. SIAM, 2008.

[37] S. G. Krantz and H. R. Parks, The Implicit Function Theorem: History, Theory, and Applications. Boston, MA: Birkhäuser, 2003.

[38] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.

SUPPLEMENTARY MATERIAL

A. Matrix product expansion via the Fréchet derivative formulation

We derive here the first-order expansion of a product of matrices with structured perturbations, starting from the product rule for the Fréchet derivative (Th...

Fréchet differentiability and the first-order expansion: Let $\mathbb{C}^{n \times n}$ denote the finite-dimensional vector space of complex $n \times n$ matrices equipped with a matrix norm (specifically, the operator 2-norm, unless otherwise stated). Since the space is finite-dimensional, all norms are equivalent. Definition VIII.1 (Fréchet differentiability [37], [36]). Let f : ...

Matrix products with structured perturbations: We now consider a product of $n$ factors, each with a perturbation proportional to a scalar parameter $\varepsilon$:

$$F(\varepsilon) = \prod_{j=1}^{n} \left(A_j + \varepsilon B_j\right), \tag{59}$$

where:
• $A_j \in \mathbb{C}^{d \times d}$ is the unperturbed factor at position $j$,
• $B_j \in \mathbb{C}^{d \times d}$ is the perturbation at position $j$,
• $\varepsilon \in \mathbb{R}$ controls the magnitude of all perturbations.

The direction of perturbation $E$ in (57) is now the tuple $E \equiv (B_1, B_2,$...

Recursive application of the product rule: For any $k \le n$, $\varepsilon > 0$, define the product

$$F_k(\varepsilon) := \prod_{j=1}^{k} \left(A_j + \varepsilon B_j\right),$$

so that $F_n(\varepsilon) = F_{n-1}(\varepsilon)\left(A_n + \varepsilon B_n\right)$. We now apply the product rule (58) to $F_n$ by setting $g(\varepsilon) = F_{n-1}(\varepsilon)$, $h(\varepsilon) = A_n + \varepsilon B_n$. At $\varepsilon = 0$ we have: $g(0) = F_{n-1}(0) = A_1 A_2 \cdots A_{n-1}$, $h(0) = A_n$, $L_h(0, E) = B_n$. The product rule gives: $L_{F_n}(0, E) = L_g(0, E)$ ...
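As a small sanity check of the recursion, the case $n = 2$ can be expanded by hand. The sketch below assumes the left-to-right ordering implied by $F_n(\varepsilon) = F_{n-1}(\varepsilon)(A_n + \varepsilon B_n)$ and the standard form of the product rule, $L_{gh} = L_g\, h + g\, L_h$. It is an illustration, not a step taken from the original derivation.

```latex
\begin{align*}
F_2(\varepsilon) &= (A_1 + \varepsilon B_1)(A_2 + \varepsilon B_2) \\
                 &= A_1 A_2 + \varepsilon \left( B_1 A_2 + A_1 B_2 \right) + \varepsilon^2 B_1 B_2 , \\
L_{F_2}(0, E)    &= L_g(0, E)\, h(0) + g(0)\, L_h(0, E) = B_1 A_2 + A_1 B_2 ,
\qquad g = F_1,\; h = A_2 + \varepsilon B_2 .
\end{align*}
```

The coefficient of $\varepsilon$ in the direct expansion coincides with the product-rule result, and the leftover $\varepsilon^2 B_1 B_2$ is the only higher-order term for two factors.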

First-order expansion: Applying the first-order Taylor expansion (57) to (59) gives

$$F(\varepsilon) = F(0) + \varepsilon\, L_F(0, E) + O(\varepsilon^2), \tag{65}$$

where $F(0) = \prod_{j=1}^{n} A_j$ and $L_F(0, E)$ is given by (61). Substituting, we obtain the explicit first-order expansion:

$$F(\varepsilon) = \underbrace{\left(\prod_{j=1}^{n} A_j\right)}_{\text{dominant dynamics}} + \underbrace{\varepsilon \sum_{m=1}^{n} \left(\prod_{j=m+1}^{n} A_j\right) B_m \left(\prod_{j=1}^{m-1} A_j\right)}_{\text{pert...}}$$ ...

Simulations supporting the validity of the first-order approximation: The first-order expansion (66) is accurate when all $\|B_j\|$ are small compared to $\|A_j\|$ (in operator norm), so that the accumulated $O(\varepsilon^2)$ terms remain negligible. In the main text, $B_j$ represents gate-induced corrections, which are typically low-norm compared to the dominant dynamics in $A_j$. ...
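The accuracy claim can be probed numerically with a minimal NumPy sketch. This is not the simulation reported in the paper: the helper names (`chain_product`, `first_order_term`, `expansion_errors`), the matrix sizes, and the real Gaussian factors are all hypothetical choices made for illustration. The sketch assumes the left-to-right ordering $F_n(\varepsilon) = F_{n-1}(\varepsilon)(A_n + \varepsilon B_n)$ of the recursive definition. If the product symbol is instead read with the highest index on the left, reverse both lists before calling the helpers; the quadratic scaling of the error is unaffected.

```python
import numpy as np

def chain_product(mats):
    """Multiply a non-empty list of square matrices left to right."""
    out = np.eye(mats[0].shape[0], dtype=mats[0].dtype)
    for M in mats:
        out = out @ M
    return out

def first_order_term(A, B):
    """Sum over m of A_1..A_{m-1} B_m A_{m+1}..A_n (assumed left-to-right ordering)."""
    d = A[0].shape[0]
    total = np.zeros((d, d), dtype=A[0].dtype)
    for m in range(len(A)):
        # Replace the m-th unperturbed factor by its perturbation B_m.
        total += chain_product(A[:m] + [B[m]] + A[m + 1:])
    return total

def expansion_errors(A, B, eps_values):
    """Operator 2-norm of F(eps) - (F(0) + eps * L_F) for each eps."""
    F0 = chain_product(A)
    L = first_order_term(A, B)
    errors = []
    for eps in eps_values:
        F_eps = chain_product([a + eps * b for a, b in zip(A, B)])
        errors.append(np.linalg.norm(F_eps - (F0 + eps * L), ord=2))
    return np.array(errors)

# Hypothetical test problem: n random factors of size d x d.
rng = np.random.default_rng(0)
n, d = 6, 4
A = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n)]
B = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n)]

eps_values = np.array([1e-1, 1e-2, 1e-3])
errors = expansion_errors(A, B, eps_values)
for eps, err in zip(eps_values, errors):
    # A roughly constant error / eps^2 indicates an O(eps^2) remainder.
    print(f"eps={eps:.0e}  error={err:.3e}  error/eps^2={err / eps**2:.3f}")
```

For small $\varepsilon$ the printed ratio `error/eps^2` should level off, which is the numerical signature of the $O(\varepsilon^2)$ remainder. For larger $\varepsilon$ the ratio drifts as higher-order terms start to matter.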
