Pith · machine review for the scientific record

arXiv: 2605.04115 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI · q-bio.NC

Recognition: 3 theorem links · Lean Theorem

Learning reveals invisible structure in low-rank RNNs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.NC
keywords low-rank RNN · learning dynamics · overlap space · gradient descent · visible and invisible overlaps · recurrent neural networks · ODE reduction

The pith

Learning in low-rank RNNs reduces to low-dimensional ODEs in overlap space, separating functional connections from those that only shape training history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives the gradient-descent dynamics of low-rank recurrent neural networks directly in a reduced space of neuron-pair overlaps rather than in the full weight matrix. This produces a closed-form system of ordinary differential equations that is exact when the network is linear and asymptotically exact for large nonlinear networks under a Gaussian approximation. The central move is to split the overlaps into two classes: those that determine the network's instantaneous activity, output, and loss, and those that leave current function unchanged yet still evolve under learning. The authors use this split to show that training can distinguish between networks that perform identically and that certain overlaps can retain a record of past training steps.
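For orientation, here is a minimal sketch of the standard rank-1 parameterization used in this literature (the notation is ours and may differ from the paper's):

```latex
% Rank-1 RNN: connectivity is an outer product of two N-vectors
\tau \dot{x}_i = -x_i + \sum_j J_{ij}\,\phi(x_j) + I_i,
\qquad J = \tfrac{1}{N}\, m\, n^{\top}.
% Activity concentrates along m, x(t) \approx \kappa(t)\, m, and in the
% linear case \phi(x)=x the dynamics close in the single overlap
% \sigma_{mn} = m \cdot n / N:
\tau \dot{\kappa} = \left(\sigma_{mn} - 1\right)\kappa .
```

Gradient descent on the vectors m and n then moves these overlaps, and the paper's contribution is that the induced overlap dynamics themselves close into a low-dimensional ODE system.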

Core claim

We extend the low-rank framework from activity to learning by deriving gradient-descent dynamics directly in a reduced overlap space. We formulate a closed-form, low-dimensional system of ODEs that governs learning in this space, exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. Central to our analysis is a distinction between two classes of overlaps: loss-visible overlaps, which fully determine network activity, output, and loss, and loss-invisible overlaps, which do not affect function but are required to describe learning. We illustrate the consequences of this decomposition through two phenomena. First, we show that learning can serve as a perturbation that exposes differences in connectivity between functionally equivalent networks.

What carries the argument

The decomposition of pairwise overlaps into loss-visible and loss-invisible classes within the low-rank connectivity space, which reduces the full weight-update dynamics to a low-dimensional, closed-form ODE system.

If this is right

  • Learning acts as a perturbation that can expose differences in connectivity between networks that produce identical activity and loss.
  • Loss-invisible overlaps can function as memory variables that encode aspects of training history when certain conditions on the loss and update rules hold.
  • The reduced dynamics remain low-dimensional regardless of network size, providing a scalable description of learning.
  • The theory yields concrete, testable predictions for how biological circuits might change during learning without immediately altering behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the separation holds, one could in principle train networks to store task history in invisible overlaps while keeping current performance unchanged.
  • The same visible-invisible split might offer a way to model how biological synapses can carry long-term learning traces that are not expressed in immediate circuit output.
  • Checking the approximation in moderately sized networks could identify the practical limits of the large-N reduction.

Load-bearing premise

The network must be large and its connectivity low-rank so that the Gaussian limit applies and the high-dimensional learning problem collapses to the overlap variables.
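A quick numerical illustration of why this premise matters (a generic sketch, not code from the paper): overlaps between independent Gaussian vectors concentrate as N grows, which is what lets a handful of scalar overlap variables stand in for the full weight configuration.

```python
import numpy as np

# Generic check (not from the paper): the overlap m·n/N between
# independent Gaussian vectors fluctuates with std ~ 1/sqrt(N),
# so individual overlaps become well-defined scalars at large N.
rng = np.random.default_rng(0)

def overlap_std(N, trials=2000):
    m = rng.standard_normal((trials, N))
    n = rng.standard_normal((trials, N))
    return np.std((m * n).sum(axis=1) / N)  # empirical std of m·n/N

s_small, s_large = overlap_std(100), overlap_std(10_000)
print(s_small, s_large)  # std shrinks roughly like 1/sqrt(N)
```

Growing N by a factor of 100 shrinks the overlap fluctuations by roughly a factor of 10, which is the concentration the large-N reduction relies on.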

What would settle it

Run gradient descent on a finite but large low-rank RNN and check whether the time evolution of the measured overlap values follows the trajectories predicted by the derived ODE system.
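The check can be miniaturized in a hypothetical toy (not the paper's RNN): take a loss that depends on vectors m, n ∈ R^N only through the visible overlap σ = m·n/N, say L = ½(σ − σ⋆)². Gradient descent on the vectors then induces an exactly closed three-variable recursion in (σ, ‖m‖²/N, ‖n‖²/N), with the norms playing the role of invisible overlaps: they evolve under learning without entering the loss.

```python
import numpy as np

# Toy analogue (not the paper's model): loss L = 0.5*(sigma - target)^2
# depends only on the visible overlap sigma = m·n/N. Gradient descent on
# the N-dimensional vectors induces an exactly closed recursion in the
# three overlaps (sigma, smm, snn), mirroring the linear-case exactness.
rng = np.random.default_rng(1)
N, eta, target = 500, 0.5, 1.5
m = rng.standard_normal(N)
n = rng.standard_normal(N)

# overlap-space state: visible sigma, invisible norms
s = m @ n / N
smm = m @ m / N
snn = n @ n / N

for _ in range(200):
    a = eta * (s - target)              # shared prefactor dL/dsigma
    # full gradient descent on the N-dimensional vectors
    m, n = m - a * n / N, n - a * m / N
    # exactly closed update in overlap space (same algebra, 3 variables)
    s, smm, snn = (s - a * (smm + snn) / N + a**2 * s / N**2,
                   smm - 2 * a * s / N + a**2 * snn / N**2,
                   snn - 2 * a * s / N + a**2 * smm / N**2)

print(abs(m @ n / N - s))  # overlap recursion tracks the full dynamics
```

The visible overlap σ descends toward its target while the invisible norms drift as a side effect of the same updates, which is the qualitative behavior the paper's decomposition predicts.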

Figures

Figures reproduced from arXiv: 2605.04115 by Omri Barak, Yoav Ger.

Figure 1
Figure 1: (a) High-dimensional RNN in parameter θ-space (top): input x drives activity h through input weights m, recurrent connectivity W, and readout weights z to produce output y. For a low-rank RNN, the same input–output function is captured by an effective model described by a small set of scalar overlaps σ (bottom). (b) Schematic illustration of a learning trajectory in the loss landscape over the parameter θ … view at source ↗
Figure 2
Figure 2: (a) Two RNNs parametrized by θ1 and θ2 share identical loss-visible overlaps σ (=) but differ in their loss-invisible overlaps σ̃ (≠). (b) Initial values of all overlaps (blue: θ1, red: θ2) show identical visible components (top) but differences in the invisible components (bottom), including σ_mu and ‖u‖². (c–d) Because the input–output function depends only on the visible set, their hidden activity (c) and outputs… view at source ↗
Figure 3
Figure 3: (a) Illustration of the solution manifold in overlap space for tasks A and B. For each task, the loss fixes a subset (or all) of the loss-visible overlaps σ (x-axis), leaving a continuous manifold (black lines) of equivalent solutions parameterized by the loss-invisible overlaps σ̃ (y-axis). Under an A–B–A training protocol, retraining on task A can either (1) recover the original solution (blue) or (2… view at source ↗
Figure 4
Figure 4: (a) Example input sequence of the flip-flop task (top). Target output (black), high-dimensional RNN prediction (blue), and effective RNN prediction (red dashed) show excellent agreement (bottom). (b) Training loss of the RNN in parameter space θ (blue) and of the 10D overlap dynamics using G(θ) (red dashed), which closely match, while direct optimization in σ space (black dashed) leads to different dynamics. (c) Dis… view at source ↗
Figure 5
Figure 5: (a) Filter task with white-noise input x (gray) and target output y⋆ (blue). (b) Impulse responses of the target filter (solid blue), with gain a⋆ = 1 and decay rate c⋆ = 0.2, and the final learned RNN function (dashed black). (c) Training loss of the RNN: numerical simulation of the high-dimensional network (solid blue) and the corresponding ODE theory (dashed black), showing excellent agreement. (d,e) Dyn… view at source ↗
Figure 6
Figure 6: Trajectories of all ten overlaps in the A–B–A protocol, supplementing Fig. 3 of the main text. view at source ↗
Figure 7
Figure 7: Trajectories of all ten overlaps in the A–B–A protocol using the Adam optimizer… view at source ↗
Figure 8
Figure 8: Training on the flip-flop task with different optimizers. view at source ↗
Figure 9
Figure 9: (a) Training loss for the A/B → C protocol, for an example run of network 1 (A→C; blue) and network 2 (B→C; red). (b) Overlaps at the end of phase 1 (A/B; epoch 30,000), showing that both loss-visible (blue) and loss-invisible (red) overlaps settle to distinct values. (c) After training on task C (epoch 60,000), loss-visible overlaps converge to the same values, while loss-invisible overlaps remain distinc… view at source ↗
Figure 10
Figure 10: Trajectories of all ten overlaps in the A/B… view at source ↗
Figure 11
Figure 11: Augmented 10×10 Gram matrix Ḡ for the linear (left) and nonlinear (right) rank-1 RNNs. Rows correspond to overlaps being updated, and columns to the associated loss gradients. Blue letters/dots denote loss-visible overlaps; red letters/dots denote loss-invisible overlaps. Colored circles indicate the coefficients of Ḡ for which the gradient is nonzero. In the linear case, the structure is cleanly se… view at source ↗
Figure 12
Figure 12: (a) Impulse response of the target damped-oscillatory filter (solid blue) and the final learned RNN response (dashed black). (b) Training loss for the full high-dimensional simulation (solid blue) and the overlap-based ODE theory (dashed black). (c) Dynamics of the 9 loss-visible and (d) 12 loss-invisible overlaps, comparing numerical simulations (solid) with theoretical predictions (dashed). Note: For a genera… view at source ↗
read the original abstract

Learning in neural systems arises from synaptic changes that reshape the representations underlying behavior. While low-rank recurrent neural networks (RNNs) have emerged as a powerful framework for linking connectivity to function, a theoretical understanding of their learning process remains elusive. Here, we extend the low-rank framework from activity to learning by deriving gradient-descent dynamics directly in a reduced overlap space. We formulate a closed-form, low-dimensional system of ODEs that governs learning in this space, exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. Central to our analysis is a distinction between two classes of overlaps: loss-visible overlaps, which fully determine network activity, output, and loss, and loss-invisible overlaps, which do not affect function but are required to describe learning. We illustrate the consequences of this decomposition through two phenomena. First, we show that learning can serve as a perturbation that exposes differences in connectivity between functionally equivalent networks. Second, we show that loss-invisible overlaps can act as memory variables that encode training history, and characterize the conditions under which this occurs. Finally, we present several testable predictions for biological learning experiments derived from our theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript derives gradient-descent dynamics for low-rank RNNs directly in a reduced space of overlaps, yielding a closed-form low-dimensional system of ODEs. The reduction is stated to be exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. A key distinction is drawn between loss-visible overlaps (which determine activity, output, and loss) and loss-invisible overlaps (which do not affect function but participate in learning). The authors illustrate two consequences: learning acting as a perturbation that reveals connectivity differences between functionally equivalent networks, and loss-invisible overlaps serving as memory variables that encode training history under certain conditions. Testable predictions for biological experiments are also presented.

Significance. If the claimed reduction is rigorously established, the work supplies a tractable theoretical handle on how low-rank connectivity evolves under learning, bridging connectivity-based models with gradient-based training. The visible/invisible overlap decomposition offers a concrete mechanism for why functional equivalence can mask structural differences and how training history can be stored without altering immediate behavior. The exact linear case and the proposed biological predictions are concrete strengths that could guide both modeling and experiment.

major comments (1)
  1. [Derivation of the overlap ODEs (main theoretical section)] The central claim of asymptotic exactness for nonlinear RNNs rests on closure of the overlap dynamics in the large-N Gaussian limit. The manuscript must supply the explicit steps (or numerical checks) demonstrating that higher-order moments of the pre-activations factorize appropriately under the gradient flow and that the low-rank structure is preserved (or that rank inflation remains negligible). Without this verification, the reduction to a closed-form ODE system is not guaranteed for general activations or initializations.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of the dimension of the reduced ODE system and the precise form of the visible/invisible overlap variables.
  2. [Notation and setup] Notation for the overlap matrices and the loss function should be introduced with a single consolidated table or equation block early in the text to aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address the major comment on the derivation of the overlap ODEs below.

read point-by-point responses
  1. Referee: [Derivation of the overlap ODEs (main theoretical section)] The central claim of asymptotic exactness for nonlinear RNNs rests on closure of the overlap dynamics in the large-N Gaussian limit. The manuscript must supply the explicit steps (or numerical checks) demonstrating that higher-order moments of the pre-activations factorize appropriately under the gradient flow and that the low-rank structure is preserved (or that rank inflation remains negligible). Without this verification, the reduction to a closed-form ODE system is not guaranteed for general activations or initializations.

    Authors: We agree that a more explicit verification of the moment closure and rank preservation would strengthen the manuscript. In the revised version we have added Appendix D, which supplies the requested derivation. Under the large-N Gaussian assumption, the central limit theorem applied to the sum over neurons ensures that pre-activations remain Gaussian; higher-order moments therefore factorize into products of pairwise overlaps. The gradient updates are outer products of the visible and invisible overlaps with the input and output vectors, which by construction preserve the original low-rank form up to o(1) corrections that vanish as N grows. We also include numerical checks for tanh activations across a range of initializations, confirming that the reduced ODEs match full-network trajectories for large N. These additions make the asymptotic exactness claim rigorous for the stated regime. revision: yes
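The moment-closure step invoked in the rebuttal can be sanity-checked numerically. Below is a generic illustration (not the authors' code) of the Gaussian factorization (Isserlis/Wick theorem) that reduces fourth moments to products of pairwise overlaps:

```python
import numpy as np

# Illustrative check of the Gaussian moment factorization (Isserlis/Wick)
# underlying the rebuttal's closure argument: for zero-mean jointly
# Gaussian x, y,
#   E[x^2 y^2] = E[x^2] E[y^2] + 2 E[xy]^2,
# so fourth moments are fixed by pairwise overlaps alone.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

lhs = np.mean(x**2 * y**2)
rhs = np.mean(x**2) * np.mean(y**2) + 2 * np.mean(x * y) ** 2
print(lhs, rhs)  # agree up to sampling error
```

With the covariance above the identity gives 1·2 + 2·0.6² = 2.72, and the two Monte-Carlo estimates agree to sampling precision; if the pre-activations were not Gaussian, the overlap ODEs would pick up unclosed higher-moment terms.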

Circularity Check

0 steps flagged

Derivation of overlap-space ODEs is self-contained with no circular reductions

full rationale

The paper starts from standard gradient descent on a loss function and reduces the dynamics to a low-dimensional system of ODEs in overlap space. This reduction is exact for linear RNNs by direct algebraic closure and asymptotically exact for nonlinear RNNs under the stated large-N Gaussian limit, which is an external modeling assumption rather than a quantity fitted or defined in terms of the target result. No equations are shown to be equivalent to their inputs by construction, no parameters are fitted to a subset and then relabeled as predictions, and no load-bearing steps rely on self-citations or imported uniqueness theorems. The visible/invisible overlap distinction follows directly from the low-rank connectivity premise without self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central reduction rests on the low-rank connectivity assumption and the large-N Gaussian limit for nonlinear cases; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption RNN connectivity is low-rank
    Allows reduction of the weight matrix to a small set of overlap variables.
  • domain assumption Large-N Gaussian limit for nonlinear RNNs
    Required for the learning ODEs to be asymptotically exact.

pith-pipeline@v0.9.0 · 5505 in / 1216 out tokens · 47541 ms · 2026-05-08T18:34:17.408572+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 7 canonical work pages · 2 internal anchors
