Pith · machine review for the scientific record

arXiv: 2605.04115 · v1 · submitted 2026-05-05 · 💻 cs.LG · cs.AI · q-bio.NC

Recognition: 3 theorem links · Lean Theorem

Learning reveals invisible structure in low-rank RNNs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:34 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · q-bio.NC
keywords low-rank RNN · learning dynamics · overlap space · gradient descent · visible and invisible overlaps · recurrent neural networks · ODE reduction

The pith

Learning in low-rank RNNs reduces to low-dimensional ODEs in overlap space, separating functional connections from those that only shape training history.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives the gradient-descent dynamics of low-rank recurrent neural networks directly in a reduced space of neuron-pair overlaps rather than in the full weight matrix. This produces a closed-form system of ordinary differential equations that is exact when the network is linear and asymptotically exact for large nonlinear networks under a Gaussian approximation. The central move is to split the overlaps into two classes: those that determine the network's instantaneous activity, output, and loss, and those that leave current function unchanged yet still evolve under learning. The authors use this split to show that training can distinguish between networks that perform identically and that certain overlaps can retain a record of past training steps.
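For orientation, here is a minimal sketch of the standard rank-1 parameterization used in this literature (the notation is ours and may differ from the paper's):

```latex
% Rank-1 RNN: connectivity is an outer product of two N-vectors
\tau \dot{x}_i = -x_i + \sum_j J_{ij}\,\phi(x_j) + I_i,
\qquad J = \tfrac{1}{N}\, m\, n^{\top}.
% Activity concentrates along m, x(t) \approx \kappa(t)\, m, and in the
% linear case \phi(x)=x the dynamics close in the single overlap
% \sigma_{mn} = m \cdot n / N:
\tau \dot{\kappa} = \left(\sigma_{mn} - 1\right)\kappa .
```

Gradient descent on the vectors m and n then moves these overlaps, and the paper's contribution is that the induced overlap dynamics themselves close into a low-dimensional ODE system.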

Core claim

We extend the low-rank framework from activity to learning by deriving gradient-descent dynamics directly in a reduced overlap space. We formulate a closed-form, low-dimensional system of ODEs that governs learning in this space, exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. Central to our analysis is a distinction between two classes of overlaps: loss-visible overlaps, which fully determine network activity, output, and loss, and loss-invisible overlaps, which do not affect function but are required to describe learning. We illustrate the consequences of this decomposition through two phenomena. First, we show that learning can serve as a perturbation that exposes differences in connectivity between functionally equivalent networks.

What carries the argument

The decomposition of pairwise overlaps into loss-visible and loss-invisible classes within the low-rank connectivity space, which reduces the full weight-update dynamics to a low-dimensional, closed-form ODE system.

If this is right

  • Learning acts as a perturbation that can expose differences in connectivity between networks that produce identical activity and loss.
  • Loss-invisible overlaps can function as memory variables that encode aspects of training history when certain conditions on the loss and update rules hold.
  • The reduced dynamics remain low-dimensional regardless of network size, providing a scalable description of learning.
  • The theory yields concrete, testable predictions for how biological circuits might change during learning without immediately altering behavior.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the separation holds, one could in principle train networks to store task history in invisible overlaps while keeping current performance unchanged.
  • The same visible-invisible split might offer a way to model how biological synapses can carry long-term learning traces that are not expressed in immediate circuit output.
  • Checking the approximation in moderately sized networks could identify the practical limits of the large-N reduction.

Load-bearing premise

The network must be large and its connectivity low-rank so that the Gaussian limit applies and the high-dimensional learning problem collapses to the overlap variables.
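A quick numerical illustration of why this premise matters (a generic sketch, not code from the paper): overlaps between independent Gaussian vectors concentrate as N grows, which is what lets a handful of scalar overlap variables stand in for the full weight configuration.

```python
import numpy as np

# Generic check (not from the paper): the overlap m·n/N between
# independent Gaussian vectors fluctuates with std ~ 1/sqrt(N),
# so individual overlaps become well-defined scalars at large N.
rng = np.random.default_rng(0)

def overlap_std(N, trials=2000):
    m = rng.standard_normal((trials, N))
    n = rng.standard_normal((trials, N))
    return np.std((m * n).sum(axis=1) / N)  # empirical std of m·n/N

s_small, s_large = overlap_std(100), overlap_std(10_000)
print(s_small, s_large)  # std shrinks roughly like 1/sqrt(N)
```

Growing N by a factor of 100 shrinks the overlap fluctuations by roughly a factor of 10, which is the concentration the large-N reduction relies on.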

What would settle it

Run gradient descent on a finite but large low-rank RNN and check whether the time evolution of the measured overlap values follows the trajectories predicted by the derived ODE system.
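The check can be miniaturized in a hypothetical toy (not the paper's RNN): take a loss that depends on vectors m, n ∈ R^N only through the visible overlap σ = m·n/N, say L = ½(σ − σ⋆)². Gradient descent on the vectors then induces an exactly closed three-variable recursion in (σ, ‖m‖²/N, ‖n‖²/N), with the norms playing the role of invisible overlaps: they evolve under learning without entering the loss.

```python
import numpy as np

# Toy analogue (not the paper's model): loss L = 0.5*(sigma - target)^2
# depends only on the visible overlap sigma = m·n/N. Gradient descent on
# the N-dimensional vectors induces an exactly closed recursion in the
# three overlaps (sigma, smm, snn), mirroring the linear-case exactness.
rng = np.random.default_rng(1)
N, eta, target = 500, 0.5, 1.5
m = rng.standard_normal(N)
n = rng.standard_normal(N)

# overlap-space state: visible sigma, invisible norms
s = m @ n / N
smm = m @ m / N
snn = n @ n / N

for _ in range(200):
    a = eta * (s - target)              # shared prefactor dL/dsigma
    # full gradient descent on the N-dimensional vectors
    m, n = m - a * n / N, n - a * m / N
    # exactly closed update in overlap space (same algebra, 3 variables)
    s, smm, snn = (s - a * (smm + snn) / N + a**2 * s / N**2,
                   smm - 2 * a * s / N + a**2 * snn / N**2,
                   snn - 2 * a * s / N + a**2 * smm / N**2)

print(abs(m @ n / N - s))  # overlap recursion tracks the full dynamics
```

The visible overlap σ descends toward its target while the invisible norms drift as a side effect of the same updates, which is the qualitative behavior the paper's decomposition predicts.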

Figures

Figures reproduced from arXiv: 2605.04115 by Omri Barak, Yoav Ger.

Figure 1
Figure 1: (a) High-dimensional RNN in parameter θ-space (top): input x drives activity h through input weights m, recurrent connectivity W, and readout weights z to produce output y. For a low-rank RNN, the same input–output function is captured by an effective model described by a small set of scalar overlaps σ (bottom). (b) Schematic illustration of a learning trajectory in the loss landscape over the parameter θ … view at source ↗
Figure 2
Figure 2: (a) Two RNNs parametrized by θ1 and θ2 share identical loss-visible overlaps σ (=) but differ in their loss-invisible overlaps σ̃ (≠). (b) Initial values of all overlaps (blue: θ1, red: θ2) show identical visible components (top) but differences in the invisible components (bottom), including σ_mu and ‖u‖². (c–d) Because the input–output function depends only on the visible set, their hidden activity (c) and outputs… view at source ↗
Figure 3
Figure 3: (a) Illustration of the solution manifold in overlap space for tasks A and B. For each task, the loss fixes a subset (or all) of the loss-visible overlaps σ (x-axis), leaving a continuous manifold (black lines) of equivalent solutions parameterized by the loss-invisible overlaps σ̃ (y-axis). Under an A–B–A training protocol, retraining on task A can either (1) recover the original solution (blue) or (2… view at source ↗
Figure 4
Figure 4: (a) Example input sequence of the flip-flop task (top). Target output (black), high-dimensional RNN prediction (blue), and effective RNN prediction (red dashed) show excellent agreement (bottom). (b) Training loss of the RNN in parameter space θ (blue) and of the 10D overlap dynamics using G(θ) (red dashed), which closely match, while direct optimization in σ space (black dashed) leads to different dynamics. (c) Dis… view at source ↗
Figure 5
Figure 5: (a) Filter task with white-noise input x (gray) and target output y⋆ (blue). (b) Impulse responses of the target filter (solid blue), with gain a⋆ = 1 and decay rate c⋆ = 0.2, and the final learned RNN function (dashed black). (c) Training loss of the RNN: numerical simulation of the high-dimensional network (solid blue) and the corresponding ODE theory (dashed black), showing excellent agreement. (d,e) Dyn… view at source ↗
Figure 6
Figure 6: Trajectories of all ten overlaps in the A–B–A protocol, supplementing Fig. 3 of the main text. view at source ↗
Figure 7
Figure 7: Trajectories of all ten overlaps in the A–B–A protocol using the Adam optimizer… view at source ↗
Figure 8
Figure 8: Training on the flip-flop task with different optimizers. view at source ↗
Figure 9
Figure 9: (a) Training loss for the A/B → C protocol, for an example run of network 1 (A→C; blue) and network 2 (B→C; red). (b) Overlaps at the end of phase 1 (A/B; epoch 30,000), showing that both loss-visible (blue) and loss-invisible (red) overlaps settle to distinct values. (c) After training on task C (epoch 60,000), loss-visible overlaps converge to the same values, while loss-invisible overlaps remain distinc… view at source ↗
Figure 10
Figure 10: Trajectories of all ten overlaps in the A/B… view at source ↗
Figure 11
Figure 11: Augmented 10×10 Gram matrix Ḡ for the linear (left) and nonlinear (right) rank-1 RNNs. Rows correspond to overlaps being updated, and columns to the associated loss gradients. Blue letters/dots denote loss-visible overlaps; red letters/dots denote loss-invisible overlaps. Colored circles indicate the coefficients of Ḡ for which the gradient is nonzero. In the linear case, the structure is cleanly se… view at source ↗
Figure 12
Figure 12: (a) Impulse response of the target damped-oscillatory filter (solid blue) and the final learned RNN response (dashed black). (b) Training loss for the full high-dimensional simulation (solid blue) and the overlap-based ODE theory (dashed black). (c) Dynamics of the 9 loss-visible and (d) 12 loss-invisible overlaps, comparing numerical simulations (solid) with theoretical predictions (dashed). Note: For a genera… view at source ↗
read the original abstract

Learning in neural systems arises from synaptic changes that reshape the representations underlying behavior. While low-rank recurrent neural networks (RNNs) have emerged as a powerful framework for linking connectivity to function, a theoretical understanding of their learning process remains elusive. Here, we extend the low-rank framework from activity to learning by deriving gradient-descent dynamics directly in a reduced overlap space. We formulate a closed-form, low-dimensional system of ODEs that governs learning in this space, exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. Central to our analysis is a distinction between two classes of overlaps: loss-visible overlaps, which fully determine network activity, output, and loss, and loss-invisible overlaps, which do not affect function but are required to describe learning. We illustrate the consequences of this decomposition through two phenomena. First, we show that learning can serve as a perturbation that exposes differences in connectivity between functionally equivalent networks. Second, we show that loss-invisible overlaps can act as memory variables that encode training history, and characterize the conditions under which this occurs. Finally, we present several testable predictions for biological learning experiments derived from our theory.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript derives gradient-descent dynamics for low-rank RNNs directly in a reduced space of overlaps, yielding a closed-form low-dimensional system of ODEs. The reduction is stated to be exact for linear RNNs and asymptotically exact for nonlinear RNNs in the large-N Gaussian limit. A key distinction is drawn between loss-visible overlaps (which determine activity, output, and loss) and loss-invisible overlaps (which do not affect function but participate in learning). The authors illustrate two consequences: learning acting as a perturbation that reveals connectivity differences between functionally equivalent networks, and loss-invisible overlaps serving as memory variables that encode training history under certain conditions. Testable predictions for biological experiments are also presented.

Significance. If the claimed reduction is rigorously established, the work supplies a tractable theoretical handle on how low-rank connectivity evolves under learning, bridging connectivity-based models with gradient-based training. The visible/invisible overlap decomposition offers a concrete mechanism for why functional equivalence can mask structural differences and how training history can be stored without altering immediate behavior. The exact linear case and the proposed biological predictions are concrete strengths that could guide both modeling and experiment.

major comments (1)
  1. [Derivation of the overlap ODEs (main theoretical section)] The central claim of asymptotic exactness for nonlinear RNNs rests on closure of the overlap dynamics in the large-N Gaussian limit. The manuscript must supply the explicit steps (or numerical checks) demonstrating that higher-order moments of the pre-activations factorize appropriately under the gradient flow and that the low-rank structure is preserved (or that rank inflation remains negligible). Without this verification, the reduction to a closed-form ODE system is not guaranteed for general activations or initializations.
minor comments (2)
  1. [Abstract] The abstract would benefit from a brief statement of the dimension of the reduced ODE system and the precise form of the visible/invisible overlap variables.
  2. [Notation and setup] Notation for the overlap matrices and the loss function should be introduced with a single consolidated table or equation block early in the text to aid readability.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive comments. We address the major comment on the derivation of the overlap ODEs below.

read point-by-point responses
  1. Referee: [Derivation of the overlap ODEs (main theoretical section)] The central claim of asymptotic exactness for nonlinear RNNs rests on closure of the overlap dynamics in the large-N Gaussian limit. The manuscript must supply the explicit steps (or numerical checks) demonstrating that higher-order moments of the pre-activations factorize appropriately under the gradient flow and that the low-rank structure is preserved (or that rank inflation remains negligible). Without this verification, the reduction to a closed-form ODE system is not guaranteed for general activations or initializations.

    Authors: We agree that a more explicit verification of the moment closure and rank preservation would strengthen the manuscript. In the revised version we have added Appendix D, which supplies the requested derivation. Under the large-N Gaussian assumption, the central limit theorem applied to the sum over neurons ensures that pre-activations remain Gaussian; higher-order moments therefore factorize into products of pairwise overlaps. The gradient updates are outer products of the visible and invisible overlaps with the input and output vectors, which by construction preserve the original low-rank form up to o(1) corrections that vanish as N grows. We also include numerical checks for tanh activations across a range of initializations, confirming that the reduced ODEs match full-network trajectories for large N. These additions make the asymptotic exactness claim rigorous for the stated regime. revision: yes
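The moment-closure step invoked in the rebuttal can be sanity-checked numerically. Below is a generic illustration (not the authors' code) of the Gaussian factorization (Isserlis/Wick theorem) that reduces fourth moments to products of pairwise overlaps:

```python
import numpy as np

# Illustrative check of the Gaussian moment factorization (Isserlis/Wick)
# underlying the rebuttal's closure argument: for zero-mean jointly
# Gaussian x, y,
#   E[x^2 y^2] = E[x^2] E[y^2] + 2 E[xy]^2,
# so fourth moments are fixed by pairwise overlaps alone.
rng = np.random.default_rng(2)
cov = np.array([[1.0, 0.6], [0.6, 2.0]])
x, y = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000).T

lhs = np.mean(x**2 * y**2)
rhs = np.mean(x**2) * np.mean(y**2) + 2 * np.mean(x * y) ** 2
print(lhs, rhs)  # agree up to sampling error
```

With the covariance above the identity gives 1·2 + 2·0.6² = 2.72, and the two Monte-Carlo estimates agree to sampling precision; if the pre-activations were not Gaussian, the overlap ODEs would pick up unclosed higher-moment terms.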

Circularity Check

0 steps flagged

Derivation of overlap-space ODEs is self-contained with no circular reductions

full rationale

The paper starts from standard gradient descent on a loss function and reduces the dynamics to a low-dimensional system of ODEs in overlap space. This reduction is exact for linear RNNs by direct algebraic closure and asymptotically exact for nonlinear RNNs under the stated large-N Gaussian limit, which is an external modeling assumption rather than a quantity fitted or defined in terms of the target result. No equations are shown to be equivalent to their inputs by construction, no parameters are fitted to a subset and then relabeled as predictions, and no load-bearing steps rely on self-citations or imported uniqueness theorems. The visible/invisible overlap distinction follows directly from the low-rank connectivity premise without self-reference.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central reduction rests on the low-rank connectivity assumption and the large-N Gaussian limit for nonlinear cases; no free parameters or new entities are introduced in the abstract.

axioms (2)
  • domain assumption RNN connectivity is low-rank
    Allows reduction of the weight matrix to a small set of overlap variables.
  • domain assumption Large-N Gaussian limit for nonlinear RNNs
    Required for the learning ODEs to be asymptotically exact.

pith-pipeline@v0.9.0 · 5505 in / 1216 out tokens · 47541 ms · 2026-05-08T18:34:17.408572+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

55 extracted references · 7 canonical work pages · 2 internal anchors
