A ghost mechanism: An analytical model of abrupt learning in recurrent networks

Bariscan Kurtkaya; Ege Cirakman; Fatih Dinc; Hidenori Tanaka; Mark J. Schnitzer; Mert Yuksekgonul; Yiqi Jiang

arXiv: 2501.02378 · v2 · submitted 2025-01-04 · cs.LG · q-bio.NC · stat.ML

Fatih Dinc, Ege Cirakman, Bariscan Kurtkaya, Mert Yuksekgonul, Yiqi Jiang, Mark J. Schnitzer, Hidenori Tanaka

Pith reviewed 2026-05-23 06:29 UTC · model grok-4.3

classification: cs.LG · q-bio.NC · stat.ML

keywords: ghost mechanism · abrupt learning · recurrent neural networks · saddle-node bifurcation · gradient collapse · working memory tasks · low-rank RNNs

The pith

Recurrent networks exhibit abrupt learning when high-dimensional dynamics near ghost points reduce to a one-dimensional canonical form governed by a single scale parameter.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that abrupt performance jumps in RNNs trained on working memory tasks arise from transient slowdowns near ghost points, the remnants of saddle-node bifurcations. Reducing the local dynamics to a one-dimensional model shows that learning is controlled by one scale parameter and identifies a critical learning rate that follows an inverse power law with the timescale of the required computation. Past this rate the system collapses through vanishing gradients and oscillatory gradients that trap parameters in no-learning zones of high-confidence errors. The model is validated first in low-rank RNNs where ghost points precede transitions and then in full-rank networks on standard tasks. Two remedies follow directly: raising the trainable rank stabilizes trajectories and lowering output confidence avoids entrapment.

Core claim

By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation. Beyond this rate, learning collapses through two interacting modes: (i) vanishing gradients and (ii) oscillatory gradients near minima. These features can lock the system into high-confidence but incorrect predictions when parameter updates trigger a no-learning zone.

What carries the argument

The ghost mechanism, defined as the transient slowdown of dynamical systems near the remnant of a saddle-node bifurcation, reduced to a one-dimensional canonical form that governs learning via a single scale parameter.
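As a rough illustration of the transient slowdown described above, the sketch below integrates the textbook saddle-node normal form dx/dt = r + x^2 just past the bifurcation (r > 0, so no fixed point remains). The normal form, the function names, and the integration bounds are assumptions made for illustration; the paper's own canonical form and scale parameter are not reproduced in this summary and may differ. The point is only that the passage time through the bottleneck near x = 0 grows roughly like π/√r as r shrinks.

```python
import math

def passage_time(r, x_start=-10.0, x_end=10.0, dt=1e-3):
    """Forward-Euler integration of dx/dt = r + x**2 (assumed normal form).

    Returns the time needed to travel from x_start to x_end. For small
    r > 0 most of that time is spent near the ghost at x = 0.
    """
    x, t = x_start, 0.0
    while x < x_end:
        x += dt * (r + x * x)
        t += dt
    return t

def analytic_passage_time(r, x_start=-10.0, x_end=10.0):
    """Closed-form integral of dx / (r + x**2) between the same bounds."""
    s = math.sqrt(r)
    return (math.atan(x_end / s) - math.atan(x_start / s)) / s

if __name__ == "__main__":
    for r in (0.1, 0.01):
        print(
            f"r={r}: numeric={passage_time(r):.2f}, "
            f"analytic={analytic_passage_time(r):.2f}, "
            f"pi/sqrt(r)={math.pi / math.sqrt(r):.2f}"
        )
```

Shrinking r by a factor of 100 lengthens the bottleneck by roughly a factor of 10, which is the sense in which a ghost point can extend the effective timescale of a computation.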

If this is right

Learning trajectories in RNNs are shaped by proximity to ghost points in state space.
A critical learning rate exists; exceeding it triggers collapse via vanishing or oscillatory gradients.
Increasing the number of trainable ranks prevents the system from entering no-learning zones.
Lowering output confidence mitigates entrapment in no-learning zones and allows escape from incorrect high-confidence states.
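The last point can be made concrete with a toy calculation. The sketch below assumes a single sigmoid output trained with squared error and models "lower output confidence" as a temperature that divides the logit; neither choice is taken from the paper, whose loss and readout are not specified in this summary. It shows how a saturated but wrong output yields an almost vanishing gradient, and how softening the output restores a usable one.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_logit(z, target, temperature=1.0):
    """Gradient of 0.5 * (sigmoid(z / T) - target)**2 with respect to z.

    Assumed toy setup: scalar logit z, squared-error loss, temperature T.
    """
    p = sigmoid(z / temperature)
    return (p - target) * p * (1.0 - p) / temperature

if __name__ == "__main__":
    z, target = -10.0, 1.0  # confidently predicts 0 while the target is 1
    for temperature in (1.0, 5.0):
        g = grad_wrt_logit(z, target, temperature)
        print(f"T={temperature}: output={sigmoid(z / temperature):.5f}, grad={g:.2e}")
```

With the same logit, the higher temperature gives a gradient several hundred times larger in magnitude, which is one simple way a less confident output could let training escape such a state.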

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same reduction might apply to other recurrent or state-space models that develop slow manifolds during training.
The inverse-power scaling could be tested directly by varying task delay lengths while holding network size fixed.
The no-learning zone concept suggests that confidence-calibration methods used in other domains may also stabilize RNN training.

Load-bearing premise

The high-dimensional RNN dynamics near ghost points reduce to the stated one-dimensional canonical form without losing the features that produce abrupt learning and gradient collapse.

What would settle it

Measure whether the observed critical learning rate in RNN training on working-memory tasks follows the predicted inverse power-law dependence on the task's intrinsic timescale.
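One way such a measurement could be analyzed is a log-log regression of the measured critical learning rate against the task timescale. The helper below is a hypothetical sketch: the function name is an assumption, and the numbers in the usage example are synthetic placeholders generated from an exact power law, not results from the paper.

```python
import math

def fit_power_law(timescales, critical_rates):
    """Least-squares fit of rate = c * timescale**(-alpha) in log-log space.

    Returns (alpha, c). A positive alpha means the critical rate falls
    as the timescale grows.
    """
    xs = [math.log(t) for t in timescales]
    ys = [math.log(r) for r in critical_rates]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

if __name__ == "__main__":
    # Synthetic placeholder data (NOT from the paper): rate = 0.5 * T**-2.
    timescales = [10.0, 20.0, 40.0, 80.0]
    rates = [0.5 * t ** -2 for t in timescales]
    alpha, c = fit_power_law(timescales, rates)
    print(f"alpha={alpha:.3f}, c={c:.3f}")
```

In a real test, the timescales would be set by the task (for example the delay length) and each critical rate would come from a learning-rate sweep; a stable exponent across network sizes and seeds would support the predicted scaling.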

Figures

Figures reproduced from arXiv: 2501.02378 by Bariscan Kurtkaya, Ege Cirakman, Fatih Dinc, Hidenori Tanaka, Mark J. Schnitzer, Mert Yuksekgonul, Yiqi Jiang.

**Figure 2.** [image: figures/full_fig_p003_2.png]

**Figure 3.** [image: figures/full_fig_p004_3.png]

**Figure 4.** [image: figures/full_fig_p005_4.png]

Original abstract

Abrupt learning is a common phenomenon in recurrent neural networks (RNNs) trained on working memory tasks. In such cases, the networks develop transient slow regions in state space that extend the effective timescales of computation. However, the mechanisms driving sudden performance improvements and their causal role remain unclear. To address this gap, we introduce the ghost mechanism, a process by which dynamical systems exhibit transient slowdown near the remnant of a saddle-node bifurcation. By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation. Beyond this rate, learning collapses through two interacting modes: (i) vanishing gradients and (ii) oscillatory gradients near minima. These features can lock the system into high-confidence but incorrect predictions when parameter updates trigger a no-learning zone, a region of parameter space where gradients vanish. We validate these predictions in low-rank RNNs, where ghost points precede abrupt transitions, and further demonstrate their generality in full-rank RNNs trained on canonical working memory tasks. Our theory offers two approaches to address these learning difficulties: increasing trainable ranks stabilizes learning trajectories, while reducing output confidence mitigates entrapment in no-learning zones. Overall, the ghost mechanism reveals how the computational demands of a task constrain the optimization landscape, demonstrating that well-known learning difficulties in RNNs partly arise from the dynamical systems they must learn to implement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims a new ghost mechanism via 1D reduction of RNN dynamics near saddle-node remnants, but that reduction step is the main point to verify.

The one or two things to know: this paper introduces the ghost mechanism to explain abrupt learning in RNNs on working memory tasks. They reduce high-dimensional dynamics near remnants of saddle-node bifurcations to a 1D canonical form controlled by a single scale parameter, then derive a critical learning rate that scales as an inverse power law with the task timescale, plus two gradient collapse modes that can trap the network in a no-learning zone of high-confidence errors. They validate the picture in low-rank RNNs and claim it holds in full-rank cases too, with practical fixes like raising trainable rank or lowering output confidence.

Referee Report

3 major / 2 minor

Summary. The paper introduces the 'ghost mechanism' to explain abrupt learning in RNNs on working memory tasks. It claims that high-dimensional dynamics near remnants of saddle-node bifurcations (ghost points) reduce to a one-dimensional canonical form controlled by a single scale parameter. From this reduction, the authors analytically derive a critical learning rate that scales as an inverse power law with the learned computation timescale, explain learning collapse via vanishing and oscillatory gradient modes, and identify a no-learning zone. Predictions are validated in low-rank RNNs (where ghost points precede transitions) and extended to full-rank RNNs, with proposed mitigations of increasing trainable rank or reducing output confidence.

Significance. If the 1D reduction is rigorously justified and preserves the essential slow-manifold dynamics, the work would link task computational structure directly to optimization landscape features in RNN training, offering analytical predictions for critical rates and gradient pathologies that are currently observed empirically. The explicit scaling relation, cross-validation in low- and full-rank cases, and concrete mitigation strategies constitute strengths; the approach could inform both theory and practical training heuristics if the central reduction holds without hidden parameter dependence.

major comments (3)

[Abstract and §2] The reduction of high-dimensional RNN dynamics near ghost points to the stated 1D canonical form (Abstract; §2) is the load-bearing step for all subsequent claims, including the inverse-power-law critical rate, gradient modes, and no-learning zone. The manuscript provides no explicit error analysis, transverse stability conditions, or demonstration that higher-dimensional effects (e.g., rank-dependent transients) remain negligible, leaving open whether the essential slow-manifold features driving abrupt learning are preserved.
[§3] The critical learning rate is stated to scale as an inverse power law with the timescale of the learned computation and to be controlled by a single scale parameter (Abstract; §3). Because the timescale itself appears to be an input or fitted quantity in the 1D model, the reported scaling risks reducing to a tautological relation rather than an independent prediction; explicit parameter-free derivation or cross-validation against un-fitted simulation data is required to establish independence.
[Validation in full-rank RNNs] Validation in full-rank RNNs (final results section) demonstrates qualitative agreement with the 1D predictions, but lacks quantitative metrics (e.g., predicted vs. observed transition thresholds or gradient-norm distributions) that would confirm the reduction remains accurate when transverse directions are not artificially constrained by low-rank structure.

minor comments (2)

[§2] Notation for the single scale parameter and the ghost-point location should be introduced with a clear equation reference at first use to avoid ambiguity when comparing the 1D model to the original RNN vector field.
[Figures 4-6] Figure captions for the low-rank and full-rank trajectory plots should explicitly state the number of random seeds and the precise definition of 'abrupt transition' used for counting events.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive report. The three major comments identify legitimate gaps in the justification of the 1D reduction, the independence of the scaling prediction, and the quantitative strength of the full-rank validation. We respond to each point below and will incorporate revisions where the manuscript is deficient.

Referee: [Abstract and §2] The reduction of high-dimensional RNN dynamics near ghost points to the stated 1D canonical form (Abstract; §2) is the load-bearing step for all subsequent claims, including the inverse-power-law critical rate, gradient modes, and no-learning zone. The manuscript provides no explicit error analysis, transverse stability conditions, or demonstration that higher-dimensional effects (e.g., rank-dependent transients) remain negligible, leaving open whether the essential slow-manifold features driving abrupt learning are preserved.

Authors: We agree that the manuscript lacks an explicit error analysis and transverse stability conditions for the reduction. Section 2 presents the canonical form via the standard local analysis near a saddle-node ghost, but does not quantify the approximation error or prove transverse contraction rates. In revision we will add an appendix deriving the transverse eigenvalue bounds from the low-rank connectivity and reporting numerical L2 trajectory errors between the full network and the 1D projection for ranks 2–10; this will make the domain of validity explicit.

Revision: yes

Referee: [§3] The critical learning rate is stated to scale as an inverse power law with the timescale of the learned computation and to be controlled by a single scale parameter (Abstract; §3). Because the timescale itself appears to be an input or fitted quantity in the 1D model, the reported scaling risks reducing to a tautological relation rather than an independent prediction; explicit parameter-free derivation or cross-validation against un-fitted simulation data is required to establish independence.

Authors: The timescale enters the 1D model as the inverse distance to the ghost point, which is fixed by the task-defined fixed-point locations rather than fitted to learning curves. The inverse-power-law relation for the critical rate follows directly from nondimensionalization of the canonical equation. To demonstrate independence we will add a supplementary figure that extracts the slow-transient duration from untrained networks (no fitting) and overlays the analytically predicted critical rates; agreement without adjustable parameters will be shown for multiple task timescales.

Revision: yes

Referee: [Validation in full-rank RNNs] Validation in full-rank RNNs (final results section) demonstrates qualitative agreement with the 1D predictions, but lacks quantitative metrics (e.g., predicted vs. observed transition thresholds or gradient-norm distributions) that would confirm the reduction remains accurate when transverse directions are not artificially constrained by low-rank structure.

Authors: We accept that the full-rank section provides only qualitative agreement. In the revision we will augment the final results with two quantitative panels: (i) a scatter plot of predicted versus observed critical learning rates across five task timescales, and (ii) overlaid histograms of gradient norms at collapse onset versus the 1D model distribution. These additions will quantify the accuracy of the reduction outside the low-rank constraint.

Revision: yes

Circularity Check

1 step flagged

Critical learning rate scaling reduces to relation with the model's own single scale parameter

specific steps

self-definitional [Abstract]
"By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation."

The single scale parameter is defined as the controller of learning and is identified with the timescale of the computation. The critical rate is then stated to scale as an inverse power law of that timescale; the reported scaling is therefore an algebraic consequence of the model's own definition rather than an emergent or falsifiable prediction.

full rationale

The derivation reduces high-dimensional RNN dynamics to a 1D canonical form controlled by one scale parameter (the timescale of the learned computation). From this form the paper analytically obtains a critical learning rate scaling as an inverse power law with that same timescale. Because the scaling is derived directly from the parameter that defines the reduced model, the reported 'prediction' is forced by construction rather than an independent test of the ghost mechanism. The reduction step itself is presented as the key analytical contribution, but no external benchmark or non-self-referential verification is shown for the power-law relation. This produces partial circularity (score 6) while the broader claims about gradient collapse and no-learning zones remain downstream of the same reduction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

Ledger extracted from abstract; full paper may add or remove entries. Model rests on dynamical-systems reduction and introduces ghost points as explanatory entities.

free parameters (1)

single scale parameter
Controls the learning process in the derived 1D canonical form; critical rate expressed in terms of it.

axioms (1)

domain assumption: High-dimensional RNN dynamics near saddle-node remnants reduce to the stated 1D canonical form
Invoked to derive the analytical model of abrupt learning.

invented entities (1)

ghost point (no independent evidence)
purpose: Remnant of saddle-node bifurcation producing transient slowdown that extends computational timescales
New explanatory construct introduced to account for abrupt learning.

Reference graph

Works this paper leans on

55 extracted references · 55 canonical work pages · 4 internal anchors

[1]

pathological curvature

and “pathological curvature” [51]. Thus, our toy model provides a simple and analytically tractable start- ing point for exploring potential remedies. Moreover, our analyses with rank-one RNNs suggest an alterna- tive, bifurcation-free, mechanism for abrupt learning. By studying the latent circuits during learning, we identified the emergence of ghost poi...

work page doi:10.5281/zenodo.13686989
[2]

The organization of behavior: A 6 neuropsychological theory

Donald Olding Hebb. The organization of behavior: A 6 neuropsychological theory. Psychology press, 2005

work page 2005
[3]

Principles of neural science, volume 4

Eric R Kandel, James H Schwartz, Thomas M Jessell, Steven Siegelbaum, A James Hudspeth, Sarah Mack, et al. Principles of neural science, volume 4. McGraw-hill New York, 2000

work page 2000
[4]

Large-scale neural recordings call for new insights to link brain and behavior

Anne E Urai, Brent Doiron, Andrew M Leifer, and Anne K Churchland. Large-scale neural recordings call for new insights to link brain and behavior. Nature neu- roscience, 25(1):11–19, 2022

work page 2022
[5]

Deep physical neural networks trained with backpropagation

Logan G Wright, Tatsuhiro Onodera, Martin M Stein, Tianyu Wang, Darren T Schachter, Zoey Hu, and Peter L McMahon. Deep physical neural networks trained with backpropagation. Nature, 601(7894):549–555, 2022

work page 2022
[6]

The physics of optical computing

Peter L McMahon. The physics of optical computing. Nature Reviews Physics, 5(12):717–734, 2023

work page 2023
[7]

Experimentally realized in situ backpropagation for deep learning in photonic neural networks

Sunil Pai, Zhanghao Sun, Tyler W Hughes, Tae- won Park, Ben Bartlett, Ian AD Williamson, Mom- chil Minkov, Maziyar Milanizadeh, Nathnael Abebe, Francesco Morichetti, et al. Experimentally realized in situ backpropagation for deep learning in photonic neural networks. Science, 380(6643):398–404, 2023

work page 2023
[8]

Neuroscience-inspired artificial intelligence

Demis Hassabis, Dharshan Kumaran, Christopher Sum- merfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017

work page 2017
[9]

A critique of pure learning and what artificial neural networks can learn from animal brains

Anthony M Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature communications, 10(1):3770, 2019

work page 2019
[10]

Bifurcations and loss jumps in rnn training

Lukas Eisenmann, Zahra Monfared, Niclas G¨ oring, and Daniel Durstewitz. Bifurcations and loss jumps in rnn training. Advances in Neural Information Processing Sys- tems, 36, 2024

work page 2024
[11]

Why do recurrent neural net- works suddenly learn? bifurcation mechanisms in neuro- inspired short-term memory tasks

Udith Haputhanthri, Liam Storan, Yiqi Jiang, Adam Shai, Hakki Orhun Akengin, Mark Schnitzer, Fatih Dinc, and Hidenori Tanaka. Why do recurrent neural net- works suddenly learn? bifurcation mechanisms in neuro- inspired short-term memory tasks. In ICML 2024 Work- shop on Mechanistic Interpretability , 2024

work page 2024
[12]

On the dynamics of learning time- aware behavior with recurrent neural networks

Peter DelMastro, Rushiv Arora, Edward Rietman, and Hava T Siegelmann. On the dynamics of learning time- aware behavior with recurrent neural networks. arXiv preprint arXiv:2306.07125, 2023

work page arXiv 2023
[13]

Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adri` a Garriga- Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 , 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[14]

A theory for emer- gence of complex skills in language models

Sanjeev Arora and Anirudh Goyal. A theory for emer- gence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023

work page arXiv 2023
[15]

Skill-mix: A flexible and expandable family of evaluations for ai models

Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown- Cohen, Anirudh Goyal, and Sanjeev Arora. Skill-mix: A flexible and expandable family of evaluations for ai models. arXiv preprint arXiv:2310.17567 , 2023

work page arXiv 2023
[16]

A percolation model of emer- gence: Analyzing transformers trained on a formal lan- guage

Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P Dick, and Hidenori Tanaka. A percolation model of emer- gence: Analyzing transformers trained on a formal lan- guage. arXiv preprint arXiv:2408.12578 , 2024

work page arXiv 2024
[17]

Compositional abilities emerge multiplica- tively: Exploring diffusion models on a synthetic task

Maya Okawa, Ekdeep S Lubana, Robert Dick, and Hide- nori Tanaka. Compositional abilities emerge multiplica- tively: Exploring diffusion models on a synthetic task. Advances in Neural Information Processing Systems , 36, 2023

work page 2023
[18]

Emergent Abilities of Large Language Models

Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Bar- ret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emer- gent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[19]

An empirical analysis of compute- optimal large language model training

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute- optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022

work page 2022
[20]

Grokking: Generaliza- tion beyond overfitting on small algorithmic datasets, 2022

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generaliza- tion beyond overfitting on small algorithmic datasets, 2022

work page 2022
[21]

On the difficulty of training recurrent neural networks

Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International conference on machine learning , pages 1310–1318. Pmlr, 2013

work page 2013
[22]

Bifurcations in the learning of recurrent neural networks 3

Kenji Doya et al. Bifurcations in the learning of recurrent neural networks 3. learning (RTRL), 3:17, 1992

work page 1992
[23]

Qualitatively characterizing neural network optimization problems

Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014
[24]

The mechanistic basis of data depen- dence and abrupt learning in an in-context classification task

Gautam Reddy. The mechanistic basis of data depen- dence and abrupt learning in an in-context classification task. In The Twelfth International Conference on Learn- ing Representations, 2023

work page 2023
[25]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In- context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[26]

Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks

David Sussillo and Omri Barak. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural computation , 25(3):626–649, 2013

work page 2013
[27]

Context-dependent computation by recurrent dynamics in prefrontal cortex

Valerio Mante, David Sussillo, Krishna V Shenoy, and William T Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. nature, 503(7474):78–84, 2013

work page 2013
[28]

Task representations in neural networks trained to perform many cognitive tasks

Guangyu Robert Yang, Madhura R Joglekar, H Francis Song, William T Newsome, and Xiao-Jing Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature neuroscience, 22(2):297– 306, 2019

work page 2019
[29]

The role of population structure in computations through neural dynamics

Alexis Dubreuil, Adrian Valente, Manuel Beiran, Francesca Mastrogiuseppe, and Srdjan Ostojic. The role of population structure in computations through neural dynamics. Nature Neuroscience, pages 1–12, 2022

work page 2022
[30]

Extracting computational mechanisms from neural data using low-rank rnns

Adrian Valente, Jonathan W Pillow, and Srdjan Ostojic. Extracting computational mechanisms from neural data using low-rank rnns. Advances in Neural Information Processing Systems, 35:24072–24086, 2022

work page 2022
[31]

Linking connectivity, dynamics, and computations in low-rank re- current neural networks

Francesca Mastrogiuseppe and Srdjan Ostojic. Linking connectivity, dynamics, and computations in low-rank re- current neural networks. Neuron, 99(3):609–623, 2018

work page 2018
[32]

Shap- ing dynamics with multiple populations in low-rank re- current networks

Manuel Beiran, Alexis Dubreuil, Adrian Valente, Francesca Mastrogiuseppe, and Srdjan Ostojic. Shap- ing dynamics with multiple populations in low-rank re- current networks. Neural Computation, 33(6):1572–1615, 2021

work page 2021
[33]

The inter- 7 play between randomness and structure during learning in rnns

Friedrich Schuessler, Francesca Mastrogiuseppe, Alexis Dubreuil, Srdjan Ostojic, and Omri Barak. The inter- 7 play between randomness and structure during learning in rnns. Advances in neural information processing sys- tems, 33:13352–13362, 2020

work page 2020
[34]

Generalized teacher forcing for learn- ing chaotic dynamics

Florian Hess, Zahra Monfared, Manuel Brenner, and Daniel Durstewitz. Generalized teacher forcing for learn- ing chaotic dynamics. In Proceedings of the 40th In- ternational Conference on Machine Learning , ICML’23. JMLR.org, 2023

work page 2023
[35]

Beyond exploding and vanishing gradi- ents: analysing rnn training using attractors and smooth- ness

Antˆ onio H Ribeiro, Koen Tiels, Luis A Aguirre, and Thomas Sch¨ on. Beyond exploding and vanishing gradi- ents: analysing rnn training using attractors and smooth- ness. In International conference on artificial intelligence and statistics, pages 2370–2380. PMLR, 2020

work page 2020
[36]

Reverse engineer- ing recurrent networks for sentiment classification reveals line attractor dynamics

Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, and David Sussillo. Reverse engineer- ing recurrent networks for sentiment classification reveals line attractor dynamics. Advances in neural information processing systems, 32, 2019

work page 2019
[37]

Universality and in- dividuality in neural dynamics across large populations of recurrent networks

Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, and David Sussillo. Universality and in- dividuality in neural dynamics across large populations of recurrent networks. Advances in neural information processing systems, 32, 2019

work page 2019
[38]

Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineer- ing

Steven H Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineer- ing. CRC press, 2018

work page 2018
[39]

Identifying non- linear dynamical systems with multiple time scales and long-range dependencies

Dominik Schmidt, Georgia Koppe, Zahra Monfared, Max Beutelspacher, and Daniel Durstewitz. Identifying non- linear dynamical systems with multiple time scales and long-range dependencies. In International Conference on Learning Representations, 2021

work page 2021
[40]

Robert Haschke and Jochen J. Steil. Input space bifur- cation manifolds of recurrent neural networks. Neuro- computing, 64:25–38, 2005. Trends in Neurocomputing: 12th European Symposium on Artificial Neural Networks 2004

work page 2005
[41]

The effect of the forget gate on bifurcation boundaries and dynamics in re- current neural networks and its implications for gradient- based optimization

Alexander Rehmer and Andreas Kroll. The effect of the forget gate on bifurcation boundaries and dynamics in re- current neural networks and its implications for gradient- based optimization. In 2022 International Joint Confer- ence on Neural Networks (IJCNN) , pages 01–08, 2022

work page 2022
[42]

Occurrence of multiple attractor bifurcations in the two- dimensional piecewise linear normal form map.Nonlinear Dynamics, 67:293–307, 2012

Viktor Avrutin, Michael Schanz, and Soumitro Banerjee. Occurrence of multiple attractor bifurcations in the two- dimensional piecewise linear normal form map.Nonlinear Dynamics, 67:293–307, 2012

work page 2012
[43]

Dangerous bi- furcation at border collision: When does it occur? Phys- ical Review E—Statistical, Nonlinear, and Soft Matter Physics, 71(5):057202, 2005

Anindita Ganguli and Soumitro Banerjee. Dangerous bi- furcation at border collision: When does it occur? Phys- ical Review E—Statistical, Nonlinear, and Soft Matter Physics, 71(5):057202, 2005

work page 2005
[44]

Monfared and D

Z. Monfared and D. Durstewitz. Existence of n-cycles and border-collision bifurcations in piecewise-linear continu- ous maps with applications to recurrent neural networks. Nonlinear Dynamics, 101(2):1037–1052, Jul 2020

work page 2020
[45]

Fixedpointfinder: A tensorflow toolbox for identifying and characterizing fixed points in recurrent neural networks

Matthew D Golub and David Sussillo. Fixedpointfinder: A tensorflow toolbox for identifying and characterizing fixed points in recurrent neural networks. Journal of Open Source Software, 3(31):1003, 2018

work page 2018
[46]

Visualizing the loss landscape of neural nets

Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, and Tom Goldstein. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018

work page 2018
[47]

Generating coherent patterns of activity from chaotic neural networks

David Sussillo and Larry F Abbott. Generating coherent patterns of activity from chaotic neural networks. Neu- ron, 63(4):544–557, 2009

work page 2009
[48] Fatih Dinc, Adam Shai, Mark Schnitzer, and Hidenori Tanaka. CORNN: Convex optimization of recurrent neural networks for rapid inference of neural dynamics. Advances in Neural Information Processing Systems, 36:51273–51301, 2023.
[49] Scott Kirkpatrick, C Daniel Gelatt Jr, and Mario P Vecchi. Optimization by simulated annealing. Science, 220(4598):671–680, 1983.
[50] Ryan N Gutenkunst, Joshua J Waterfall, Fergal P Casey, Kevin S Brown, Christopher R Myers, and James P Sethna. Universally sloppy parameter sensitivities in systems biology models. PLoS Computational Biology, 3(10):e189, 2007.
[51] Aitor Lewkowycz, Yasaman Bahri, Ethan Dyer, Jascha Sohl-Dickstein, and Guy Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
[52] James Martens et al. Deep learning via Hessian-free optimization. In ICML, volume 27, pages 735–742, 2010.
[53] Neal A Hessler and Allison J Doupe. Social context modulates singing-related neural activity in the songbird forebrain. Nature Neuroscience, 2(3):209–211, 1999.
[54] Mimi H Kao, Brian D Wright, and Allison J Doupe. Neurons in a forebrain nucleus required for vocal plasticity rapidly switch between precise firing and variable bursting depending on social context. Journal of Neuroscience, 28(49):13232–13247, 2008.
[55] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zach DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In 31st Conference on Neural Information Processing Systems, 2017.

[1]

and “pathological curvature” [51]. Thus, our toy model provides a simple and analytically tractable starting point for exploring potential remedies. Moreover, our analyses with rank-one RNNs suggest an alternative, bifurcation-free, mechanism for abrupt learning. By studying the latent circuits during learning, we identified the emergence of ghost poi...

doi:10.5281/zenodo.13686989

[2] Donald Olding Hebb. The organization of behavior: A neuropsychological theory. Psychology Press, 2005.

[3] Eric R Kandel, James H Schwartz, Thomas M Jessell, Steven Siegelbaum, A James Hudspeth, Sarah Mack, et al. Principles of neural science, volume 4. McGraw-Hill New York, 2000.

[4] Anne E Urai, Brent Doiron, Andrew M Leifer, and Anne K Churchland. Large-scale neural recordings call for new insights to link brain and behavior. Nature Neuroscience, 25(1):11–19, 2022.

[5] Logan G Wright, Tatsuhiro Onodera, Martin M Stein, Tianyu Wang, Darren T Schachter, Zoey Hu, and Peter L McMahon. Deep physical neural networks trained with backpropagation. Nature, 601(7894):549–555, 2022.

[6] Peter L McMahon. The physics of optical computing. Nature Reviews Physics, 5(12):717–734, 2023.

[7] Sunil Pai, Zhanghao Sun, Tyler W Hughes, Taewon Park, Ben Bartlett, Ian AD Williamson, Momchil Minkov, Maziyar Milanizadeh, Nathnael Abebe, Francesco Morichetti, et al. Experimentally realized in situ backpropagation for deep learning in photonic neural networks. Science, 380(6643):398–404, 2023.

[8] Demis Hassabis, Dharshan Kumaran, Christopher Summerfield, and Matthew Botvinick. Neuroscience-inspired artificial intelligence. Neuron, 95(2):245–258, 2017.

[9] Anthony M Zador. A critique of pure learning and what artificial neural networks can learn from animal brains. Nature Communications, 10(1):3770, 2019.

[10] Lukas Eisenmann, Zahra Monfared, Niclas Göring, and Daniel Durstewitz. Bifurcations and loss jumps in RNN training. Advances in Neural Information Processing Systems, 36, 2024.

[11] Udith Haputhanthri, Liam Storan, Yiqi Jiang, Adam Shai, Hakki Orhun Akengin, Mark Schnitzer, Fatih Dinc, and Hidenori Tanaka. Why do recurrent neural networks suddenly learn? Bifurcation mechanisms in neuro-inspired short-term memory tasks. In ICML 2024 Workshop on Mechanistic Interpretability, 2024.

[12] Peter DelMastro, Rushiv Arora, Edward Rietman, and Hava T Siegelmann. On the dynamics of learning time-aware behavior with recurrent neural networks. arXiv preprint arXiv:2306.07125, 2023.

[13] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.

[14] Sanjeev Arora and Anirudh Goyal. A theory for emergence of complex skills in language models. arXiv preprint arXiv:2307.15936, 2023.

[15] Dingli Yu, Simran Kaur, Arushi Gupta, Jonah Brown-Cohen, Anirudh Goyal, and Sanjeev Arora. Skill-mix: A flexible and expandable family of evaluations for AI models. arXiv preprint arXiv:2310.17567, 2023.

[16] Ekdeep Singh Lubana, Kyogo Kawaguchi, Robert P Dick, and Hidenori Tanaka. A percolation model of emergence: Analyzing transformers trained on a formal language. arXiv preprint arXiv:2408.12578, 2024.

[17] Maya Okawa, Ekdeep S Lubana, Robert Dick, and Hidenori Tanaka. Compositional abilities emerge multiplicatively: Exploring diffusion models on a synthetic task. Advances in Neural Information Processing Systems, 36, 2023.

[18] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, et al. Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022.

[19] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. An empirical analysis of compute-optimal large language model training. Advances in Neural Information Processing Systems, 35:30016–30030, 2022.

[20] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022.

[21] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, pages 1310–1318. PMLR, 2013.

[22] Kenji Doya et al. Bifurcations in the learning of recurrent neural networks 3. learning (RTRL), 3:17, 1992.

[23] Ian J Goodfellow, Oriol Vinyals, and Andrew M Saxe. Qualitatively characterizing neural network optimization problems. arXiv preprint arXiv:1412.6544, 2014.

[24] Gautam Reddy. The mechanistic basis of data dependence and abrupt learning in an in-context classification task. In The Twelfth International Conference on Learning Representations, 2023.

[25] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022.

[26] David Sussillo and Omri Barak. Opening the black box: low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, 2013.

[27] Valerio Mante, David Sussillo, Krishna V Shenoy, and William T Newsome. Context-dependent computation by recurrent dynamics in prefrontal cortex. Nature, 503(7474):78–84, 2013.

[28] Guangyu Robert Yang, Madhura R Joglekar, H Francis Song, William T Newsome, and Xiao-Jing Wang. Task representations in neural networks trained to perform many cognitive tasks. Nature Neuroscience, 22(2):297–306, 2019.

[29] Alexis Dubreuil, Adrian Valente, Manuel Beiran, Francesca Mastrogiuseppe, and Srdjan Ostojic. The role of population structure in computations through neural dynamics. Nature Neuroscience, pages 1–12, 2022.

[30] Adrian Valente, Jonathan W Pillow, and Srdjan Ostojic. Extracting computational mechanisms from neural data using low-rank RNNs. Advances in Neural Information Processing Systems, 35:24072–24086, 2022.

[31] Francesca Mastrogiuseppe and Srdjan Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks. Neuron, 99(3):609–623, 2018.

[32] Manuel Beiran, Alexis Dubreuil, Adrian Valente, Francesca Mastrogiuseppe, and Srdjan Ostojic. Shaping dynamics with multiple populations in low-rank recurrent networks. Neural Computation, 33(6):1572–1615, 2021.

[33] Friedrich Schuessler, Francesca Mastrogiuseppe, Alexis Dubreuil, Srdjan Ostojic, and Omri Barak. The interplay between randomness and structure during learning in RNNs. Advances in Neural Information Processing Systems, 33:13352–13362, 2020.

[34] Florian Hess, Zahra Monfared, Manuel Brenner, and Daniel Durstewitz. Generalized teacher forcing for learning chaotic dynamics. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023.

[35] Antônio H Ribeiro, Koen Tiels, Luis A Aguirre, and Thomas Schön. Beyond exploding and vanishing gradients: analysing RNN training using attractors and smoothness. In International Conference on Artificial Intelligence and Statistics, pages 2370–2380. PMLR, 2020.

[36] Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, and David Sussillo. Reverse engineering recurrent networks for sentiment classification reveals line attractor dynamics. Advances in Neural Information Processing Systems, 32, 2019.

[37] Niru Maheswaranathan, Alex Williams, Matthew Golub, Surya Ganguli, and David Sussillo. Universality and individuality in neural dynamics across large populations of recurrent networks. Advances in Neural Information Processing Systems, 32, 2019.

[38] Steven H Strogatz. Nonlinear dynamics and chaos: with applications to physics, biology, chemistry, and engineering. CRC Press, 2018.

[39] Dominik Schmidt, Georgia Koppe, Zahra Monfared, Max Beutelspacher, and Daniel Durstewitz. Identifying nonlinear dynamical systems with multiple time scales and long-range dependencies. In International Conference on Learning Representations, 2021.

[40] Robert Haschke and Jochen J. Steil. Input space bifurcation manifolds of recurrent neural networks. Neurocomputing, 64:25–38, 2005. Trends in Neurocomputing: 12th European Symposium on Artificial Neural Networks 2004.

[41] Alexander Rehmer and Andreas Kroll. The effect of the forget gate on bifurcation boundaries and dynamics in recurrent neural networks and its implications for gradient-based optimization. In 2022 International Joint Conference on Neural Networks (IJCNN), pages 01–08, 2022.
