A ghost mechanism: An analytical model of abrupt learning in recurrent networks
Pith reviewed 2026-05-23 06:29 UTC · model grok-4.3
The pith
Recurrent networks exhibit abrupt learning when high-dimensional dynamics near ghost points reduce to a one-dimensional canonical form governed by a single scale parameter.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation. Beyond this rate, learning collapses through two interacting modes: (i) vanishing gradients and (ii) oscillatory gradients near minima. These features can lock the system into high-confidence but incorrect predictions when parameter updates trigger a no-learning zone.
What carries the argument
The ghost mechanism, defined as the transient slowdown of dynamical systems near the remnant of a saddle-node bifurcation, reduced to a one-dimensional canonical form that governs learning via a single scale parameter.
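The slowdown near a saddle-node remnant can be illustrated with the textbook normal form dx/dt = r + x^2 for small r > 0. This is a minimal sketch, assuming that normal form stands in for the paper's canonical form (which is not reproduced on this page); the function name `passage_time`, the integration bounds, and the step size are illustrative choices. The standard closed-form passage time for this system is pi/sqrt(r), so the bottleneck lengthens as r shrinks.

```python
import math

def passage_time(r, x_start=-10.0, x_end=10.0, dt=1e-3):
    """Time for dx/dt = r + x**2 to travel from x_start to x_end (forward Euler)."""
    x, t = x_start, 0.0
    while x < x_end:
        x += dt * (r + x * x)  # slow near x = 0, where the "ghost" of the fixed point sits
        t += dt
    return t

# Compare the simulated passage time with the textbook estimate pi / sqrt(r).
for r in (1e-1, 1e-2, 1e-3):
    t_num = passage_time(r)
    t_theory = math.pi / math.sqrt(r)
    print(f"r={r:g}  simulated={t_num:.1f}  pi/sqrt(r)={t_theory:.1f}")
```

The simulated values fall slightly below pi/sqrt(r) because the integration window is finite, while the closed form assumes travel from minus infinity to plus infinity.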
If this is right
- Learning trajectories in RNNs are shaped by proximity to ghost points in state space.
- A critical learning rate exists; exceeding it triggers collapse via vanishing or oscillatory gradients.
- Increasing the number of trainable ranks stabilizes learning trajectories, which helps the system stay out of no-learning zones.
- Lowering output confidence reduces the depth of no-learning zones and allows escape from incorrect high-confidence states.
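One familiar way a confident but wrong output can stall learning is saturation of a sigmoid readout under squared-error loss. The sketch below is a generic illustration, not the paper's model: the loss, the readout, and the `gain` factor used to mimic "lower output confidence" are all assumptions made here, and the paper's no-learning zone may arise differently.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grad_wrt_logit(z, target):
    """Derivative with respect to z of 0.5 * (sigmoid(z) - target)**2."""
    y = sigmoid(z)
    return (y - target) * y * (1.0 - y)

# Target is 1, but the logit is negative: the more confident the wrong
# answer, the smaller the gradient that could correct it.
target = 1.0
for z in (-1.0, -4.0, -8.0, -16.0):
    print(f"logit={z:6.1f}  output={sigmoid(z):.2e}  "
          f"|grad|={abs(grad_wrt_logit(z, target)):.2e}")

# Hypothetical confidence gain: shrinking the logit moves the unit out of saturation.
z = -16.0
for gain in (1.0, 0.5, 0.25):
    print(f"gain={gain:.2f}  |grad|={abs(grad_wrt_logit(gain * z, target)):.2e}")
```

The gradient here is taken with respect to the logit only; a full treatment would also propagate it through whatever parameters set the gain.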
Where Pith is reading between the lines
- The same reduction might apply to other recurrent or state-space models that develop slow manifolds during training.
- The inverse-power scaling could be tested directly by varying task delay lengths while holding network size fixed.
- The no-learning zone concept suggests that confidence-calibration methods used in other domains may also stabilize RNN training.
Load-bearing premise
The high-dimensional RNN dynamics near ghost points reduce to the stated one-dimensional canonical form without losing the features that produce abrupt learning and gradient collapse.
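A check of this kind of premise could look like the sketch below, which compares a full rank-one network with its one-dimensional latent model and reports the gap. It is not the paper's reduction near a ghost point: it uses the standard rank-one latent reduction (activity along the vector `m`), and the tanh nonlinearity, the choice `n = 2 * m`, the network size, and the off-manifold perturbation `eps` are all hypothetical choices made for illustration.

```python
import math
import random

def latent_rms_error(N=50, eps=0.5, steps=2000, dt=0.01, seed=0):
    """RMS gap between the full network's latent variable and the 1D model."""
    rng = random.Random(seed)
    m = [rng.gauss(0.0, 1.0) for _ in range(N)]
    n = [2.0 * mi for mi in m]            # hypothetical overlap between n and m
    mm = sum(mi * mi for mi in m)

    # Build a perturbation orthogonal to m; eps sets its size.
    perp = [rng.gauss(0.0, 1.0) for _ in range(N)]
    proj = sum(p * mi for p, mi in zip(perp, m)) / mm
    perp = [p - proj * mi for p, mi in zip(perp, m)]

    kappa = 0.1                            # latent state of the 1D model
    x = [kappa * mi + eps * p for mi, p in zip(m, perp)]
    sq_err = 0.0
    for _ in range(steps):
        # Full network: dx/dt = -x + m * (n . tanh(x)) / N
        drive = sum(nj * math.tanh(xj) for nj, xj in zip(n, x)) / N
        x = [xj + dt * (-xj + mj * drive) for xj, mj in zip(x, m)]
        # 1D model: assumes x = kappa * m exactly
        drive_1d = sum(nj * math.tanh(kappa * mj) for nj, mj in zip(n, m)) / N
        kappa += dt * (-kappa + drive_1d)
        # Project the full state onto m and accumulate the squared gap.
        kappa_full = sum(xj * mj for xj, mj in zip(x, m)) / mm
        sq_err += (kappa_full - kappa) ** 2
    return math.sqrt(sq_err / steps)

for eps in (0.0, 0.1, 0.5):
    print(f"eps={eps:.1f}  RMS latent error = {latent_rms_error(eps=eps):.2e}")
```

With `eps = 0` the network starts on the line spanned by `m` and the two descriptions agree up to rounding; a nonzero `eps` shows how transverse activity feeds back into the latent variable before it decays.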
What would settle it
Measure whether the observed critical learning rate in RNN training on working-memory tasks follows the predicted inverse power-law dependence on the task's intrinsic timescale.
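The proposed test amounts to estimating an exponent from pairs of task timescale and measured critical learning rate. A minimal sketch, assuming a pure power law rate = c * T^(-alpha) and a least-squares fit in log-log space; the function name `fit_power_law` and the synthetic data are placeholders invented here, not values from the paper.

```python
import math
import random

def fit_power_law(timescales, critical_rates):
    """Fit log(rate) = log(c) - alpha * log(T) by least squares; return (alpha, c)."""
    xs = [math.log(t) for t in timescales]
    ys = [math.log(r) for r in critical_rates]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, math.exp(intercept)

# Synthetic placeholder data (NOT from the paper): rate = 0.5 * T**-1.5 with 5% log-noise.
random.seed(0)
T = [10.0, 20.0, 40.0, 80.0, 160.0]
eta_c = [0.5 * t ** -1.5 * math.exp(random.gauss(0.0, 0.05)) for t in T]

alpha, c = fit_power_law(T, eta_c)
print(f"estimated exponent alpha = {alpha:.2f}, prefactor c = {c:.2f}")
```

In a real test the timescales would be set by the task delay and the critical rates found by sweeping the learning rate until training collapses; a straight line on log-log axes with a stable slope across network seeds would support the prediction.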
Original abstract
Abrupt learning is a common phenomenon in recurrent neural networks (RNNs) trained on working memory tasks. In such cases, the networks develop transient slow regions in state space that extend the effective timescales of computation. However, the mechanisms driving sudden performance improvements and their causal role remain unclear. To address this gap, we introduce the ghost mechanism, a process by which dynamical systems exhibit transient slowdown near the remnant of a saddle-node bifurcation. By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation. Beyond this rate, learning collapses through two interacting modes: (i) vanishing gradients and (ii) oscillatory gradients near minima. These features can lock the system into high-confidence but incorrect predictions when parameter updates trigger a no-learning zone, a region of parameter space where gradients vanish. We validate these predictions in low-rank RNNs, where ghost points precede abrupt transitions, and further demonstrate their generality in full-rank RNNs trained on canonical working memory tasks. Our theory offers two approaches to address these learning difficulties: increasing trainable ranks stabilizes learning trajectories, while reducing output confidence mitigates entrapment in no-learning zones. Overall, the ghost mechanism reveals how the computational demands of a task constrain the optimization landscape, demonstrating that well-known learning difficulties in RNNs partly arise from the dynamical systems they must learn to implement.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the 'ghost mechanism' to explain abrupt learning in RNNs on working memory tasks. It claims that high-dimensional dynamics near remnants of saddle-node bifurcations (ghost points) reduce to a one-dimensional canonical form controlled by a single scale parameter. From this reduction, the authors analytically derive a critical learning rate that scales as an inverse power law with the learned computation timescale, explain learning collapse via vanishing and oscillatory gradient modes, and identify a no-learning zone. Predictions are validated in low-rank RNNs (where ghost points precede transitions) and extended to full-rank RNNs, with proposed mitigations of increasing trainable rank or reducing output confidence.
Significance. If the 1D reduction is rigorously justified and preserves the essential slow-manifold dynamics, the work would link task computational structure directly to optimization landscape features in RNN training, offering analytical predictions for critical rates and gradient pathologies that are currently observed empirically. The explicit scaling relation, cross-validation in low- and full-rank cases, and concrete mitigation strategies constitute strengths; the approach could inform both theory and practical training heuristics if the central reduction holds without hidden parameter dependence.
major comments (3)
- [Abstract and §2] The reduction of high-dimensional RNN dynamics near ghost points to the stated 1D canonical form (Abstract; §2) is the load-bearing step for all subsequent claims, including the inverse-power-law critical rate, gradient modes, and no-learning zone. The manuscript provides no explicit error analysis, transverse stability conditions, or demonstration that higher-dimensional effects (e.g., rank-dependent transients) remain negligible, leaving open whether the essential slow-manifold features driving abrupt learning are preserved.
- [§3] The critical learning rate is stated to scale as an inverse power law with the timescale of the learned computation and to be controlled by a single scale parameter (Abstract; §3). Because the timescale itself appears to be an input or fitted quantity in the 1D model, the reported scaling risks reducing to a tautological relation rather than an independent prediction; explicit parameter-free derivation or cross-validation against un-fitted simulation data is required to establish independence.
- [Validation in full-rank RNNs] Validation in full-rank RNNs (final results section) demonstrates qualitative agreement with the 1D predictions, but lacks quantitative metrics (e.g., predicted vs. observed transition thresholds or gradient-norm distributions) that would confirm the reduction remains accurate when transverse directions are not artificially constrained by low-rank structure.
minor comments (2)
- [§2] Notation for the single scale parameter and the ghost-point location should be introduced with a clear equation reference at first use to avoid ambiguity when comparing the 1D model to the original RNN vector field.
- [Figures 4-6] Figure captions for the low-rank and full-rank trajectory plots should explicitly state the number of random seeds and the precise definition of 'abrupt transition' used for counting events.
Simulated Author's Rebuttal
We thank the referee for the constructive report. The three major comments identify legitimate gaps in the justification of the 1D reduction, the independence of the scaling prediction, and the quantitative strength of the full-rank validation. We respond to each point below and will incorporate revisions where the manuscript is deficient.
Point-by-point responses
- Referee: [Abstract and §2] The reduction of high-dimensional RNN dynamics near ghost points to the stated 1D canonical form (Abstract; §2) is the load-bearing step for all subsequent claims, including the inverse-power-law critical rate, gradient modes, and no-learning zone. The manuscript provides no explicit error analysis, transverse stability conditions, or demonstration that higher-dimensional effects (e.g., rank-dependent transients) remain negligible, leaving open whether the essential slow-manifold features driving abrupt learning are preserved.
  Authors: We agree that the manuscript lacks an explicit error analysis and transverse stability conditions for the reduction. Section 2 presents the canonical form via the standard local analysis near a saddle-node ghost, but does not quantify the approximation error or prove transverse contraction rates. In revision we will add an appendix deriving the transverse eigenvalue bounds from the low-rank connectivity and reporting numerical L2 trajectory errors between the full network and the 1D projection for ranks 2–10; this will make the domain of validity explicit.
  Revision: yes
- Referee: [§3] The critical learning rate is stated to scale as an inverse power law with the timescale of the learned computation and to be controlled by a single scale parameter (Abstract; §3). Because the timescale itself appears to be an input or fitted quantity in the 1D model, the reported scaling risks reducing to a tautological relation rather than an independent prediction; explicit parameter-free derivation or cross-validation against un-fitted simulation data is required to establish independence.
  Authors: The timescale enters the 1D model as the inverse distance to the ghost point, which is fixed by the task-defined fixed-point locations rather than fitted to learning curves. The inverse-power-law relation for the critical rate follows directly from nondimensionalization of the canonical equation. To demonstrate independence we will add a supplementary figure that extracts the slow-transient duration from untrained networks (no fitting) and overlays the analytically predicted critical rates; agreement without adjustable parameters will be shown for multiple task timescales.
  Revision: yes
- Referee: [Validation in full-rank RNNs] Validation in full-rank RNNs (final results section) demonstrates qualitative agreement with the 1D predictions, but lacks quantitative metrics (e.g., predicted vs. observed transition thresholds or gradient-norm distributions) that would confirm the reduction remains accurate when transverse directions are not artificially constrained by low-rank structure.
  Authors: We accept that the full-rank section provides only qualitative agreement. In the revision we will augment the final results with two quantitative panels: (i) a scatter plot of predicted versus observed critical learning rates across five task timescales, and (ii) overlaid histograms of gradient norms at collapse onset versus the 1D model distribution. These additions will quantify the accuracy of the reduction outside the low-rank constraint.
  Revision: yes
Circularity Check
The critical learning rate scaling reduces to a relation with the model's own single scale parameter.
specific steps
- self-definitional [Abstract]: "By reducing the high-dimensional dynamics near ghost points, we derive a one-dimensional canonical form that analytically captures learning as a process controlled by a single scale parameter. Using this model, we study a form of abrupt learning emerging from ghost points and identify a critical learning rate that scales as an inverse power law with the timescale of the learned computation."
  The single scale parameter is defined as the controller of learning and is identified with the timescale of the computation. The critical rate is then stated to scale as an inverse power law of that timescale; the reported scaling is therefore an algebraic consequence of the model's own definition rather than an emergent or falsifiable prediction.
full rationale
The derivation reduces high-dimensional RNN dynamics to a 1D canonical form controlled by one scale parameter (the timescale of the learned computation). From this form the paper analytically obtains a critical learning rate scaling as an inverse power law with that same timescale. Because the scaling is derived directly from the parameter that defines the reduced model, the reported 'prediction' is forced by construction rather than an independent test of the ghost mechanism. The reduction step itself is presented as the key analytical contribution, but no external benchmark or non-self-referential verification is shown for the power-law relation. This produces partial circularity (score 6) while the broader claims about gradient collapse and no-learning zones remain downstream of the same reduction.
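The concern can be made concrete with the textbook saddle-node normal form, used here as an assumed stand-in for the paper's canonical form (which is not reproduced on this page). The symbols r (scale parameter), T (passage time through the ghost), eta_c (critical learning rate), and the exponent alpha are placeholders; alpha is not a value taken from the paper.

```latex
\dot{x} = r + x^{2}, \quad r > 0
\qquad\Longrightarrow\qquad
T(r) = \int_{-\infty}^{\infty} \frac{dx}{r + x^{2}} = \frac{\pi}{\sqrt{r}} .

\text{If } \eta_c \propto T^{-\alpha} \text{ for some } \alpha > 0,
\text{ then equivalently } \eta_c \propto r^{\alpha/2} .
```

Because T is a fixed function of r in this form, a power law in the timescale and a power law in the scale parameter are the same statement. An independent test would therefore need T measured from network dynamics rather than read off from the reduced model.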
Axiom & Free-Parameter Ledger
free parameters (1)
- single scale parameter
axioms (1)
- domain assumption: High-dimensional RNN dynamics near saddle-node remnants reduce to the stated 1D canonical form
invented entities (1)
- ghost point: no independent evidence