Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks

Lorenzo Livi

arxiv: 2606.29519 · v1 · pith:X6Y25RZOnew · submitted 2026-06-28 · 💻 cs.LG · physics.data-an

Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks

Lorenzo Livi This is my paper

Pith reviewed 2026-06-30 07:31 UTC · model grok-4.3

classification 💻 cs.LG physics.data-an

keywords recurrent neural networkslong-range dependenciestime scalespower-law decaystochastic gradient descentanti-collapsespectral exponent

0 comments

The pith

The asymptotic decay of influence in recurrent networks emerges from the coupling of state and parameter dynamics, not the architecture alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recurrent networks trained with stochastic gradient descent face the problem that influence from past inputs fades with lag, making long-range learning difficult if the fade is exponential. The paper demonstrates that the decay class of this influence envelope is determined by the interaction between the network's state evolution and the parameter updates during training. This coupling can produce either fast exponential forgetting or slow power-law forgetting, depending on whether heavy-tailed fluctuations in the learning dynamics are strong enough to counter the tendency toward short time scales. A spectral exponent β then controls both the spread of time scales and the rate of forgetting. Realizing the slow-decay regime requires that the architecture and optimizer support a sufficiently broad spectrum of time scales.

Core claim

The asymptotic decay class of f(ℓ) is not fixed by the architecture but emerges from the coupling between the state dynamics and parameter dynamics, settling into either a collapsed regime with fast exponential forgetting or an extended anti-collapsed regime with slow power-law forgetting, governed by the spectral exponent β.

What carries the argument

The coarse-grained stochastic process that models the competition between training-driven contraction of time scales and heavy-tailed pushes to longer scales, with the spectral exponent β governing the resulting distribution and forgetting.

Load-bearing premise

The architecture and optimizer together must support generating and maintaining a broad spread of time scales in the network.

What would settle it

Train recurrent networks on tasks requiring long-range dependencies and measure whether the empirical influence envelope f(ℓ) follows an exponential or power-law decay under varying optimizer and architecture choices.

Figures

Figures reproduced from arXiv: 2606.29519 by Lorenzo Livi.

**Figure 2.** Figure 2: Schematic phase diagram in the (ηJ , κ) plane for λ ≥ κ/ηG, the regime in which the existence threshold η ∗ J = 0 (Eq. (97)) and the anti-collapsed regime occupies the entire interior ηJ > 0. The time-scale distribution has a power-law tail p∞(τ ) ∼ c τ −1−β . The red curve is the critical manifold Mc, separating concentrated anti-collapse (β > 1, finite mean, steep power-law envelope) from broad anti-coll… view at source ↗

**Figure 3.** Figure 3: Structural negative control on ConstGate. Even when artificially [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

**Figure 4.** Figure 4: Drift diagnostics. The top panel compares the far-left restoring drift [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗

**Figure 5.** Figure 5: Drift-subtracted far-left log-rate residuals for SharedGate and Di b [PITH_FULL_IMAGE:figures/full_fig_p027_5.png] view at source ↗

**Figure 6.** Figure 6: The panel reports the calibrated slow-mode tail-index [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗

**Figure 7.** Figure 7: Spectrum and envelope diagnostics. The top panel shows the final [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗

read the original abstract

Long-range learning is hard for recurrent networks trained with stochastic gradient descent, because the influence of a past input fades with the lag $\ell$, and if it fades too fast the dependence cannot be learned from finite data. This fade is captured by an envelope $f(\ell)$. An exponential fade makes the data needed to learn a lag-$\ell$ dependence grow exponentially, putting long horizons out of reach; a power-law fade keeps the cost polynomial. We show that the asymptotic decay class of $f(\ell)$ is not fixed by the architecture. Instead, it emerges from the coupling between the state dynamics and parameter dynamics, settling into either a collapsed regime (fast, exponential forgetting) or an extended, anti-collapsed regime (slow, power-law forgetting). The intuition is a competition within these coupled dynamics. Training drives the network's effective time scales toward short ones, while rare, heavy-tailed fluctuations of the learning dynamics push a few of them to very long values. The extended regime survives only when these heavy-tailed pushes are strong enough to balance the pull. We make this mathematically precise with a coarse-grained stochastic process and prove exactly when the extended regime exists. A single exponent, the spectral exponent~$\beta$, then governs both the spread of time scales and how slowly the network forgets. Realizing the regime in practice needs one more ingredient: the joint action of the architecture and the optimizer must be able to hold such a broad spread. A network whose capacity to generate broad time-scale spectra is severely constrained still collapses, even when supplied with strong heavy-tailed forcing. Heavy-tailed fluctuations thus act not as noise to be suppressed, but as the mechanism that sustains long-range learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RNN forgetting decay class emerges from state-parameter coupling via heavy-tailed fluctuations rather than architecture alone, with a spectral exponent linking spread and rate, but the coarse-graining step needs verification.

read the letter

The main takeaway is that the asymptotic form of the influence envelope f(ℓ) in RNNs is not fixed by the network itself. It comes out of the joint evolution of hidden states and parameters under training, landing in either a collapsed exponential regime or an anti-collapsed power-law regime when heavy-tailed fluctuations in the learning dynamics are strong enough to counter the usual pull toward short time scales.

The paper does a clean job of making this competition precise. It reduces the coupled system to a coarse-grained stochastic process and derives an existence condition for the extended regime. The single spectral exponent β is then shown to set both the breadth of time scales and the decay class of f(ℓ). Treating the heavy tails as the sustaining mechanism rather than noise to remove is a useful shift in perspective, and the caveat that architecture plus optimizer must still support a wide spectrum of scales is stated plainly.

The soft spot is the coarse-graining step itself. Averaging the recurrent dynamics and gradient updates into effective variables with assumed heavy-tailed noise risks losing correlations from the hidden-state Jacobian or the precise coupling of updates. If those correlations matter for whether the balance between pull and push actually produces the claimed decay class, the existence proof may not carry over to the original RNN. The abstract-only view leaves the explicit assumptions and derivation uncheckable, so the central claim rests on how faithfully the reduced process represents the full system.

This is for researchers who think about sequence models through coupled dynamical systems rather than pure architecture search. A reader interested in why long-range dependencies sometimes appear despite the usual vanishing-gradient story will get a fresh angle.

Send it to peer review. The framing is substantive enough to merit referee time on the modeling reduction and any empirical tests, even if revisions are likely.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the asymptotic decay class of the influence envelope f(ℓ) in RNNs trained by SGD is not fixed by architecture alone but emerges from the coupling of state and parameter dynamics. This coupling produces either a collapsed regime with exponential forgetting or an anti-collapsed regime with power-law forgetting; the latter is sustained when heavy-tailed fluctuations in the learning dynamics are strong enough to counteract the optimizer's pull toward short time scales. A coarse-grained stochastic process is introduced to derive an exact existence condition for the anti-collapsed regime, controlled by a single spectral exponent β that simultaneously governs the spread of time scales and the forgetting rate. Realization of the regime further requires that the architecture and optimizer together can support a sufficiently broad spectrum of time scales.

Significance. If the central derivation holds, the work supplies a dynamical-systems account of how long-range learning can arise in recurrent networks without being architecturally predetermined, with heavy-tailed optimizer fluctuations acting as the sustaining mechanism rather than noise. The explicit existence proof via the coarse-grained process and the identification of β as the governing parameter constitute a clear strength, offering falsifiable predictions about regime boundaries that could be tested in controlled simulations. This perspective may inform both theoretical analyses of memory in RNNs and practical choices of architectures/optimizers for long-horizon tasks.

major comments (2)

[coarse-grained stochastic process section] The reduction from the coupled RNN state/parameter dynamics to the coarse-grained stochastic process (described in the section introducing the stochastic process and the existence proof) averages recurrent correlations and Jacobian effects into effective time-scale variables whose fluctuations are taken to be heavy-tailed. No explicit error bound or invariance argument is supplied showing that this averaging preserves the competition between the short-scale pull and the heavy-tailed pushes; if recurrent correlations alter the tail exponent, the proven existence condition for the anti-collapsed regime may not transfer to the original system.
[derivation of β and decay class] The spectral exponent β is stated to govern both the spread of time scales and the asymptotic decay class of f(ℓ). The manuscript does not clarify whether this dual role follows from an independent derivation of the decay class or is built into the definition of the coarse-grained process itself; a concrete check (e.g., an auxiliary calculation showing the decay exponent is not tautological in β) would strengthen the claim that β is the single controlling parameter.

minor comments (2)

[Introduction] Notation for the envelope f(ℓ) and the lag variable ℓ is introduced in the abstract but could be restated with a brief equation when first used in the main text to aid readers.
[Abstract] The abstract asserts that 'the joint action of the architecture and the optimizer must be able to hold such a broad spread'; a short forward reference to the section containing the capacity constraint would improve flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments and the positive evaluation of the manuscript's contributions. We address the two major comments below, indicating where revisions will be made to clarify the points raised.

read point-by-point responses

Referee: [coarse-grained stochastic process section] The reduction from the coupled RNN state/parameter dynamics to the coarse-grained stochastic process (described in the section introducing the stochastic process and the existence proof) averages recurrent correlations and Jacobian effects into effective time-scale variables whose fluctuations are taken to be heavy-tailed. No explicit error bound or invariance argument is supplied showing that this averaging preserves the competition between the short-scale pull and the heavy-tailed pushes; if recurrent correlations alter the tail exponent, the proven existence condition for the anti-collapsed regime may not transfer to the original system.

Authors: We agree that an explicit error bound or invariance argument would strengthen the justification for the coarse-graining step. The reduction relies on a separation of timescales between fast state dynamics and slower parameter updates, under which the effective time-scale variables inherit heavy-tailed fluctuations from the SGD noise. While the manuscript validates the regime through simulations of the full system, we will add a dedicated subsection discussing the assumptions of the averaging procedure and arguing that recurrent correlations do not modify the tail exponent β in the relevant parameter regime, thereby preserving the existence condition. This will be a partial revision as a full rigorous bound may require additional technical work beyond the current scope. revision: partial
Referee: [derivation of β and decay class] The spectral exponent β is stated to govern both the spread of time scales and the asymptotic decay class of f(ℓ). The manuscript does not clarify whether this dual role follows from an independent derivation of the decay class or is built into the definition of the coarse-grained process itself; a concrete check (e.g., an auxiliary calculation showing the decay exponent is not tautological in β) would strengthen the claim that β is the single controlling parameter.

Authors: The dual role of β is not tautological. In the coarse-grained process, β parameterizes the heavy-tailed distribution of the time-scale fluctuations. The spread of time scales follows directly from this distribution. The asymptotic decay class of the influence envelope f(ℓ) is then obtained by integrating the contributions from the stationary distribution of time scales, yielding a power-law decay whose exponent is a function of β. This calculation is derived separately from the definition of the process. We will include an auxiliary calculation in the appendix of the revised manuscript that explicitly derives the decay exponent in terms of β, demonstrating its independent emergence. revision: yes

Circularity Check

0 steps flagged

Derivation from coarse-grained stochastic process is self-contained with no load-bearing self-definition or self-citation

full rationale

The paper constructs a coarse-grained stochastic process to capture the competition between training-driven collapse toward short time scales and heavy-tailed fluctuations that sustain long scales, then derives the existence condition for the anti-collapsed regime and the governing role of the spectral exponent β directly from the properties of that process. No quoted step reduces the claimed result to a tautological redefinition of inputs, a fitted parameter renamed as prediction, or a self-citation chain; the abstract and description present the β relation as a derived consequence of the model rather than an input assumption. The requirement that architecture plus optimizer must support broad spectra is stated as an additional practical condition, not smuggled into the mathematical claim. This is the normal case of a model-based derivation that remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; the central claim rests on the existence of a coarse-grained stochastic process that captures the competition between training-driven collapse and heavy-tailed fluctuations.

axioms (1)

domain assumption The coupling between state dynamics and parameter dynamics during SGD can be coarse-grained into a stochastic process whose statistics determine the decay class of f(ℓ).
Invoked to prove exactly when the extended regime exists.

pith-pipeline@v0.9.1-grok · 5834 in / 1378 out tokens · 42173 ms · 2026-06-30T07:31:33.263998+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

68 extracted references · 35 canonical work pages · 5 internal anchors

[1]

Applebaum.Lévy Processes and Stochastic Calculus

D. Applebaum.Lévy Processes and Stochastic Calculus. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2nd edition, 2009. ISBN 9780511809781

2009
[2]

Asabuki and C

T. Asabuki and C. Clopath. Taming the chaos gently: a predictive alignment learning rule in recurrent neural networks.Nature Communications, 16(1):6784, 2025. doi: 10.1038/s41467-025-61309-9

work page doi:10.1038/s41467-025-61309-9 2025
[3]

Barsbey, M

M. Barsbey, M. Sefidgaran, M. A. Erdogdu, G. Richard, and U. Şimşekli. Heavy tails in SGD and compressibilityofoverparametrizedneuralnetworks. InProceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2021. Curran Associates Inc

2021
[4]

Bauer, W

M. Bauer, W. Bialek, C. Goddard, C. M. Holmes, K. Krishnamurthy, S. E. Palmer, R. Pang, D. J. Schwab, and L. Susman. Optimization and variability can coexist.arXiv preprint arXiv:2505.23398, 2025

work page arXiv 2025
[5]

Bertschinger and T

N. Bertschinger and T. Natschläger. Real-time computation at the edge of chaos in recurrent neural networks.Neural Computation, 16(7):1413–1436, 2004. doi: 10.1162/089976604323057443

work page doi:10.1162/089976604323057443 2004
[6]

N. H. Bingham, C. M. Goldie, and J. L. Teugels.Regular Variation, volume 27 ofEncyclopedia of Math- ematics and Its Applications. Cambridge University Press, Cambridge, UK, 1989. ISBN 9780511721434

1989
[7]

Bonnaire, D

T. Bonnaire, D. Ghio, K. Krishnamurthy, F. Mignacco, A. Yamamura, and G. Biroli. High-dimensional non-convex landscapes and gradient descent dynamics.Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104004, Oct 2024. doi: 10.1088/1742-5468/ad2929

work page doi:10.1088/1742-5468/ad2929 2024
[8]

T. Can, K. Krishnamurthy, and D. J. Schwab. Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs. InProceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107, pages 476–511. PMLR, 2020

2020
[9]

P. Carr, H. Geman, D. B. Madan, and M. Yor. The fine structure of asset returns: An empirical investigation.The Journal of Business, 75(2):305–332, Apr 2002. doi: 10.1086/338705

work page doi:10.1086/338705 2002
[10]

A. Ceni, P. Ashwin, L. Livi, and C. Postlethwaite. The echo index and multistability in input-driven recurrent neural networks.Physica D: Nonlinear Phenomena, 412:132609, 2020. doi: 10.1016/j.physd. 2020.132609

work page doi:10.1016/j.physd 2020
[11]

Chemnitz and M

D. Chemnitz and M. Engel. Characterizing dynamical stability of stochastic gradient descent in over- parameterized learning.Journal of Machine Learning Research, 26(134):1–46, 2025

2025
[12]

M. Chen, J. Pennington, and S. S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. InProceedings of the 35th International Conference on Machine Learning, pages 873–882, Stockholm, Sweden, 2018. PMLR

2018
[13]

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724– 1734, Doha, Qatar, 2014

2014
[14]

Cont and P

R. Cont and P. Tankov.Financial Modelling with Jump Processes. Chapman and Hall/CRC, Boca Raton, FL, USA, 2004. ISBN 9781584884132

2004
[15]

Şimşekli, L

U. Şimşekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InProceedings of the 36th International Conference on Machine Learning, pages 5827–5837, Long Beach, California, USA, 2019. PMLR. 59

2019
[16]

de Bruijn.Asymptotic Methods in Analysis

N. de Bruijn.Asymptotic Methods in Analysis. Dover Books on Mathematics. Dover Publications, Garden City, NY, USA, 2010. ISBN 9780486642215

2010
[17]

A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan. A Moment Estimator for the Index of an Extreme- Value Distribution.The Annals of Statistics, 17(4):1833–1855, 1989. doi: 10.1214/aos/1176347397

work page doi:10.1214/aos/1176347397 1989
[18]

Durrett.Probability: Theory and Examples

R. Durrett.Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Math- ematics. Cambridge University Press, Cambridge, UK, 5th edition, 2019. ISBN 9781108591034

2019
[19]

Engelken, F

R. Engelken, F. Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.Physical Review Research, 5(4):043044, 2023. doi: 10.1103/PhysRevResearch.5.043044

work page doi:10.1103/physrevresearch.5.043044 2023
[20]

N. B. Erichson, S. H. Lim, and M. W. Mahoney. Gated recurrent neural networks with weighted time-delay feedback. InProceedings of the 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 3646–3654, Mai Khao, Thailand, 2025. PMLR

2025
[21]

Feller.An Introduction to Probability Theory and Its Applications

W. Feller.An Introduction to Probability Theory and Its Applications. Vol. II. John Wiley & Sons, New York, USA, 2nd edition, 1971. ISBN 9780471257097

1971
[22]

Gerbelot, E

C. Gerbelot, E. Troiani, F. Mignacco, F. Krzakala, and L. Zdeborová. Rigorous dynamical mean-field theory for stochastic gradient descent methods.SIAM Journal on Mathematics of Data Science, 6(2): 400–427, 2024. doi: 10.1137/23M1594388

work page doi:10.1137/23m1594388 2024
[23]

Gloter, D

A. Gloter, D. Loukianova, and H. Mai. Jump filtering and efficient drift estimation for Lévy-driven SDEs.The Annals of Statistics, 46(4):1445–1480, 2018. doi: 10.1214/17-AOS1591

work page doi:10.1214/17-aos1591 2018
[24]

Greff, R

K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey.IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017. doi: 10.1109/TNNLS.2016.2582924

work page doi:10.1109/tnnls.2016.2582924 2017
[25]

Gürbüzbalaban, U

M. Gürbüzbalaban, U. Şimşekli, and L. Zhu. The heavy-tail phenomenon in SGD. InProceedings of the 38th International Conference on Machine Learning, volume 139, pages 3964–3975. PMLR, 2021

2021
[26]

Hesse and T

J. Hesse and T. Gross. Self-organized criticality as a fundamental property of neural systems.Frontiers in Systems Neuroscience, 8:166, 2014. doi: 10.3389/fnsys.2014.00166

work page doi:10.3389/fnsys.2014.00166 2014
[27]

B. M. Hill. A simple general approach to inference about the tail of a distribution.The Annals of Statistics, 3(5):1163–1174, 1975. doi: 10.1214/aos/1176343247

work page doi:10.1214/aos/1176343247 1975
[28]

Hodgkinson and M

L. Hodgkinson and M. W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4262–4274. PMLR, 2021

2021
[29]

Imkeller and I

P. Imkeller and I. Pavlyukevich. First exit times of SDEs driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006. doi: 10.1016/j.spa.2005.11.006

work page doi:10.1016/j.spa.2005.11.006 2006
[30]

Kadmon and H

J. Kadmon and H. Sompolinsky. Transition to chaos in random neuronal networks.Physical Review X, 5:041030, Nov 2015. doi: 10.1103/PhysRevX.5.041030

work page doi:10.1103/physrevx.5.041030 2015
[31]

T. D. Kim, T. Can, and K. Krishnamurthy. Trainability, expressivity and interpretability in gated neural ODEs. InProceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 2023. PMLR

2023
[32]

Krishnamurthy, T

K. Krishnamurthy, T. Can, and D. J. Schwab. Theory of gating in recurrent neural networks.Physical Review X, 12:011011, Jan 2022. doi: 10.1103/PhysRevX.12.011011

work page doi:10.1103/physrevx.12.011011 2022
[33]

Küchler and S

U. Küchler and S. Tappe. Tempered stable distributions and processes.Stochastic Processes and their Applications, 123(12):4256–4293, 2013. doi: 10.1016/j.spa.2013.06.012. 60

work page doi:10.1016/j.spa.2013.06.012 2013
[34]

Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. InProceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101–2110, Sydney, Australia, 2017. PMLR

2017
[35]

L. Livi. Time-scale coupling between states and parameters in recurrent neural networks.arXiv preprint arXiv:2508.12121, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[36]

L. Livi. Learnability window in gated recurrent neural networks.arXiv preprint arXiv:2512.05790, 2025

work page arXiv 2025
[37]

L. Livi, F. M. Bianchi, and C. Alippi. Determination of the edge of criticality in echo state networks through Fisher information maximization.IEEE Transactions on Neural Networks and Learning Sys- tems, 29(3):706–717, 2018. doi: 10.1109/TNNLS.2016.2644268

work page doi:10.1109/tnnls.2016.2644268 2018
[38]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.International Conference on Learning Representations, 2019

2019
[39]

Mandt, M

S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian infer- ence.Journal of Machine Learning Research, 18(134):1–35, 2017

2017
[40]

C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22(165): 1–73, 2021

2021
[41]

Massar and S

M. Massar and S. Massar. Mean-field theory of echo state networks.Physical Review E, 87(4):042809,
[42]

doi: 10.1103/PhysRevE.87.042809

work page doi:10.1103/physreve.87.042809
[43]

Mastrogiuseppe and S

F. Mastrogiuseppe and S. Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks.Neuron, 99(3):609–623, 2018. doi: 10.1016/j.neuron.2018.07.003

work page doi:10.1016/j.neuron.2018.07.003 2018
[44]

M. M. Meerschaert, Y. Zhang, and B. Baeumer. Tempered anomalous diffusion in heterogeneous sys- tems.Geophysical Research Letters, 35(17):L17403, 2008. doi: 10.1029/2008GL034899

work page doi:10.1029/2008gl034899 2008
[45]

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018. doi: 10.1073/ pnas.1806579115

2018
[46]

Nguyen and H

P.-M. Nguyen and H. T. Pham. A rigorous framework for the mean field limit of multilayer neural networks.Mathematical Statistics and Learning, 6(3):201–357, 2023. doi: 10.4171/MSL/42

work page doi:10.4171/msl/42 2023
[47]

T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, and G. Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. InProceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc

2019
[48]

B. R. Olsen, S. Fatehmanesh, F. Xiao, A. Kumarappan, and A. Gajula. From SGD to spectra: A theory of neural network weight dynamics.arXiv preprint arXiv:2507.12709, 2025

work page arXiv 2025
[49]

Pascanu, T

R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 1310–1318, Atlanta, Georgia, USA, 2013. PMLR

2013
[50]

Poole, S

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. InProceedings of the 30th Advances in Neural Information Processing Systems, volume 29, Barcelona, Spain, 2016. Curran Associates, Inc

2016
[51]

Rosiński

J. Rosiński. Tempering stable processes.Stochastic Processes and their Applications, 117(6):677–707,
[52]

doi: 10.1016/j.spa.2006.10.003. 61

work page doi:10.1016/j.spa.2006.10.003 2006
[53]

S. Ruder. An overview of gradient descent optimization algorithms.arXiv preprint arXiv:1609.04747, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[54]

Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics

K. Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 1999. ISBN 978-1107656499

1999
[55]

R. L. Schilling. An introduction to Lévy and feller processes. advanced courses in mathematics-CRM Barcelona 2014.arXiv preprint arXiv:1603.00251, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2014
[56]

Sieber, C

J. Sieber, C. Amo Alonso, A. Didier, M. N. Zeilinger, and A. Orvieto. Understanding the differences in foundation models: Attention, state space models, and recurrent neural networks. InAdvances in Neural Information Processing Systems, volume 37, Vancouver, BC, Canada, 2024

2024
[57]

There Will Be a Scientific Theory of Deep Learning

J. Simon, D. Kunin, A. Atanasov, E. Boix-Adserà, B. Bordelon, J. Cohen, N. Ghosh, F. Guth, A. Jacot, M. Kamb, D. Karkada, E. J. Michaud, B. Ottlik, and J. Turnbull. There will be a scientific theory of deep learning.arXiv preprint arXiv:2604.21691, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[58]

Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022

J.SirignanoandK.Spiliopoulos. Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022. doi: 10.1287/moor.2020.1118

work page doi:10.1287/moor.2020.1118 2022
[59]

S. L. Smith, B. Dherin, D. G. T. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. InInternational Conference on Learning Representations, 2021

2021
[60]

Stanislavsky, K

A. Stanislavsky, K. Weron, and A. Weron. Diffusion and relaxation controlled by temperedα-stable processes.Physical Review E, 78(5):051106, 2008. doi: 10.1103/PhysRevE.78.051106

work page doi:10.1103/physreve.78.051106 2008
[61]

Sussillo and O

D. Sussillo and O. Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks.Neural Computation, 25(3):626–649, 2013. doi: 10.1162/NECO_a_00409

work page doi:10.1162/neco_a_00409 2013
[62]

Tallec and Y

C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? InInternational Conference on Learning Representations, Vancouver, BC, Canada, 2018

2018
[63]

Cambridge University Press, Cambridge, UK, 2014

M.Viana.Lectures on Lyapunov Exponents, volume145ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2014. ISBN 9781139976602

2014
[64]

Z. Xie, I. Sato, and M. Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InInternational Conference on Learning Representations, 2021

2021
[65]

S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent.arXiv preprint arXiv:1810.00004, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[66]

Zhang and C

X.-Y. Zhang and C. Tang. Heavy-tailed update distributions arise from information-driven self- organization in nonequilibrium learning.Proceedings of the National Academy of Sciences, 122(51): e2523012122, 2025. doi: 10.1073/pnas.2523012122

work page doi:10.1073/pnas.2523012122 2025
[67]

Ziyin, H

L. Ziyin, H. Li, and M. Ueda. Noise balance and stationary distribution of stochastic gradient descent. Physical Review E, 111(6):065303, 2025. doi: 10.1103/PhysRevE.111.065303

work page doi:10.1103/physreve.111.065303 2025
[68]

Zucchet and A

N. Zucchet and A. Orvieto. Recurrent neural networks: vanishing and exploding gradients are not the end of the story. InProceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2024. Curran Associates Inc. 62

2024

[1] [1]

Applebaum.Lévy Processes and Stochastic Calculus

D. Applebaum.Lévy Processes and Stochastic Calculus. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2nd edition, 2009. ISBN 9780511809781

2009

[2] [2]

Asabuki and C

T. Asabuki and C. Clopath. Taming the chaos gently: a predictive alignment learning rule in recurrent neural networks.Nature Communications, 16(1):6784, 2025. doi: 10.1038/s41467-025-61309-9

work page doi:10.1038/s41467-025-61309-9 2025

[3] [3]

Barsbey, M

M. Barsbey, M. Sefidgaran, M. A. Erdogdu, G. Richard, and U. Şimşekli. Heavy tails in SGD and compressibilityofoverparametrizedneuralnetworks. InProceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2021. Curran Associates Inc

2021

[4] [4]

Bauer, W

M. Bauer, W. Bialek, C. Goddard, C. M. Holmes, K. Krishnamurthy, S. E. Palmer, R. Pang, D. J. Schwab, and L. Susman. Optimization and variability can coexist.arXiv preprint arXiv:2505.23398, 2025

work page arXiv 2025

[5] [5]

Bertschinger and T

N. Bertschinger and T. Natschläger. Real-time computation at the edge of chaos in recurrent neural networks.Neural Computation, 16(7):1413–1436, 2004. doi: 10.1162/089976604323057443

work page doi:10.1162/089976604323057443 2004

[6] [6]

N. H. Bingham, C. M. Goldie, and J. L. Teugels.Regular Variation, volume 27 ofEncyclopedia of Math- ematics and Its Applications. Cambridge University Press, Cambridge, UK, 1989. ISBN 9780511721434

1989

[7] [7]

Bonnaire, D

T. Bonnaire, D. Ghio, K. Krishnamurthy, F. Mignacco, A. Yamamura, and G. Biroli. High-dimensional non-convex landscapes and gradient descent dynamics.Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104004, Oct 2024. doi: 10.1088/1742-5468/ad2929

work page doi:10.1088/1742-5468/ad2929 2024

[8] [8]

T. Can, K. Krishnamurthy, and D. J. Schwab. Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs. InProceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107, pages 476–511. PMLR, 2020

2020

[9] [9]

P. Carr, H. Geman, D. B. Madan, and M. Yor. The fine structure of asset returns: An empirical investigation.The Journal of Business, 75(2):305–332, Apr 2002. doi: 10.1086/338705

work page doi:10.1086/338705 2002

[10] [10]

A. Ceni, P. Ashwin, L. Livi, and C. Postlethwaite. The echo index and multistability in input-driven recurrent neural networks.Physica D: Nonlinear Phenomena, 412:132609, 2020. doi: 10.1016/j.physd. 2020.132609

work page doi:10.1016/j.physd 2020

[11] [11]

Chemnitz and M

D. Chemnitz and M. Engel. Characterizing dynamical stability of stochastic gradient descent in over- parameterized learning.Journal of Machine Learning Research, 26(134):1–46, 2025

2025

[12] [12]

M. Chen, J. Pennington, and S. S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. InProceedings of the 35th International Conference on Machine Learning, pages 873–882, Stockholm, Sweden, 2018. PMLR

2018

[13] [13]

K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724– 1734, Doha, Qatar, 2014

2014

[14] [14]

Cont and P

R. Cont and P. Tankov.Financial Modelling with Jump Processes. Chapman and Hall/CRC, Boca Raton, FL, USA, 2004. ISBN 9781584884132

2004

[15] [15]

Şimşekli, L

U. Şimşekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InProceedings of the 36th International Conference on Machine Learning, pages 5827–5837, Long Beach, California, USA, 2019. PMLR. 59

2019

[16] [16]

de Bruijn.Asymptotic Methods in Analysis

N. de Bruijn.Asymptotic Methods in Analysis. Dover Books on Mathematics. Dover Publications, Garden City, NY, USA, 2010. ISBN 9780486642215

2010

[17] [17]

A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan. A Moment Estimator for the Index of an Extreme- Value Distribution.The Annals of Statistics, 17(4):1833–1855, 1989. doi: 10.1214/aos/1176347397

work page doi:10.1214/aos/1176347397 1989

[18] [18]

Durrett.Probability: Theory and Examples

R. Durrett.Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Math- ematics. Cambridge University Press, Cambridge, UK, 5th edition, 2019. ISBN 9781108591034

2019

[19] [19]

Engelken, F

R. Engelken, F. Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.Physical Review Research, 5(4):043044, 2023. doi: 10.1103/PhysRevResearch.5.043044

work page doi:10.1103/physrevresearch.5.043044 2023

[20] [20]

N. B. Erichson, S. H. Lim, and M. W. Mahoney. Gated recurrent neural networks with weighted time-delay feedback. InProceedings of the 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 3646–3654, Mai Khao, Thailand, 2025. PMLR

2025

[21] [21]

Feller.An Introduction to Probability Theory and Its Applications

W. Feller.An Introduction to Probability Theory and Its Applications. Vol. II. John Wiley & Sons, New York, USA, 2nd edition, 1971. ISBN 9780471257097

1971

[22] [22]

Gerbelot, E

C. Gerbelot, E. Troiani, F. Mignacco, F. Krzakala, and L. Zdeborová. Rigorous dynamical mean-field theory for stochastic gradient descent methods.SIAM Journal on Mathematics of Data Science, 6(2): 400–427, 2024. doi: 10.1137/23M1594388

work page doi:10.1137/23m1594388 2024

[23] [23]

Gloter, D

A. Gloter, D. Loukianova, and H. Mai. Jump filtering and efficient drift estimation for Lévy-driven SDEs.The Annals of Statistics, 46(4):1445–1480, 2018. doi: 10.1214/17-AOS1591

work page doi:10.1214/17-aos1591 2018

[24] [24]

Greff, R

K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey.IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017. doi: 10.1109/TNNLS.2016.2582924

work page doi:10.1109/tnnls.2016.2582924 2017

[25] [25]

Gürbüzbalaban, U

M. Gürbüzbalaban, U. Şimşekli, and L. Zhu. The heavy-tail phenomenon in SGD. InProceedings of the 38th International Conference on Machine Learning, volume 139, pages 3964–3975. PMLR, 2021

2021

[26] [26]

Hesse and T

J. Hesse and T. Gross. Self-organized criticality as a fundamental property of neural systems.Frontiers in Systems Neuroscience, 8:166, 2014. doi: 10.3389/fnsys.2014.00166

work page doi:10.3389/fnsys.2014.00166 2014

[27] [27]

B. M. Hill. A simple general approach to inference about the tail of a distribution.The Annals of Statistics, 3(5):1163–1174, 1975. doi: 10.1214/aos/1176343247

work page doi:10.1214/aos/1176343247 1975

[28] [28]

Hodgkinson and M

L. Hodgkinson and M. W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4262–4274. PMLR, 2021

2021

[29] [29]

Imkeller and I

P. Imkeller and I. Pavlyukevich. First exit times of SDEs driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006. doi: 10.1016/j.spa.2005.11.006

work page doi:10.1016/j.spa.2005.11.006 2006

[30] [30]

Kadmon and H

J. Kadmon and H. Sompolinsky. Transition to chaos in random neuronal networks.Physical Review X, 5:041030, Nov 2015. doi: 10.1103/PhysRevX.5.041030

work page doi:10.1103/physrevx.5.041030 2015

[31] [31]

T. D. Kim, T. Can, and K. Krishnamurthy. Trainability, expressivity and interpretability in gated neural ODEs. InProceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 2023. PMLR

2023

[32] [32]

Krishnamurthy, T

K. Krishnamurthy, T. Can, and D. J. Schwab. Theory of gating in recurrent neural networks.Physical Review X, 12:011011, Jan 2022. doi: 10.1103/PhysRevX.12.011011

work page doi:10.1103/physrevx.12.011011 2022

[33] [33]

Küchler and S

U. Küchler and S. Tappe. Tempered stable distributions and processes.Stochastic Processes and their Applications, 123(12):4256–4293, 2013. doi: 10.1016/j.spa.2013.06.012. 60

work page doi:10.1016/j.spa.2013.06.012 2013

[34] [34]

Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. InProceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101–2110, Sydney, Australia, 2017. PMLR

2017

[35] [35]

L. Livi. Time-scale coupling between states and parameters in recurrent neural networks.arXiv preprint arXiv:2508.12121, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[36] [36]

L. Livi. Learnability window in gated recurrent neural networks.arXiv preprint arXiv:2512.05790, 2025

work page arXiv 2025

[37] [37]

L. Livi, F. M. Bianchi, and C. Alippi. Determination of the edge of criticality in echo state networks through Fisher information maximization.IEEE Transactions on Neural Networks and Learning Sys- tems, 29(3):706–717, 2018. doi: 10.1109/TNNLS.2016.2644268

work page doi:10.1109/tnnls.2016.2644268 2018

[38] [38]

Loshchilov and F

I. Loshchilov and F. Hutter. Decoupled weight decay regularization.International Conference on Learning Representations, 2019

2019

[39] [39]

Mandt, M

S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian infer- ence.Journal of Machine Learning Research, 18(134):1–35, 2017

2017

[40] [40]

C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22(165): 1–73, 2021

2021

[41] [41]

Massar and S

M. Massar and S. Massar. Mean-field theory of echo state networks.Physical Review E, 87(4):042809,

[42] [42]

doi: 10.1103/PhysRevE.87.042809

work page doi:10.1103/physreve.87.042809

[43] [43]

Mastrogiuseppe and S

F. Mastrogiuseppe and S. Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks.Neuron, 99(3):609–623, 2018. doi: 10.1016/j.neuron.2018.07.003

work page doi:10.1016/j.neuron.2018.07.003 2018

[44] [44]

M. M. Meerschaert, Y. Zhang, and B. Baeumer. Tempered anomalous diffusion in heterogeneous sys- tems.Geophysical Research Letters, 35(17):L17403, 2008. doi: 10.1029/2008GL034899

work page doi:10.1029/2008gl034899 2008

[45] [45]

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018. doi: 10.1073/ pnas.1806579115

2018

[46] [46]

Nguyen and H

P.-M. Nguyen and H. T. Pham. A rigorous framework for the mean field limit of multilayer neural networks.Mathematical Statistics and Learning, 6(3):201–357, 2023. doi: 10.4171/MSL/42

work page doi:10.4171/msl/42 2023

[47] [47]

T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, and G. Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. InProceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc

2019

[48] [48]

B. R. Olsen, S. Fatehmanesh, F. Xiao, A. Kumarappan, and A. Gajula. From SGD to spectra: A theory of neural network weight dynamics.arXiv preprint arXiv:2507.12709, 2025

work page arXiv 2025

[49] [49]

Pascanu, T

R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 1310–1318, Atlanta, Georgia, USA, 2013. PMLR

2013

[50] [50]

Poole, S

B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. InProceedings of the 30th Advances in Neural Information Processing Systems, volume 29, Barcelona, Spain, 2016. Curran Associates, Inc

2016

[51] [51]

Rosiński

J. Rosiński. Tempering stable processes.Stochastic Processes and their Applications, 117(6):677–707,

[52] [52]

doi: 10.1016/j.spa.2006.10.003. 61

work page doi:10.1016/j.spa.2006.10.003 2006

[53] [53]

S. Ruder. An overview of gradient descent optimization algorithms.arXiv preprint arXiv:1609.04747, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[54] [54]

Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics

K. Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 1999. ISBN 978-1107656499

1999

[55] [55]

R. L. Schilling. An introduction to Lévy and feller processes. advanced courses in mathematics-CRM Barcelona 2014.arXiv preprint arXiv:1603.00251, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2014

[56] [56]

Sieber, C

J. Sieber, C. Amo Alonso, A. Didier, M. N. Zeilinger, and A. Orvieto. Understanding the differences in foundation models: Attention, state space models, and recurrent neural networks. InAdvances in Neural Information Processing Systems, volume 37, Vancouver, BC, Canada, 2024

2024

[57] [57]

There Will Be a Scientific Theory of Deep Learning

J. Simon, D. Kunin, A. Atanasov, E. Boix-Adserà, B. Bordelon, J. Cohen, N. Ghosh, F. Guth, A. Jacot, M. Kamb, D. Karkada, E. J. Michaud, B. Ottlik, and J. Turnbull. There will be a scientific theory of deep learning.arXiv preprint arXiv:2604.21691, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[58] [58]

Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022

J.SirignanoandK.Spiliopoulos. Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022. doi: 10.1287/moor.2020.1118

work page doi:10.1287/moor.2020.1118 2022

[59] [59]

S. L. Smith, B. Dherin, D. G. T. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. InInternational Conference on Learning Representations, 2021

2021

[60] [60]

Stanislavsky, K

A. Stanislavsky, K. Weron, and A. Weron. Diffusion and relaxation controlled by temperedα-stable processes.Physical Review E, 78(5):051106, 2008. doi: 10.1103/PhysRevE.78.051106

work page doi:10.1103/physreve.78.051106 2008

[61] [61]

Sussillo and O

D. Sussillo and O. Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks.Neural Computation, 25(3):626–649, 2013. doi: 10.1162/NECO_a_00409

work page doi:10.1162/neco_a_00409 2013

[62] [62]

Tallec and Y

C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? InInternational Conference on Learning Representations, Vancouver, BC, Canada, 2018

2018

[63] [63]

Cambridge University Press, Cambridge, UK, 2014

M.Viana.Lectures on Lyapunov Exponents, volume145ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2014. ISBN 9781139976602

2014

[64] [64]

Z. Xie, I. Sato, and M. Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InInternational Conference on Learning Representations, 2021

2021

[65] [65]

S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent.arXiv preprint arXiv:1810.00004, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[66] [66]

Zhang and C

X.-Y. Zhang and C. Tang. Heavy-tailed update distributions arise from information-driven self- organization in nonequilibrium learning.Proceedings of the National Academy of Sciences, 122(51): e2523012122, 2025. doi: 10.1073/pnas.2523012122

work page doi:10.1073/pnas.2523012122 2025

[67] [67]

Ziyin, H

L. Ziyin, H. Li, and M. Ueda. Noise balance and stationary distribution of stochastic gradient descent. Physical Review E, 111(6):065303, 2025. doi: 10.1103/PhysRevE.111.065303

work page doi:10.1103/physreve.111.065303 2025

[68] [68]

Zucchet and A

N. Zucchet and A. Orvieto. Recurrent neural networks: vanishing and exploding gradients are not the end of the story. InProceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2024. Curran Associates Inc. 62

2024