pith. sign in

arxiv: 2606.29519 · v1 · pith:X6Y25RZOnew · submitted 2026-06-28 · 💻 cs.LG · physics.data-an

Anti-Collapse Dynamics and the Emergence of Multi-Time-Scale Learning in Recurrent Neural Networks

Pith reviewed 2026-06-30 07:31 UTC · model grok-4.3

classification 💻 cs.LG physics.data-an
keywords recurrent neural networkslong-range dependenciestime scalespower-law decaystochastic gradient descentanti-collapsespectral exponent
0
0 comments X

The pith

The asymptotic decay of influence in recurrent networks emerges from the coupling of state and parameter dynamics, not the architecture alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Recurrent networks trained with stochastic gradient descent face the problem that influence from past inputs fades with lag, making long-range learning difficult if the fade is exponential. The paper demonstrates that the decay class of this influence envelope is determined by the interaction between the network's state evolution and the parameter updates during training. This coupling can produce either fast exponential forgetting or slow power-law forgetting, depending on whether heavy-tailed fluctuations in the learning dynamics are strong enough to counter the tendency toward short time scales. A spectral exponent β then controls both the spread of time scales and the rate of forgetting. Realizing the slow-decay regime requires that the architecture and optimizer support a sufficiently broad spectrum of time scales.

Core claim

The asymptotic decay class of f(ℓ) is not fixed by the architecture but emerges from the coupling between the state dynamics and parameter dynamics, settling into either a collapsed regime with fast exponential forgetting or an extended anti-collapsed regime with slow power-law forgetting, governed by the spectral exponent β.

What carries the argument

The coarse-grained stochastic process that models the competition between training-driven contraction of time scales and heavy-tailed pushes to longer scales, with the spectral exponent β governing the resulting distribution and forgetting.

Load-bearing premise

The architecture and optimizer together must support generating and maintaining a broad spread of time scales in the network.

What would settle it

Train recurrent networks on tasks requiring long-range dependencies and measure whether the empirical influence envelope f(ℓ) follows an exponential or power-law decay under varying optimizer and architecture choices.

Figures

Figures reproduced from arXiv: 2606.29519 by Lorenzo Livi.

Figure 1
Figure 1. Figure 1: Phenomenological summary of the regimes predicted by the theory. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Schematic phase diagram in the (ηJ , κ) plane for λ ≥ κ/ηG, the regime in which the existence threshold η ∗ J = 0 (Eq. (97)) and the anti-collapsed regime occupies the entire interior ηJ > 0. The time-scale distribution has a power-law tail p∞(τ ) ∼ c τ −1−β . The red curve is the critical manifold Mc, separating concentrated anti-collapse (β > 1, finite mean, steep power-law envelope) from broad anti-coll… view at source ↗
Figure 3
Figure 3. Figure 3: Structural negative control on ConstGate. Even when artificially [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Drift diagnostics. The top panel compares the far-left restoring drift [PITH_FULL_IMAGE:figures/full_fig_p026_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Drift-subtracted far-left log-rate residuals for SharedGate and Di b [PITH_FULL_IMAGE:figures/full_fig_p027_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The panel reports the calibrated slow-mode tail-index [PITH_FULL_IMAGE:figures/full_fig_p028_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Spectrum and envelope diagnostics. The top panel shows the final [PITH_FULL_IMAGE:figures/full_fig_p028_7.png] view at source ↗
read the original abstract

Long-range learning is hard for recurrent networks trained with stochastic gradient descent, because the influence of a past input fades with the lag $\ell$, and if it fades too fast the dependence cannot be learned from finite data. This fade is captured by an envelope $f(\ell)$. An exponential fade makes the data needed to learn a lag-$\ell$ dependence grow exponentially, putting long horizons out of reach; a power-law fade keeps the cost polynomial. We show that the asymptotic decay class of $f(\ell)$ is not fixed by the architecture. Instead, it emerges from the coupling between the state dynamics and parameter dynamics, settling into either a collapsed regime (fast, exponential forgetting) or an extended, anti-collapsed regime (slow, power-law forgetting). The intuition is a competition within these coupled dynamics. Training drives the network's effective time scales toward short ones, while rare, heavy-tailed fluctuations of the learning dynamics push a few of them to very long values. The extended regime survives only when these heavy-tailed pushes are strong enough to balance the pull. We make this mathematically precise with a coarse-grained stochastic process and prove exactly when the extended regime exists. A single exponent, the spectral exponent~$\beta$, then governs both the spread of time scales and how slowly the network forgets. Realizing the regime in practice needs one more ingredient: the joint action of the architecture and the optimizer must be able to hold such a broad spread. A network whose capacity to generate broad time-scale spectra is severely constrained still collapses, even when supplied with strong heavy-tailed forcing. Heavy-tailed fluctuations thus act not as noise to be suppressed, but as the mechanism that sustains long-range learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that the asymptotic decay class of the influence envelope f(ℓ) in RNNs trained by SGD is not fixed by architecture alone but emerges from the coupling of state and parameter dynamics. This coupling produces either a collapsed regime with exponential forgetting or an anti-collapsed regime with power-law forgetting; the latter is sustained when heavy-tailed fluctuations in the learning dynamics are strong enough to counteract the optimizer's pull toward short time scales. A coarse-grained stochastic process is introduced to derive an exact existence condition for the anti-collapsed regime, controlled by a single spectral exponent β that simultaneously governs the spread of time scales and the forgetting rate. Realization of the regime further requires that the architecture and optimizer together can support a sufficiently broad spectrum of time scales.

Significance. If the central derivation holds, the work supplies a dynamical-systems account of how long-range learning can arise in recurrent networks without being architecturally predetermined, with heavy-tailed optimizer fluctuations acting as the sustaining mechanism rather than noise. The explicit existence proof via the coarse-grained process and the identification of β as the governing parameter constitute a clear strength, offering falsifiable predictions about regime boundaries that could be tested in controlled simulations. This perspective may inform both theoretical analyses of memory in RNNs and practical choices of architectures/optimizers for long-horizon tasks.

major comments (2)
  1. [coarse-grained stochastic process section] The reduction from the coupled RNN state/parameter dynamics to the coarse-grained stochastic process (described in the section introducing the stochastic process and the existence proof) averages recurrent correlations and Jacobian effects into effective time-scale variables whose fluctuations are taken to be heavy-tailed. No explicit error bound or invariance argument is supplied showing that this averaging preserves the competition between the short-scale pull and the heavy-tailed pushes; if recurrent correlations alter the tail exponent, the proven existence condition for the anti-collapsed regime may not transfer to the original system.
  2. [derivation of β and decay class] The spectral exponent β is stated to govern both the spread of time scales and the asymptotic decay class of f(ℓ). The manuscript does not clarify whether this dual role follows from an independent derivation of the decay class or is built into the definition of the coarse-grained process itself; a concrete check (e.g., an auxiliary calculation showing the decay exponent is not tautological in β) would strengthen the claim that β is the single controlling parameter.
minor comments (2)
  1. [Introduction] Notation for the envelope f(ℓ) and the lag variable ℓ is introduced in the abstract but could be restated with a brief equation when first used in the main text to aid readers.
  2. [Abstract] The abstract asserts that 'the joint action of the architecture and the optimizer must be able to hold such a broad spread'; a short forward reference to the section containing the capacity constraint would improve flow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments and the positive evaluation of the manuscript's contributions. We address the two major comments below, indicating where revisions will be made to clarify the points raised.

read point-by-point responses
  1. Referee: [coarse-grained stochastic process section] The reduction from the coupled RNN state/parameter dynamics to the coarse-grained stochastic process (described in the section introducing the stochastic process and the existence proof) averages recurrent correlations and Jacobian effects into effective time-scale variables whose fluctuations are taken to be heavy-tailed. No explicit error bound or invariance argument is supplied showing that this averaging preserves the competition between the short-scale pull and the heavy-tailed pushes; if recurrent correlations alter the tail exponent, the proven existence condition for the anti-collapsed regime may not transfer to the original system.

    Authors: We agree that an explicit error bound or invariance argument would strengthen the justification for the coarse-graining step. The reduction relies on a separation of timescales between fast state dynamics and slower parameter updates, under which the effective time-scale variables inherit heavy-tailed fluctuations from the SGD noise. While the manuscript validates the regime through simulations of the full system, we will add a dedicated subsection discussing the assumptions of the averaging procedure and arguing that recurrent correlations do not modify the tail exponent β in the relevant parameter regime, thereby preserving the existence condition. This will be a partial revision as a full rigorous bound may require additional technical work beyond the current scope. revision: partial

  2. Referee: [derivation of β and decay class] The spectral exponent β is stated to govern both the spread of time scales and the asymptotic decay class of f(ℓ). The manuscript does not clarify whether this dual role follows from an independent derivation of the decay class or is built into the definition of the coarse-grained process itself; a concrete check (e.g., an auxiliary calculation showing the decay exponent is not tautological in β) would strengthen the claim that β is the single controlling parameter.

    Authors: The dual role of β is not tautological. In the coarse-grained process, β parameterizes the heavy-tailed distribution of the time-scale fluctuations. The spread of time scales follows directly from this distribution. The asymptotic decay class of the influence envelope f(ℓ) is then obtained by integrating the contributions from the stationary distribution of time scales, yielding a power-law decay whose exponent is a function of β. This calculation is derived separately from the definition of the process. We will include an auxiliary calculation in the appendix of the revised manuscript that explicitly derives the decay exponent in terms of β, demonstrating its independent emergence. revision: yes

Circularity Check

0 steps flagged

Derivation from coarse-grained stochastic process is self-contained with no load-bearing self-definition or self-citation

full rationale

The paper constructs a coarse-grained stochastic process to capture the competition between training-driven collapse toward short time scales and heavy-tailed fluctuations that sustain long scales, then derives the existence condition for the anti-collapsed regime and the governing role of the spectral exponent β directly from the properties of that process. No quoted step reduces the claimed result to a tautological redefinition of inputs, a fitted parameter renamed as prediction, or a self-citation chain; the abstract and description present the β relation as a derived consequence of the model rather than an input assumption. The requirement that architecture plus optimizer must support broad spectra is stated as an additional practical condition, not smuggled into the mathematical claim. This is the normal case of a model-based derivation that remains independent of its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Review based on abstract only; the central claim rests on the existence of a coarse-grained stochastic process that captures the competition between training-driven collapse and heavy-tailed fluctuations.

axioms (1)
  • domain assumption The coupling between state dynamics and parameter dynamics during SGD can be coarse-grained into a stochastic process whose statistics determine the decay class of f(ℓ).
    Invoked to prove exactly when the extended regime exists.

pith-pipeline@v0.9.1-grok · 5834 in / 1378 out tokens · 42173 ms · 2026-06-30T07:31:33.263998+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

68 extracted references · 35 canonical work pages · 5 internal anchors

  1. [1]

    Applebaum.Lévy Processes and Stochastic Calculus

    D. Applebaum.Lévy Processes and Stochastic Calculus. Cambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2nd edition, 2009. ISBN 9780511809781

  2. [2]

    Asabuki and C

    T. Asabuki and C. Clopath. Taming the chaos gently: a predictive alignment learning rule in recurrent neural networks.Nature Communications, 16(1):6784, 2025. doi: 10.1038/s41467-025-61309-9

  3. [3]

    Barsbey, M

    M. Barsbey, M. Sefidgaran, M. A. Erdogdu, G. Richard, and U. Şimşekli. Heavy tails in SGD and compressibilityofoverparametrizedneuralnetworks. InProceedings of the 35th International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2021. Curran Associates Inc

  4. [4]

    Bauer, W

    M. Bauer, W. Bialek, C. Goddard, C. M. Holmes, K. Krishnamurthy, S. E. Palmer, R. Pang, D. J. Schwab, and L. Susman. Optimization and variability can coexist.arXiv preprint arXiv:2505.23398, 2025

  5. [5]

    Bertschinger and T

    N. Bertschinger and T. Natschläger. Real-time computation at the edge of chaos in recurrent neural networks.Neural Computation, 16(7):1413–1436, 2004. doi: 10.1162/089976604323057443

  6. [6]

    N. H. Bingham, C. M. Goldie, and J. L. Teugels.Regular Variation, volume 27 ofEncyclopedia of Math- ematics and Its Applications. Cambridge University Press, Cambridge, UK, 1989. ISBN 9780511721434

  7. [7]

    Bonnaire, D

    T. Bonnaire, D. Ghio, K. Krishnamurthy, F. Mignacco, A. Yamamura, and G. Biroli. High-dimensional non-convex landscapes and gradient descent dynamics.Journal of Statistical Mechanics: Theory and Experiment, 2024(10):104004, Oct 2024. doi: 10.1088/1742-5468/ad2929

  8. [8]

    T. Can, K. Krishnamurthy, and D. J. Schwab. Gating creates slow modes and controls phase-space complexity in GRUs and LSTMs. InProceedings of The First Mathematical and Scientific Machine Learning Conference, volume 107, pages 476–511. PMLR, 2020

  9. [9]

    P. Carr, H. Geman, D. B. Madan, and M. Yor. The fine structure of asset returns: An empirical investigation.The Journal of Business, 75(2):305–332, Apr 2002. doi: 10.1086/338705

  10. [10]

    A. Ceni, P. Ashwin, L. Livi, and C. Postlethwaite. The echo index and multistability in input-driven recurrent neural networks.Physica D: Nonlinear Phenomena, 412:132609, 2020. doi: 10.1016/j.physd. 2020.132609

  11. [11]

    Chemnitz and M

    D. Chemnitz and M. Engel. Characterizing dynamical stability of stochastic gradient descent in over- parameterized learning.Journal of Machine Learning Research, 26(134):1–46, 2025

  12. [12]

    M. Chen, J. Pennington, and S. S. Schoenholz. Dynamical isometry and a mean field theory of RNNs: Gating enables signal propagation in recurrent neural networks. InProceedings of the 35th International Conference on Machine Learning, pages 873–882, Stockholm, Sweden, 2018. PMLR

  13. [13]

    K. Cho, B. van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pages 1724– 1734, Doha, Qatar, 2014

  14. [14]

    Cont and P

    R. Cont and P. Tankov.Financial Modelling with Jump Processes. Chapman and Hall/CRC, Boca Raton, FL, USA, 2004. ISBN 9781584884132

  15. [15]

    Şimşekli, L

    U. Şimşekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. InProceedings of the 36th International Conference on Machine Learning, pages 5827–5837, Long Beach, California, USA, 2019. PMLR. 59

  16. [16]

    de Bruijn.Asymptotic Methods in Analysis

    N. de Bruijn.Asymptotic Methods in Analysis. Dover Books on Mathematics. Dover Publications, Garden City, NY, USA, 2010. ISBN 9780486642215

  17. [17]

    A. L. M. Dekkers, J. H. J. Einmahl, and L. De Haan. A Moment Estimator for the Index of an Extreme- Value Distribution.The Annals of Statistics, 17(4):1833–1855, 1989. doi: 10.1214/aos/1176347397

  18. [18]

    Durrett.Probability: Theory and Examples

    R. Durrett.Probability: Theory and Examples. Cambridge Series in Statistical and Probabilistic Math- ematics. Cambridge University Press, Cambridge, UK, 5th edition, 2019. ISBN 9781108591034

  19. [19]

    Engelken, F

    R. Engelken, F. Wolf, and L. F. Abbott. Lyapunov spectra of chaotic recurrent neural networks.Physical Review Research, 5(4):043044, 2023. doi: 10.1103/PhysRevResearch.5.043044

  20. [20]

    N. B. Erichson, S. H. Lim, and M. W. Mahoney. Gated recurrent neural networks with weighted time-delay feedback. InProceedings of the 28th International Conference on Artificial Intelligence and Statistics, volume 258, pages 3646–3654, Mai Khao, Thailand, 2025. PMLR

  21. [21]

    Feller.An Introduction to Probability Theory and Its Applications

    W. Feller.An Introduction to Probability Theory and Its Applications. Vol. II. John Wiley & Sons, New York, USA, 2nd edition, 1971. ISBN 9780471257097

  22. [22]

    Gerbelot, E

    C. Gerbelot, E. Troiani, F. Mignacco, F. Krzakala, and L. Zdeborová. Rigorous dynamical mean-field theory for stochastic gradient descent methods.SIAM Journal on Mathematics of Data Science, 6(2): 400–427, 2024. doi: 10.1137/23M1594388

  23. [23]

    Gloter, D

    A. Gloter, D. Loukianova, and H. Mai. Jump filtering and efficient drift estimation for Lévy-driven SDEs.The Annals of Statistics, 46(4):1445–1480, 2018. doi: 10.1214/17-AOS1591

  24. [24]

    Greff, R

    K. Greff, R. K. Srivastava, J. Koutnik, B. R. Steunebrink, and J. Schmidhuber. LSTM: A search space odyssey.IEEE Transactions on Neural Networks and Learning Systems, 28(10):2222–2232, 2017. doi: 10.1109/TNNLS.2016.2582924

  25. [25]

    Gürbüzbalaban, U

    M. Gürbüzbalaban, U. Şimşekli, and L. Zhu. The heavy-tail phenomenon in SGD. InProceedings of the 38th International Conference on Machine Learning, volume 139, pages 3964–3975. PMLR, 2021

  26. [26]

    Hesse and T

    J. Hesse and T. Gross. Self-organized criticality as a fundamental property of neural systems.Frontiers in Systems Neuroscience, 8:166, 2014. doi: 10.3389/fnsys.2014.00166

  27. [27]

    B. M. Hill. A simple general approach to inference about the tail of a distribution.The Annals of Statistics, 3(5):1163–1174, 1975. doi: 10.1214/aos/1176343247

  28. [28]

    Hodgkinson and M

    L. Hodgkinson and M. W. Mahoney. Multiplicative noise and heavy tails in stochastic optimization. In Proceedings of the 38th International Conference on Machine Learning, volume 139, pages 4262–4274. PMLR, 2021

  29. [29]

    Imkeller and I

    P. Imkeller and I. Pavlyukevich. First exit times of SDEs driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006. doi: 10.1016/j.spa.2005.11.006

  30. [30]

    Kadmon and H

    J. Kadmon and H. Sompolinsky. Transition to chaos in random neuronal networks.Physical Review X, 5:041030, Nov 2015. doi: 10.1103/PhysRevX.5.041030

  31. [31]

    T. D. Kim, T. Can, and K. Krishnamurthy. Trainability, expressivity and interpretability in gated neural ODEs. InProceedings of the 40th International Conference on Machine Learning, Honolulu, Hawaii, USA, 2023. PMLR

  32. [32]

    Krishnamurthy, T

    K. Krishnamurthy, T. Can, and D. J. Schwab. Theory of gating in recurrent neural networks.Physical Review X, 12:011011, Jan 2022. doi: 10.1103/PhysRevX.12.011011

  33. [33]

    Küchler and S

    U. Küchler and S. Tappe. Tempered stable distributions and processes.Stochastic Processes and their Applications, 123(12):4256–4293, 2013. doi: 10.1016/j.spa.2013.06.012. 60

  34. [34]

    Q. Li, C. Tai, and W. E. Stochastic modified equations and adaptive stochastic gradient algorithms. InProceedings of the 34th International Conference on Machine Learning, volume 70, pages 2101–2110, Sydney, Australia, 2017. PMLR

  35. [35]

    L. Livi. Time-scale coupling between states and parameters in recurrent neural networks.arXiv preprint arXiv:2508.12121, 2025

  36. [36]

    L. Livi. Learnability window in gated recurrent neural networks.arXiv preprint arXiv:2512.05790, 2025

  37. [37]

    L. Livi, F. M. Bianchi, and C. Alippi. Determination of the edge of criticality in echo state networks through Fisher information maximization.IEEE Transactions on Neural Networks and Learning Sys- tems, 29(3):706–717, 2018. doi: 10.1109/TNNLS.2016.2644268

  38. [38]

    Loshchilov and F

    I. Loshchilov and F. Hutter. Decoupled weight decay regularization.International Conference on Learning Representations, 2019

  39. [39]

    Mandt, M

    S. Mandt, M. D. Hoffman, and D. M. Blei. Stochastic gradient descent as approximate bayesian infer- ence.Journal of Machine Learning Research, 18(134):1–35, 2017

  40. [40]

    C. H. Martin and M. W. Mahoney. Implicit self-regularization in deep neural networks: Evidence from random matrix theory and implications for learning.Journal of Machine Learning Research, 22(165): 1–73, 2021

  41. [41]

    Massar and S

    M. Massar and S. Massar. Mean-field theory of echo state networks.Physical Review E, 87(4):042809,

  42. [42]

    doi: 10.1103/PhysRevE.87.042809

  43. [43]

    Mastrogiuseppe and S

    F. Mastrogiuseppe and S. Ostojic. Linking connectivity, dynamics, and computations in low-rank recurrent neural networks.Neuron, 99(3):609–623, 2018. doi: 10.1016/j.neuron.2018.07.003

  44. [44]

    M. M. Meerschaert, Y. Zhang, and B. Baeumer. Tempered anomalous diffusion in heterogeneous sys- tems.Geophysical Research Letters, 35(17):L17403, 2008. doi: 10.1029/2008GL034899

  45. [45]

    S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018. doi: 10.1073/ pnas.1806579115

  46. [46]

    Nguyen and H

    P.-M. Nguyen and H. T. Pham. A rigorous framework for the mean field limit of multilayer neural networks.Mathematical Statistics and Learning, 6(3):201–357, 2023. doi: 10.4171/MSL/42

  47. [47]

    T. H. Nguyen, U. Şimşekli, M. Gürbüzbalaban, and G. Richard. First exit time analysis of stochastic gradient descent under heavy-tailed gradient noise. InProceedings of the 33rd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 2019. Curran Associates Inc

  48. [48]

    B. R. Olsen, S. Fatehmanesh, F. Xiao, A. Kumarappan, and A. Gajula. From SGD to spectra: A theory of neural network weight dynamics.arXiv preprint arXiv:2507.12709, 2025

  49. [49]

    Pascanu, T

    R. Pascanu, T. Mikolov, and Y. Bengio. On the difficulty of training recurrent neural networks. In Proceedings of the 30th International Conference on Machine Learning, volume 28, pages 1310–1318, Atlanta, Georgia, USA, 2013. PMLR

  50. [50]

    Poole, S

    B. Poole, S. Lahiri, M. Raghu, J. Sohl-Dickstein, and S. Ganguli. Exponential expressivity in deep neural networks through transient chaos. InProceedings of the 30th Advances in Neural Information Processing Systems, volume 29, Barcelona, Spain, 2016. Curran Associates, Inc

  51. [51]

    Rosiński

    J. Rosiński. Tempering stable processes.Stochastic Processes and their Applications, 117(6):677–707,

  52. [52]

    doi: 10.1016/j.spa.2006.10.003. 61

  53. [53]

    S. Ruder. An overview of gradient descent optimization algorithms.arXiv preprint arXiv:1609.04747, 2016

  54. [54]

    Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics

    K. Sato.Lévy Processes and Infinitely Divisible Distributions, volume 68 ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 1999. ISBN 978-1107656499

  55. [55]

    R. L. Schilling. An introduction to Lévy and feller processes. advanced courses in mathematics-CRM Barcelona 2014.arXiv preprint arXiv:1603.00251, 2016

  56. [56]

    Sieber, C

    J. Sieber, C. Amo Alonso, A. Didier, M. N. Zeilinger, and A. Orvieto. Understanding the differences in foundation models: Attention, state space models, and recurrent neural networks. InAdvances in Neural Information Processing Systems, volume 37, Vancouver, BC, Canada, 2024

  57. [57]

    There Will Be a Scientific Theory of Deep Learning

    J. Simon, D. Kunin, A. Atanasov, E. Boix-Adserà, B. Bordelon, J. Cohen, N. Ghosh, F. Guth, A. Jacot, M. Kamb, D. Karkada, E. J. Michaud, B. Ottlik, and J. Turnbull. There will be a scientific theory of deep learning.arXiv preprint arXiv:2604.21691, 2026

  58. [58]

    Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022

    J.SirignanoandK.Spiliopoulos. Meanfieldanalysisofdeepneuralnetworks.Mathematics of Operations Research, 47(1):120–152, 2022. doi: 10.1287/moor.2020.1118

  59. [59]

    S. L. Smith, B. Dherin, D. G. T. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. InInternational Conference on Learning Representations, 2021

  60. [60]

    Stanislavsky, K

    A. Stanislavsky, K. Weron, and A. Weron. Diffusion and relaxation controlled by temperedα-stable processes.Physical Review E, 78(5):051106, 2008. doi: 10.1103/PhysRevE.78.051106

  61. [61]

    Sussillo and O

    D. Sussillo and O. Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks.Neural Computation, 25(3):626–649, 2013. doi: 10.1162/NECO_a_00409

  62. [62]

    Tallec and Y

    C. Tallec and Y. Ollivier. Can recurrent neural networks warp time? InInternational Conference on Learning Representations, Vancouver, BC, Canada, 2018

  63. [63]

    Cambridge University Press, Cambridge, UK, 2014

    M.Viana.Lectures on Lyapunov Exponents, volume145ofCambridge Studies in Advanced Mathematics. Cambridge University Press, Cambridge, UK, 2014. ISBN 9781139976602

  64. [64]

    Z. Xie, I. Sato, and M. Sugiyama. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InInternational Conference on Learning Representations, 2021

  65. [65]

    S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent.arXiv preprint arXiv:1810.00004, 2018

  66. [66]

    Zhang and C

    X.-Y. Zhang and C. Tang. Heavy-tailed update distributions arise from information-driven self- organization in nonequilibrium learning.Proceedings of the National Academy of Sciences, 122(51): e2523012122, 2025. doi: 10.1073/pnas.2523012122

  67. [67]

    Ziyin, H

    L. Ziyin, H. Li, and M. Ueda. Noise balance and stationary distribution of stochastic gradient descent. Physical Review E, 111(6):065303, 2025. doi: 10.1103/PhysRevE.111.065303

  68. [68]

    Zucchet and A

    N. Zucchet and A. Orvieto. Recurrent neural networks: vanishing and exploding gradients are not the end of the story. InProceedings of the 38th International Conference on Neural Information Processing Systems, Vancouver, BC, Canada, 2024. Curran Associates Inc. 62