pith. sign in

arxiv: 2406.09241 · v3 · pith:WSDVWHG2new · submitted 2024-06-13 · 🧮 math.OC · cs.LG· math.PR· stat.ML

What is the long-run distribution of stochastic gradient descent? A large deviations analysis

Pith reviewed 2026-05-24 00:21 UTC · model grok-4.3

classification 🧮 math.OC cs.LGmath.PRstat.ML
keywords stochastic gradient descentlarge deviationsBoltzmann-Gibbs distributionnon-convex optimizationstationary distributionperturbed dynamical systemsasymptotic analysis
0
0 comments X

The pith

The long-run distribution of SGD follows a Boltzmann-Gibbs measure whose temperature equals the step-size and whose energy levels are set by the objective and noise statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies large deviations theory to stochastic gradient descent on non-convex landscapes by treating the algorithm as a randomly perturbed dynamical system. It shows that the stationary distribution of the iterates is the Boltzmann-Gibbs distribution with temperature equal to the step-size. As a result, the algorithm visits the critical region exponentially more often than any non-critical region and concentrates around the lowest-energy critical component rather than necessarily the global minimum of the objective. Frequencies among other critical components are exponentially weighted by their energy levels, so components containing only local maximizers or saddles are visited far less often than nearby components of local minimizers.

Core claim

Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, the problem's critical region is visited exponentially more often than any non-critical region; the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); all other connected components of critical points are visited with frequency,

What carries the argument

The large-deviations rate function obtained by viewing SGD iterates as a randomly perturbed dynamical system, with the rate function constructed explicitly from the objective and the noise covariance.

If this is right

  • The critical region is visited exponentially more often than any non-critical region.
  • Iterates concentrate exponentially around the minimum-energy state rather than always the global minimum.
  • Connected components of critical points are visited with frequency exponentially proportional to their energy level.
  • Any component consisting only of local maximizers or saddles is visited exponentially less often than a connected component of local minimizers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Step-size selection therefore controls not only speed but also which basin the iterates ultimately favor in the long run.
  • The same large-deviations mechanism could be used to compare the implicit bias of different noise models or different stochastic first-order methods.
  • In high-dimensional problems the explicit rate function may allow prediction of which spurious local minima are effectively avoided.

Load-bearing premise

The SGD trajectory can be treated as a randomly perturbed dynamical system to which the large-deviations principle applies directly.

What would settle it

Numerical runs that track the empirical long-run frequency of visits to different connected components of critical points and show that these frequencies do not scale exponentially with the energy levels computed from the objective and noise covariance.

Figures

Figures reproduced from arXiv: 2406.09241 by Franck Iutzeler, J\'er\^ome Malick, Panayotis Mertikopoulos, Wa\"iss Azizian.

Figure 1
Figure 1. Figure 1: Graphical illustration of Theorems 1–4 for the Himmelblau test function f(x1, x2) = (x 2 1 + x2 − 11)2 + (x1 + x 2 2 − 7)2 . The figure to the left depicts the loss landscape of f with several (deterministic) orbits of the gradient flow of f superimposed for visual convenience. The figure in the middle highlights the 9 critical components of f as well as the “most likely” transitions between them: K1, K3, … view at source ↗
read the original abstract

In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper claims that the long-run occupation measure of SGD on non-convex problems obeys a large-deviations principle and therefore takes a Boltzmann-Gibbs form whose temperature equals the step-size and whose energy levels are determined by the objective together with the noise covariance; the resulting stationary distribution implies the four stated properties (a)–(d) on the relative visitation frequencies of critical components.

Significance. If the derivation is correct, the result supplies an explicit, falsifiable characterization of SGD’s stationary behavior that explains why the algorithm preferentially visits certain critical components over others, even when those components are not global minimizers of the objective. The explicit construction of the rate function from the objective and noise statistics, together with the invocation of standard Freidlin–Wentzell-type arguments for randomly perturbed dynamical systems, constitutes a clear technical contribution.

minor comments (2)
  1. [Abstract] Abstract: the high-level claim would be more informative if it included the explicit functional form of the rate function (or at least the quasipotential) that appears in the large-deviations statement.
  2. [Setup / Assumptions] The manuscript should state the precise moment and Lipschitz conditions on the stochastic gradient noise that are needed for the large-deviations principle to hold; these are standard but should be recorded explicitly rather than left implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of the manuscript, including the accurate summary of the main claims and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation applies external large-deviations theory

full rationale

The paper invokes the large deviations principle for randomly perturbed dynamical systems (Freidlin-Wentzell theory and discrete-time analogues) to obtain an explicit rate function for the stationary occupation measure of SGD. This rate function is constructed directly from the objective function and the noise covariance, without any parameter fitting, self-definition of energy levels, or load-bearing self-citations. The Boltzmann-Gibbs form and properties (a)-(d) are standard consequences of the resulting quasipotential under the stated local Lipschitz and moment assumptions. The central claim therefore remains independent of the paper's own inputs and does not reduce to a renaming or tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of large-deviations theory to the discrete-time SGD recursion viewed as a randomly perturbed flow; no free parameters are fitted inside the abstract, but the energy function itself is defined from the objective and noise statistics whose precise form is not given.

axioms (1)
  • domain assumption Large deviations principle holds for the SGD process under the problem's noise statistics
    Invoked in the sentence 'Using an approach based on the theory of large deviations and randomly perturbed dynamical systems' to obtain the rate function that defines the Boltzmann-Gibbs measure.

pith-pipeline@v0.9.0 · 5762 in / 1438 out tokens · 22924 ms · 2026-05-24T00:21:08.972337+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD

    math.OC 2026-04 unverdicted novelty 7.0

    Combining random reshuffling and Richardson-Romberg extrapolation yields cubic bias refinement and better MSE for constant-step SGD on structured non-monotone variational inequalities.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Alongi, J. M. and Nelson, G. S.Recurrence and Topology, volume 85 ofGraduate Studies in Mathematics. American Mathematical Society, 2007

  2. [2]

    AdaGrad avoids saddle points

    Antonakopoulos, K., Mertikopoulos, P., Piliouras, G., and Wang, X. AdaGrad avoids saddle points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

  3. [3]

    Dynamics of stochastic approximation algorithms

    Benaïm, M. Dynamics of stochastic approximation algorithms. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.),Séminaire de Probabilités XXXIII, volume 1709 ofLecture Notes in Mathematics, pp. 1–68. Springer Berlin Heidelberg, 1999

  4. [4]

    and Hirsch, M

    Benaïm, M. and Hirsch, M. W. Dynamics of Morse-Smale urn processes.Ergodic Theory and Dynamical Systems, 15(6):1005–1030, December 1995

  5. [5]

    Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors.SIAM Journal on Optimization, 10(3):627–642, 2000

  6. [6]

    Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

    Blanc, G., Gupta, N., Valiant, G., and Valiant, P. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. InCOLT ’20: Proceedings of the 33rd Annual Conference on Learning Theory, 2020

  7. [7]

    D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series

    Brown, L. D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series. Institute of Mathematical Statistics, 1986

  8. [8]

    The loss surfaces of multilayer networks

    Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. InAISTATS ’15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015

  9. [9]

    An introduction too-minimal geometry, 1999

    Coste, M. An introduction too-minimal geometry, 1999

  10. [10]

    Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization

    Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., andBengio, Y. Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization. InNIPS ’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014

  11. [11]

    D.Large Deviations for Markov Chains

    de Acosta, A. D.Large Deviations for Markov Chains. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, UK, 2022

  12. [12]

    and Zeitouni, O.Large Deviations Techniques and Applications

    Dembo, A. and Zeitouni, O.Large Deviations Techniques and Applications. Springer-Verlag, Berlin, 1998

  13. [13]

    Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

    Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

  14. [14]

    Springer Series in Operations Research and Financial Engineering

    Douc, R., Moulines, E., Priouret, P., and Soulier, P.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

  15. [15]

    Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

    Dupuis, P. Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

  16. [16]

    and Kushner, H

    Dupuis, P. and Kushner, H. J. Stochastic approximations via large deviations: Asymptotic properties. SIAM Journal on Control and Optimization, 23(5):675–696, September 1985

  17. [17]

    Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

    Eberle, A. Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

  18. [18]

    A., Mackey, L

    Erdogdu, M. A., Mackey, L. W., and Shamir, O. Global non-convex optimization with discretized diffusions. InNeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018

  19. [19]

    Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

    Feng, Y., Gao, T., Li, L., Liu, J.-G., and Lu, Y. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

  20. [20]

    Freidlin, M. I. and Wentzell, A. D.Random Perturbations of Dynamical Systems. Springer, 1 edition, 1998

  21. [21]

    Escaping from saddle points – Online stochastic gradient for tensor decomposition

    Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points – Online stochastic gradient for tensor decomposition. InCOLT ’15: Proceedings of the 28th Annual Conference on Learning Theory, 2015

  22. [22]

    Gulinsky, O. V. and Veretennikov, A. Y.Large Deviations for Discrete-Time Processes with Averaging. De Gruyter, 1993

  23. [23]

    The heavy-tail phenomenon in SGD

    Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. The heavy-tail phenomenon in SGD. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021. THE LONG-RUN DISTRIBUTION OF STOCHASTIC GRADIENT DESCENT 69

  24. [24]

    and Lasserre, J.-B.Markov Chains and Invariant Probabilities

    Hernández-Lerma, O. and Lasserre, J.-B.Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003

  25. [25]

    and Mahoney, M

    Hodgkinson, L. and Mahoney, M. Multiplicative noise and heavy tails in stochastic optimization. In ICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

  26. [26]

    The limits of min-max optimization algorithms: Convergence to spurious non-critical sets

    Hsieh, Y.-P., Mertikopoulos, P., and Cevher, V. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

  27. [27]

    R., Krause, A., and Mertikopoulos, P

    Hsieh, Y.-P., Karimi, M. R., Krause, A., and Mertikopoulos, P. Riemannian stochastic optimization methods avoid strict saddle points. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

  28. [28]

    J., Li, L., and Liu, J.-G

    Hu, W., Li, C. J., Li, L., and Liu, J.-G. On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications, 4(1):3–32, 2019

  29. [29]

    D., Tikhomirov, V

    Ioffe, A. D., Tikhomirov, V. M., and Makowski, K.Theory of Extremal Problems. North Holland, Amsterdam, NL, 1979

  30. [30]

    Three Factors Influencing Minima in SGD

    Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD.https://arxiv.org/abs/1711.04623, 2017

  31. [31]

    M., and Jordan, M

    Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

  32. [32]

    Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling

    Kallenberg, O. Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 2021

  33. [33]

    Deep learning without poor local minima

    Kawaguchi, K. Deep learning without poor local minima. In NIPS ’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016

  34. [34]

    and Wolfowitz, J

    Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.The Annals of Mathematical Statistics, 23(3):462–466, 1952

  35. [35]

    Random Perturbations of Dynamical Systems

    Kifer, Y. Random Perturbations of Dynamical Systems. Birkhäuser, Boston, MA, 1988

  36. [36]

    A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

    Kifer, Y. A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

  37. [37]

    First-order and Stochastic Optimization Methods for Machine Learning

    Lan, G. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020

  38. [38]

    Landau, L. D. and Lifshitz, E. M. Statistical physics. InCourse of Theoretical Physics, volume 5. Pergamon Press, Oxford, 1976

  39. [39]

    and Wang, Y

    Li, L. and Wang, Y. A sharp uniform-in-time error estimate for stochastic gradient Langevin dynamics. https://arxiv.org/abs/2207.09304, 2022

  40. [40]

    and Wang, Y

    Li, L. and Wang, Y. On uniform-in-time diffusion approximation for stochastic gradient descent. https://arxiv.org/abs/2207.04922, 2022

  41. [41]

    Stochastic modified equations and adaptive stochastic gradient algorithms

    Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

  42. [42]

    Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

    Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

  43. [43]

    Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate

    Li, Z., Lyu, K., and Arora, S. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  44. [44]

    What happens after SGD reaches zero loss? –A mathematical framework

    Li, Z., Wang, T., and Arora, S. What happens after SGD reaches zero loss? –A mathematical framework. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

  45. [45]

    Fast mixing of stochastic gradient descent with normalization and weight decay

    Li, Z., Wang, T., and Yu, D. Fast mixing of stochastic gradient descent with normalization and weight decay. In NeurIPS ’22: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022

  46. [46]

    Bad global minima exist and SGD can reach them

    Liu, S., Papailiopoulos, D., and Achlioptas, D. Bad global minima exist and SGD can reach them. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  47. [47]

    Analysis of recursive stochastic algorithms.IEEE Trans

    Ljung, L. Analysis of recursive stochastic algorithms.IEEE Trans. Autom. Control, 22(4):551–575, August 1977

  48. [48]

    Lu, Y., Balasubramanian, K., Volgushev, S., and Erdogdu, M. A. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. InNeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. 70 W. AZIZIAN, F. IUTZELER, J. MALICK, AND P. MERTIKOPOULOS

  49. [49]

    and Mertikopoulos, P

    Lytras, I. and Mertikopoulos, P. Tamed Langevin sampling under weaker conditions.https://arxiv.org/ abs/2405.17693, 2024

  50. [50]

    B., Mijatović, A., and Szpruch, Ł

    Majka, M. B., Mijatović, A., and Szpruch, Ł. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534–1581, 2020

  51. [51]

    and Staudigl, M

    Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018

  52. [52]

    and Staudigl, M

    Mertikopoulos, P. and Staudigl, M. Stochastic mirror descent dynamics and their convergence in monotone variational inequalities.Journal of Optimization Theory and Applications, 179(3):838–867, December 2018

  53. [53]

    On the almost sure convergence of stochastic gradient descent in non-convex problems

    Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  54. [54]

    A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

    Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

  55. [55]

    and Urbani, P

    Mignacco, F. and Urbani, P. The effective noise of stochastic gradient descent.Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, 2022

  56. [56]

    Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification

    Mignacco, F., Krzakala, F., Urbani, P., and Zdeborová, L. Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

  57. [57]

    Power-law escape rate of SGD

    Mori, T., Ziyin, L., Liu, K., and Ueda, M. Power-law escape rate of SGD. InICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

  58. [58]

    L., Durmus, A., and Şimşekli, U

    Pavasovic, K. L., Durmus, A., and Şimşekli, U. Approximate heavy tails in offline (multi-pass) stochastic gradient descent. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

  59. [59]

    Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

    Pemantle, R. Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

  60. [60]

    Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis

    Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. InCOLT ’17: Proceedings of the 30th Annual Conference on Learning Theory, 2017

  61. [61]

    and Hütter, J.-C

    Rigollet, P. and Hütter, J.-C. High-dimensional statistics.https://arxiv.org/abs/2310.19244, 2023

  62. [62]

    and Monro, S

    Robbins, H. and Monro, S. A stochastic approximation method.Annals of Mathematical Statistics, 22: 400–407, 1951

  63. [63]

    and Miller, C

    van den Dries, L. and Miller, C. Geometric categories ando-minimal structures. Duke Mathematical Journal, 84(2), August 1996

  64. [64]

    Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

    Veiga, R., Remizova, A., and Macris, N. Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

  65. [65]

    J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48

    Wainwright, M. J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019

  66. [66]

    and Wang, Z

    Wang, Y. and Wang, Z. Three-stage evolution and fast equilibrium for SGD with non-degerate critical points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

  67. [67]

    Stochastic gradient descent with noise of machine learning type

    Wojtowytsch, S. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023

  68. [68]

    A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

    Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

  69. [69]

    Yang, J., Hu, W., and Li, C. J. On the fast convergence of random perturbations of the gradient flow. Asymptotic Analysis, 122(3-4):371–393, 2021

  70. [70]

    P., and Glynn, P

    Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S. P., and Glynn, P. W. On the convergence of mirror descent beyond stochastic convex programming.SIAM Journal on Optimization, 30(1):687–716, 2020

  71. [71]

    Law of balance and stationary distribution of stochastic gradient descent

    Ziyin, L., Li, H., and Ueda, M. Law of balance and stationary distribution of stochastic gradient descent. https://arxiv.org/abs/2308.06671, 2023