What is the long-run distribution of stochastic gradient descent? A large deviations analysis
Pith reviewed 2026-05-24 00:21 UTC · model grok-4.3
The pith
The long-run distribution of SGD follows a Boltzmann-Gibbs measure whose temperature equals the step-size and whose energy levels are set by the objective and noise statistics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, the problem's critical region is visited exponentially more often than any non-critical region; the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); all other connected components of critical points are visited with frequency,
What carries the argument
The large-deviations rate function obtained by viewing SGD iterates as a randomly perturbed dynamical system, with the rate function constructed explicitly from the objective and the noise covariance.
If this is right
- The critical region is visited exponentially more often than any non-critical region.
- Iterates concentrate exponentially around the minimum-energy state rather than always the global minimum.
- Connected components of critical points are visited with frequency exponentially proportional to their energy level.
- Any component consisting only of local maximizers or saddles is visited exponentially less often than a connected component of local minimizers.
Where Pith is reading between the lines
- Step-size selection therefore controls not only speed but also which basin the iterates ultimately favor in the long run.
- The same large-deviations mechanism could be used to compare the implicit bias of different noise models or different stochastic first-order methods.
- In high-dimensional problems the explicit rate function may allow prediction of which spurious local minima are effectively avoided.
Load-bearing premise
The SGD trajectory can be treated as a randomly perturbed dynamical system to which the large-deviations principle applies directly.
What would settle it
Numerical runs that track the empirical long-run frequency of visits to different connected components of critical points and show that these frequencies do not scale exponentially with the energy levels computed from the objective and noise covariance.
Figures
read the original abstract
In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that the long-run occupation measure of SGD on non-convex problems obeys a large-deviations principle and therefore takes a Boltzmann-Gibbs form whose temperature equals the step-size and whose energy levels are determined by the objective together with the noise covariance; the resulting stationary distribution implies the four stated properties (a)–(d) on the relative visitation frequencies of critical components.
Significance. If the derivation is correct, the result supplies an explicit, falsifiable characterization of SGD’s stationary behavior that explains why the algorithm preferentially visits certain critical components over others, even when those components are not global minimizers of the objective. The explicit construction of the rate function from the objective and noise statistics, together with the invocation of standard Freidlin–Wentzell-type arguments for randomly perturbed dynamical systems, constitutes a clear technical contribution.
minor comments (2)
- [Abstract] Abstract: the high-level claim would be more informative if it included the explicit functional form of the rate function (or at least the quasipotential) that appears in the large-deviations statement.
- [Setup / Assumptions] The manuscript should state the precise moment and Lipschitz conditions on the stochastic gradient noise that are needed for the large-deviations principle to hold; these are standard but should be recorded explicitly rather than left implicit.
Simulated Author's Rebuttal
We thank the referee for the careful reading and positive assessment of the manuscript, including the accurate summary of the main claims and the recommendation for minor revision. No specific major comments were raised in the report.
Circularity Check
No significant circularity; derivation applies external large-deviations theory
full rationale
The paper invokes the large deviations principle for randomly perturbed dynamical systems (Freidlin-Wentzell theory and discrete-time analogues) to obtain an explicit rate function for the stationary occupation measure of SGD. This rate function is constructed directly from the objective function and the noise covariance, without any parameter fitting, self-definition of energy levels, or load-bearing self-citations. The Boltzmann-Gibbs form and properties (a)-(d) are standard consequences of the resulting quasipotential under the stated local Lipschitz and moment assumptions. The central claim therefore remains independent of the paper's own inputs and does not reduce to a renaming or tautology.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Large deviations principle holds for the SGD process under the problem's noise statistics
Forward citations
Cited by 1 Pith paper
-
Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD
Combining random reshuffling and Richardson-Romberg extrapolation yields cubic bias refinement and better MSE for constant-step SGD on structured non-monotone variational inequalities.
Reference graph
Works this paper leans on
-
[1]
Alongi, J. M. and Nelson, G. S.Recurrence and Topology, volume 85 ofGraduate Studies in Mathematics. American Mathematical Society, 2007
work page 2007
-
[2]
Antonakopoulos, K., Mertikopoulos, P., Piliouras, G., and Wang, X. AdaGrad avoids saddle points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022
work page 2022
-
[3]
Dynamics of stochastic approximation algorithms
Benaïm, M. Dynamics of stochastic approximation algorithms. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.),Séminaire de Probabilités XXXIII, volume 1709 ofLecture Notes in Mathematics, pp. 1–68. Springer Berlin Heidelberg, 1999
work page 1999
-
[4]
Benaïm, M. and Hirsch, M. W. Dynamics of Morse-Smale urn processes.Ergodic Theory and Dynamical Systems, 15(6):1005–1030, December 1995
work page 1995
-
[5]
Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors.SIAM Journal on Optimization, 10(3):627–642, 2000
work page 2000
-
[6]
Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process
Blanc, G., Gupta, N., Valiant, G., and Valiant, P. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. InCOLT ’20: Proceedings of the 33rd Annual Conference on Learning Theory, 2020
work page 2020
-
[7]
Brown, L. D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series. Institute of Mathematical Statistics, 1986
work page 1986
-
[8]
The loss surfaces of multilayer networks
Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. InAISTATS ’15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015
work page 2015
-
[9]
An introduction too-minimal geometry, 1999
Coste, M. An introduction too-minimal geometry, 1999
work page 1999
-
[10]
Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization
Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., andBengio, Y. Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization. InNIPS ’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014
work page 2014
-
[11]
D.Large Deviations for Markov Chains
de Acosta, A. D.Large Deviations for Markov Chains. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, UK, 2022
work page 2022
-
[12]
and Zeitouni, O.Large Deviations Techniques and Applications
Dembo, A. and Zeitouni, O.Large Deviations Techniques and Applications. Springer-Verlag, Berlin, 1998
work page 1998
-
[13]
Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020
work page 2020
-
[14]
Springer Series in Operations Research and Financial Engineering
Douc, R., Moulines, E., Priouret, P., and Soulier, P.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018
work page 2018
-
[15]
Dupuis, P. Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988
work page 1988
-
[16]
Dupuis, P. and Kushner, H. J. Stochastic approximations via large deviations: Asymptotic properties. SIAM Journal on Control and Optimization, 23(5):675–696, September 1985
work page 1985
-
[17]
Eberle, A. Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016
work page 2016
-
[18]
Erdogdu, M. A., Mackey, L. W., and Shamir, O. Global non-convex optimization with discretized diffusions. InNeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018
work page 2018
-
[19]
Feng, Y., Gao, T., Li, L., Liu, J.-G., and Lu, Y. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020
work page 2020
-
[20]
Freidlin, M. I. and Wentzell, A. D.Random Perturbations of Dynamical Systems. Springer, 1 edition, 1998
work page 1998
-
[21]
Escaping from saddle points – Online stochastic gradient for tensor decomposition
Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points – Online stochastic gradient for tensor decomposition. InCOLT ’15: Proceedings of the 28th Annual Conference on Learning Theory, 2015
work page 2015
-
[22]
Gulinsky, O. V. and Veretennikov, A. Y.Large Deviations for Discrete-Time Processes with Averaging. De Gruyter, 1993
work page 1993
-
[23]
The heavy-tail phenomenon in SGD
Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. The heavy-tail phenomenon in SGD. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021. THE LONG-RUN DISTRIBUTION OF STOCHASTIC GRADIENT DESCENT 69
work page 2021
-
[24]
and Lasserre, J.-B.Markov Chains and Invariant Probabilities
Hernández-Lerma, O. and Lasserre, J.-B.Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003
work page 2003
-
[25]
Hodgkinson, L. and Mahoney, M. Multiplicative noise and heavy tails in stochastic optimization. In ICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021
work page 2021
-
[26]
The limits of min-max optimization algorithms: Convergence to spurious non-critical sets
Hsieh, Y.-P., Mertikopoulos, P., and Cevher, V. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021
work page 2021
-
[27]
R., Krause, A., and Mertikopoulos, P
Hsieh, Y.-P., Karimi, M. R., Krause, A., and Mertikopoulos, P. Riemannian stochastic optimization methods avoid strict saddle points. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023
work page 2023
-
[28]
Hu, W., Li, C. J., Li, L., and Liu, J.-G. On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications, 4(1):3–32, 2019
work page 2019
-
[29]
Ioffe, A. D., Tikhomirov, V. M., and Makowski, K.Theory of Extremal Problems. North Holland, Amsterdam, NL, 1979
work page 1979
-
[30]
Three Factors Influencing Minima in SGD
Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD.https://arxiv.org/abs/1711.04623, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[31]
Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017
work page 2017
-
[32]
Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling
Kallenberg, O. Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 2021
work page 2021
-
[33]
Deep learning without poor local minima
Kawaguchi, K. Deep learning without poor local minima. In NIPS ’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016
work page 2016
-
[34]
Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.The Annals of Mathematical Statistics, 23(3):462–466, 1952
work page 1952
-
[35]
Random Perturbations of Dynamical Systems
Kifer, Y. Random Perturbations of Dynamical Systems. Birkhäuser, Boston, MA, 1988
work page 1988
-
[36]
Kifer, Y. A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990
work page 1990
-
[37]
First-order and Stochastic Optimization Methods for Machine Learning
Lan, G. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020
work page 2020
-
[38]
Landau, L. D. and Lifshitz, E. M. Statistical physics. InCourse of Theoretical Physics, volume 5. Pergamon Press, Oxford, 1976
work page 1976
-
[39]
Li, L. and Wang, Y. A sharp uniform-in-time error estimate for stochastic gradient Langevin dynamics. https://arxiv.org/abs/2207.09304, 2022
-
[40]
Li, L. and Wang, Y. On uniform-in-time diffusion approximation for stochastic gradient descent. https://arxiv.org/abs/2207.04922, 2022
-
[41]
Stochastic modified equations and adaptive stochastic gradient algorithms
Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017
work page 2017
-
[42]
Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019
work page 2019
-
[43]
Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate
Li, Z., Lyu, K., and Arora, S. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020
work page 2020
-
[44]
What happens after SGD reaches zero loss? –A mathematical framework
Li, Z., Wang, T., and Arora, S. What happens after SGD reaches zero loss? –A mathematical framework. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021
work page 2021
-
[45]
Fast mixing of stochastic gradient descent with normalization and weight decay
Li, Z., Wang, T., and Yu, D. Fast mixing of stochastic gradient descent with normalization and weight decay. In NeurIPS ’22: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022
work page 2022
-
[46]
Bad global minima exist and SGD can reach them
Liu, S., Papailiopoulos, D., and Achlioptas, D. Bad global minima exist and SGD can reach them. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020
work page 2020
-
[47]
Analysis of recursive stochastic algorithms.IEEE Trans
Ljung, L. Analysis of recursive stochastic algorithms.IEEE Trans. Autom. Control, 22(4):551–575, August 1977
work page 1977
-
[48]
Lu, Y., Balasubramanian, K., Volgushev, S., and Erdogdu, M. A. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. InNeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. 70 W. AZIZIAN, F. IUTZELER, J. MALICK, AND P. MERTIKOPOULOS
work page 2021
-
[49]
Lytras, I. and Mertikopoulos, P. Tamed Langevin sampling under weaker conditions.https://arxiv.org/ abs/2405.17693, 2024
-
[50]
B., Mijatović, A., and Szpruch, Ł
Majka, M. B., Mijatović, A., and Szpruch, Ł. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534–1581, 2020
work page 2020
-
[51]
Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018
work page 2018
-
[52]
Mertikopoulos, P. and Staudigl, M. Stochastic mirror descent dynamics and their convergence in monotone variational inequalities.Journal of Optimization Theory and Applications, 179(3):838–867, December 2018
work page 2018
-
[53]
On the almost sure convergence of stochastic gradient descent in non-convex problems
Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020
work page 2020
-
[54]
Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024
work page 2024
-
[55]
Mignacco, F. and Urbani, P. The effective noise of stochastic gradient descent.Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, 2022
work page 2022
-
[56]
Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification
Mignacco, F., Krzakala, F., Urbani, P., and Zdeborová, L. Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020
work page 2020
-
[57]
Mori, T., Ziyin, L., Liu, K., and Ueda, M. Power-law escape rate of SGD. InICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022
work page 2022
-
[58]
L., Durmus, A., and Şimşekli, U
Pavasovic, K. L., Durmus, A., and Şimşekli, U. Approximate heavy tails in offline (multi-pass) stochastic gradient descent. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023
work page 2023
-
[59]
Pemantle, R. Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990
work page 1990
-
[60]
Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis
Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. InCOLT ’17: Proceedings of the 30th Annual Conference on Learning Theory, 2017
work page 2017
-
[61]
Rigollet, P. and Hütter, J.-C. High-dimensional statistics.https://arxiv.org/abs/2310.19244, 2023
-
[62]
Robbins, H. and Monro, S. A stochastic approximation method.Annals of Mathematical Statistics, 22: 400–407, 1951
work page 1951
-
[63]
van den Dries, L. and Miller, C. Geometric categories ando-minimal structures. Duke Mathematical Journal, 84(2), August 1996
work page 1996
-
[64]
Veiga, R., Remizova, A., and Macris, N. Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024
-
[65]
J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48
Wainwright, M. J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019
work page 2019
-
[66]
Wang, Y. and Wang, Z. Three-stage evolution and fast equilibrium for SGD with non-degerate critical points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022
work page 2022
-
[67]
Stochastic gradient descent with noise of machine learning type
Wojtowytsch, S. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023
work page 2023
-
[68]
Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021
work page 2021
-
[69]
Yang, J., Hu, W., and Li, C. J. On the fast convergence of random perturbations of the gradient flow. Asymptotic Analysis, 122(3-4):371–393, 2021
work page 2021
-
[70]
Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S. P., and Glynn, P. W. On the convergence of mirror descent beyond stochastic convex programming.SIAM Journal on Optimization, 30(1):687–716, 2020
work page 2020
-
[71]
Law of balance and stationary distribution of stochastic gradient descent
Ziyin, L., Li, H., and Ueda, M. Law of balance and stationary distribution of stochastic gradient descent. https://arxiv.org/abs/2308.06671, 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.