What is the long-run distribution of stochastic gradient descent? A large deviations analysis

Franck Iutzeler; J\'er\^ome Malick; Panayotis Mertikopoulos; Wa\"iss Azizian

arxiv: 2406.09241 · v3 · pith:WSDVWHG2new · submitted 2024-06-13 · 🧮 math.OC · cs.LG· math.PR· stat.ML

What is the long-run distribution of stochastic gradient descent? A large deviations analysis

Wa\"iss Azizian , Franck Iutzeler , J\'er\^ome Malick , Panayotis Mertikopoulos This is my paper

Pith reviewed 2026-05-24 00:21 UTC · model grok-4.3

classification 🧮 math.OC cs.LGmath.PRstat.ML

keywords stochastic gradient descentlarge deviationsBoltzmann-Gibbs distributionnon-convex optimizationstationary distributionperturbed dynamical systemsasymptotic analysis

0 comments

The pith

The long-run distribution of SGD follows a Boltzmann-Gibbs measure whose temperature equals the step-size and whose energy levels are set by the objective and noise statistics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies large deviations theory to stochastic gradient descent on non-convex landscapes by treating the algorithm as a randomly perturbed dynamical system. It shows that the stationary distribution of the iterates is the Boltzmann-Gibbs distribution with temperature equal to the step-size. As a result, the algorithm visits the critical region exponentially more often than any non-critical region and concentrates around the lowest-energy critical component rather than necessarily the global minimum of the objective. Frequencies among other critical components are exponentially weighted by their energy levels, so components containing only local maximizers or saddles are visited far less often than nearby components of local minimizers.

Core claim

Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, the problem's critical region is visited exponentially more often than any non-critical region; the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); all other connected components of critical points are visited with frequency,

What carries the argument

The large-deviations rate function obtained by viewing SGD iterates as a randomly perturbed dynamical system, with the rate function constructed explicitly from the objective and the noise covariance.

If this is right

The critical region is visited exponentially more often than any non-critical region.
Iterates concentrate exponentially around the minimum-energy state rather than always the global minimum.
Connected components of critical points are visited with frequency exponentially proportional to their energy level.
Any component consisting only of local maximizers or saddles is visited exponentially less often than a connected component of local minimizers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Step-size selection therefore controls not only speed but also which basin the iterates ultimately favor in the long run.
The same large-deviations mechanism could be used to compare the implicit bias of different noise models or different stochastic first-order methods.
In high-dimensional problems the explicit rate function may allow prediction of which spurious local minima are effectively avoided.

Load-bearing premise

The SGD trajectory can be treated as a randomly perturbed dynamical system to which the large-deviations principle applies directly.

What would settle it

Numerical runs that track the empirical long-run frequency of visits to different connected components of critical points and show that these frequencies do not scale exponentially with the energy levels computed from the objective and noise covariance.

Figures

Figures reproduced from arXiv: 2406.09241 by Franck Iutzeler, J\'er\^ome Malick, Panayotis Mertikopoulos, Wa\"iss Azizian.

**Figure 1.** Figure 1: Graphical illustration of Theorems 1–4 for the Himmelblau test function f(x1, x2) = (x 2 1 + x2 − 11)2 + (x1 + x 2 2 − 7)2 . The figure to the left depicts the loss landscape of f with several (deterministic) orbits of the gradient flow of f superimposed for visual convenience. The figure in the middle highlights the 9 critical components of f as well as the “most likely” transitions between them: K1, K3, … view at source ↗

read the original abstract

In this paper, we examine the long-run distribution of stochastic gradient descent (SGD) in general, non-convex problems. Specifically, we seek to understand which regions of the problem's state space are more likely to be visited by SGD, and by how much. Using an approach based on the theory of large deviations and randomly perturbed dynamical systems, we show that the long-run distribution of SGD resembles the Boltzmann-Gibbs distribution of equilibrium thermodynamics with temperature equal to the method's step-size and energy levels determined by the problem's objective and the statistics of the noise. In particular, we show that, in the long run, (a) the problem's critical region is visited exponentially more often than any non-critical region; (b) the iterates of SGD are exponentially concentrated around the problem's minimum energy state (which does not always coincide with the global minimum of the objective); (c) all other connected components of critical points are visited with frequency that is exponentially proportional to their energy level; and, finally (d) any component of local maximizers or saddle points is "dominated" by a component of local minimizers which is visited exponentially more often.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SGD long-run distribution is Boltzmann-Gibbs with step-size temperature, derived via large deviations on the discrete recursion.

read the letter

The key thing to know is that the paper uses large deviations to show SGD's long-run distribution is Boltzmann-Gibbs with temperature equal to the step size. The energy levels depend on the objective and noise, leading to exponential preferences for lower-energy critical components over others, and critical regions over non-critical ones. The minimum-energy state need not be the global minimizer, and maximizer or saddle components are dominated exponentially by nearby minimizer components. The paper does well by giving a first-principles derivation that produces this explicit form for general non-convex problems. It spells out the implications clearly for visitation frequencies and supplies a single framework that organizes a lot of existing intuition about where SGD spends time asymptotically. The approach treats the SGD trajectory as a randomly perturbed dynamical system and applies standard large deviations results. The stress-test note finds this consistent with Freidlin-Wentzell-type theory for discrete time under usual local Lipschitz and moment conditions on the noise, with no internal contradictions or missing hypotheses that would block the rate-function derivation. The circularity burden is low because the construction uses external theory rather than quantities fitted from the paper itself. Soft spots are minor. The abstract supplies no equations or explicit bounds, so the precise rate function and the exact technical conditions on noise covariance and step-size schedule need checking in the body. If those hold as stated, the central argument looks solid. This paper is for optimization theorists working on asymptotic analysis of first-order stochastic methods and for readers interested in thermodynamic analogies in machine learning. It deserves a serious referee to verify the technical details and assess applicability to common SGD variants. I would bring it to a reading group to discuss the large-deviations setup.

Referee Report

0 major / 2 minor

Summary. The paper claims that the long-run occupation measure of SGD on non-convex problems obeys a large-deviations principle and therefore takes a Boltzmann-Gibbs form whose temperature equals the step-size and whose energy levels are determined by the objective together with the noise covariance; the resulting stationary distribution implies the four stated properties (a)–(d) on the relative visitation frequencies of critical components.

Significance. If the derivation is correct, the result supplies an explicit, falsifiable characterization of SGD’s stationary behavior that explains why the algorithm preferentially visits certain critical components over others, even when those components are not global minimizers of the objective. The explicit construction of the rate function from the objective and noise statistics, together with the invocation of standard Freidlin–Wentzell-type arguments for randomly perturbed dynamical systems, constitutes a clear technical contribution.

minor comments (2)

[Abstract] Abstract: the high-level claim would be more informative if it included the explicit functional form of the rate function (or at least the quasipotential) that appears in the large-deviations statement.
[Setup / Assumptions] The manuscript should state the precise moment and Lipschitz conditions on the stochastic gradient noise that are needed for the large-deviations principle to hold; these are standard but should be recorded explicitly rather than left implicit.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the careful reading and positive assessment of the manuscript, including the accurate summary of the main claims and the recommendation for minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; derivation applies external large-deviations theory

full rationale

The paper invokes the large deviations principle for randomly perturbed dynamical systems (Freidlin-Wentzell theory and discrete-time analogues) to obtain an explicit rate function for the stationary occupation measure of SGD. This rate function is constructed directly from the objective function and the noise covariance, without any parameter fitting, self-definition of energy levels, or load-bearing self-citations. The Boltzmann-Gibbs form and properties (a)-(d) are standard consequences of the resulting quasipotential under the stated local Lipschitz and moment assumptions. The central claim therefore remains independent of the paper's own inputs and does not reduce to a renaming or tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the applicability of large-deviations theory to the discrete-time SGD recursion viewed as a randomly perturbed flow; no free parameters are fitted inside the abstract, but the energy function itself is defined from the objective and noise statistics whose precise form is not given.

axioms (1)

domain assumption Large deviations principle holds for the SGD process under the problem's noise statistics
Invoked in the sentence 'Using an approach based on the theory of large deviations and randomly perturbed dynamical systems' to obtain the rate function that defines the Boltzmann-Gibbs measure.

pith-pipeline@v0.9.0 · 5762 in / 1438 out tokens · 22924 ms · 2026-05-24T00:21:08.972337+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Shuffling the Data, Stretching the Step-size: Sharper Bias in constant step-size SGD
math.OC 2026-04 unverdicted novelty 7.0

Combining random reshuffling and Richardson-Romberg extrapolation yields cubic bias refinement and better MSE for constant-step SGD on structured non-monotone variational inequalities.

Reference graph

Works this paper leans on

71 extracted references · 71 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

Alongi, J. M. and Nelson, G. S.Recurrence and Topology, volume 85 ofGraduate Studies in Mathematics. American Mathematical Society, 2007

work page 2007
[2]

AdaGrad avoids saddle points

Antonakopoulos, K., Mertikopoulos, P., Piliouras, G., and Wang, X. AdaGrad avoids saddle points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022
[3]

Dynamics of stochastic approximation algorithms

Benaïm, M. Dynamics of stochastic approximation algorithms. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.),Séminaire de Probabilités XXXIII, volume 1709 ofLecture Notes in Mathematics, pp. 1–68. Springer Berlin Heidelberg, 1999

work page 1999
[4]

and Hirsch, M

Benaïm, M. and Hirsch, M. W. Dynamics of Morse-Smale urn processes.Ergodic Theory and Dynamical Systems, 15(6):1005–1030, December 1995

work page 1995
[5]

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors.SIAM Journal on Optimization, 10(3):627–642, 2000

work page 2000
[6]

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Blanc, G., Gupta, N., Valiant, G., and Valiant, P. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. InCOLT ’20: Proceedings of the 33rd Annual Conference on Learning Theory, 2020

work page 2020
[7]

D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series

Brown, L. D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series. Institute of Mathematical Statistics, 1986

work page 1986
[8]

The loss surfaces of multilayer networks

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. InAISTATS ’15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015

work page 2015
[9]

An introduction too-minimal geometry, 1999

Coste, M. An introduction too-minimal geometry, 1999

work page 1999
[10]

Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization

Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., andBengio, Y. Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization. InNIPS ’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014

work page 2014
[11]

D.Large Deviations for Markov Chains

de Acosta, A. D.Large Deviations for Markov Chains. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, UK, 2022

work page 2022
[12]

and Zeitouni, O.Large Deviations Techniques and Applications

Dembo, A. and Zeitouni, O.Large Deviations Techniques and Applications. Springer-Verlag, Berlin, 1998

work page 1998
[13]

Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

work page 2020
[14]

Springer Series in Operations Research and Financial Engineering

Douc, R., Moulines, E., Priouret, P., and Soulier, P.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

work page 2018
[15]

Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

Dupuis, P. Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

work page 1988
[16]

and Kushner, H

Dupuis, P. and Kushner, H. J. Stochastic approximations via large deviations: Asymptotic properties. SIAM Journal on Control and Optimization, 23(5):675–696, September 1985

work page 1985
[17]

Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

Eberle, A. Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

work page 2016
[18]

A., Mackey, L

Erdogdu, M. A., Mackey, L. W., and Shamir, O. Global non-convex optimization with discretized diffusions. InNeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018

work page 2018
[19]

Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

Feng, Y., Gao, T., Li, L., Liu, J.-G., and Lu, Y. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

work page 2020
[20]

Freidlin, M. I. and Wentzell, A. D.Random Perturbations of Dynamical Systems. Springer, 1 edition, 1998

work page 1998
[21]

Escaping from saddle points – Online stochastic gradient for tensor decomposition

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points – Online stochastic gradient for tensor decomposition. InCOLT ’15: Proceedings of the 28th Annual Conference on Learning Theory, 2015

work page 2015
[22]

Gulinsky, O. V. and Veretennikov, A. Y.Large Deviations for Discrete-Time Processes with Averaging. De Gruyter, 1993

work page 1993
[23]

The heavy-tail phenomenon in SGD

Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. The heavy-tail phenomenon in SGD. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021. THE LONG-RUN DISTRIBUTION OF STOCHASTIC GRADIENT DESCENT 69

work page 2021
[24]

and Lasserre, J.-B.Markov Chains and Invariant Probabilities

Hernández-Lerma, O. and Lasserre, J.-B.Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003

work page 2003
[25]

and Mahoney, M

Hodgkinson, L. and Mahoney, M. Multiplicative noise and heavy tails in stochastic optimization. In ICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021
[26]

The limits of min-max optimization algorithms: Convergence to spurious non-critical sets

Hsieh, Y.-P., Mertikopoulos, P., and Cevher, V. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021
[27]

R., Krause, A., and Mertikopoulos, P

Hsieh, Y.-P., Karimi, M. R., Krause, A., and Mertikopoulos, P. Riemannian stochastic optimization methods avoid strict saddle points. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

work page 2023
[28]

J., Li, L., and Liu, J.-G

Hu, W., Li, C. J., Li, L., and Liu, J.-G. On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications, 4(1):3–32, 2019

work page 2019
[29]

D., Tikhomirov, V

Ioffe, A. D., Tikhomirov, V. M., and Makowski, K.Theory of Extremal Problems. North Holland, Amsterdam, NL, 1979

work page 1979
[30]

Three Factors Influencing Minima in SGD

Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD.https://arxiv.org/abs/1711.04623, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

M., and Jordan, M

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017
[32]

Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling

Kallenberg, O. Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 2021

work page 2021
[33]

Deep learning without poor local minima

Kawaguchi, K. Deep learning without poor local minima. In NIPS ’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016

work page 2016
[34]

and Wolfowitz, J

Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.The Annals of Mathematical Statistics, 23(3):462–466, 1952

work page 1952
[35]

Random Perturbations of Dynamical Systems

Kifer, Y. Random Perturbations of Dynamical Systems. Birkhäuser, Boston, MA, 1988

work page 1988
[36]

A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

Kifer, Y. A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

work page 1990
[37]

First-order and Stochastic Optimization Methods for Machine Learning

Lan, G. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020

work page 2020
[38]

Landau, L. D. and Lifshitz, E. M. Statistical physics. InCourse of Theoretical Physics, volume 5. Pergamon Press, Oxford, 1976

work page 1976
[39]

and Wang, Y

Li, L. and Wang, Y. A sharp uniform-in-time error estimate for stochastic gradient Langevin dynamics. https://arxiv.org/abs/2207.09304, 2022

work page arXiv 2022
[40]

and Wang, Y

Li, L. and Wang, Y. On uniform-in-time diffusion approximation for stochastic gradient descent. https://arxiv.org/abs/2207.04922, 2022

work page arXiv 2022
[41]

Stochastic modified equations and adaptive stochastic gradient algorithms

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017
[42]

Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

work page 2019
[43]

Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate

Li, Z., Lyu, K., and Arora, S. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020
[44]

What happens after SGD reaches zero loss? –A mathematical framework

Li, Z., Wang, T., and Arora, S. What happens after SGD reaches zero loss? –A mathematical framework. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

work page 2021
[45]

Fast mixing of stochastic gradient descent with normalization and weight decay

Li, Z., Wang, T., and Yu, D. Fast mixing of stochastic gradient descent with normalization and weight decay. In NeurIPS ’22: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022

work page 2022
[46]

Bad global minima exist and SGD can reach them

Liu, S., Papailiopoulos, D., and Achlioptas, D. Bad global minima exist and SGD can reach them. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020
[47]

Analysis of recursive stochastic algorithms.IEEE Trans

Ljung, L. Analysis of recursive stochastic algorithms.IEEE Trans. Autom. Control, 22(4):551–575, August 1977

work page 1977
[48]

Lu, Y., Balasubramanian, K., Volgushev, S., and Erdogdu, M. A. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. InNeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. 70 W. AZIZIAN, F. IUTZELER, J. MALICK, AND P. MERTIKOPOULOS

work page 2021
[49]

and Mertikopoulos, P

Lytras, I. and Mertikopoulos, P. Tamed Langevin sampling under weaker conditions.https://arxiv.org/ abs/2405.17693, 2024

work page arXiv 2024
[50]

B., Mijatović, A., and Szpruch, Ł

Majka, M. B., Mijatović, A., and Szpruch, Ł. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534–1581, 2020

work page 2020
[51]

and Staudigl, M

Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018

work page 2018
[52]

and Staudigl, M

Mertikopoulos, P. and Staudigl, M. Stochastic mirror descent dynamics and their convergence in monotone variational inequalities.Journal of Optimization Theory and Applications, 179(3):838–867, December 2018

work page 2018
[53]

On the almost sure convergence of stochastic gradient descent in non-convex problems

Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020
[54]

A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

work page 2024
[55]

and Urbani, P

Mignacco, F. and Urbani, P. The effective noise of stochastic gradient descent.Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, 2022

work page 2022
[56]

Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification

Mignacco, F., Krzakala, F., Urbani, P., and Zdeborová, L. Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020
[57]

Power-law escape rate of SGD

Mori, T., Ziyin, L., Liu, K., and Ueda, M. Power-law escape rate of SGD. InICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022
[58]

L., Durmus, A., and Şimşekli, U

Pavasovic, K. L., Durmus, A., and Şimşekli, U. Approximate heavy tails in offline (multi-pass) stochastic gradient descent. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

work page 2023
[59]

Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

Pemantle, R. Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

work page 1990
[60]

Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. InCOLT ’17: Proceedings of the 30th Annual Conference on Learning Theory, 2017

work page 2017
[61]

and Hütter, J.-C

Rigollet, P. and Hütter, J.-C. High-dimensional statistics.https://arxiv.org/abs/2310.19244, 2023

work page arXiv 2023
[62]

and Monro, S

Robbins, H. and Monro, S. A stochastic approximation method.Annals of Mathematical Statistics, 22: 400–407, 1951

work page 1951
[63]

and Miller, C

van den Dries, L. and Miller, C. Geometric categories ando-minimal structures. Duke Mathematical Journal, 84(2), August 1996

work page 1996
[64]

Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

Veiga, R., Remizova, A., and Macris, N. Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

work page arXiv 2024
[65]

J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48

Wainwright, M. J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019

work page 2019
[66]

and Wang, Z

Wang, Y. and Wang, Z. Three-stage evolution and fast equilibrium for SGD with non-degerate critical points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022
[67]

Stochastic gradient descent with noise of machine learning type

Wojtowytsch, S. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023

work page 2023
[68]

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

work page 2021
[69]

Yang, J., Hu, W., and Li, C. J. On the fast convergence of random perturbations of the gradient flow. Asymptotic Analysis, 122(3-4):371–393, 2021

work page 2021
[70]

P., and Glynn, P

Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S. P., and Glynn, P. W. On the convergence of mirror descent beyond stochastic convex programming.SIAM Journal on Optimization, 30(1):687–716, 2020

work page 2020
[71]

Law of balance and stationary distribution of stochastic gradient descent

Ziyin, L., Li, H., and Ueda, M. Law of balance and stationary distribution of stochastic gradient descent. https://arxiv.org/abs/2308.06671, 2023

work page arXiv 2023

[1] [1]

Alongi, J. M. and Nelson, G. S.Recurrence and Topology, volume 85 ofGraduate Studies in Mathematics. American Mathematical Society, 2007

work page 2007

[2] [2]

AdaGrad avoids saddle points

Antonakopoulos, K., Mertikopoulos, P., Piliouras, G., and Wang, X. AdaGrad avoids saddle points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022

[3] [3]

Dynamics of stochastic approximation algorithms

Benaïm, M. Dynamics of stochastic approximation algorithms. In Azéma, J., Émery, M., Ledoux, M., and Yor, M. (eds.),Séminaire de Probabilités XXXIII, volume 1709 ofLecture Notes in Mathematics, pp. 1–68. Springer Berlin Heidelberg, 1999

work page 1999

[4] [4]

and Hirsch, M

Benaïm, M. and Hirsch, M. W. Dynamics of Morse-Smale urn processes.Ergodic Theory and Dynamical Systems, 15(6):1005–1030, December 1995

work page 1995

[5] [5]

Bertsekas, D. P. and Tsitsiklis, J. N. Gradient convergence in gradient methods with errors.SIAM Journal on Optimization, 10(3):627–642, 2000

work page 2000

[6] [6]

Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process

Blanc, G., Gupta, N., Valiant, G., and Valiant, P. Implicit regularization for deep neural networks driven by an Ornstein-Uhlenbeck like process. InCOLT ’20: Proceedings of the 33rd Annual Conference on Learning Theory, 2020

work page 2020

[7] [7]

D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series

Brown, L. D.Fundamentals of Statistical Exponential Families with Applications in Statistical Decision Theory, volume 9 ofLecture Notes-Monograph Series. Institute of Mathematical Statistics, 1986

work page 1986

[8] [8]

The loss surfaces of multilayer networks

Choromanska, A., Henaff, M., Mathieu, M., Ben Arous, G., and LeCun, Y. The loss surfaces of multilayer networks. InAISTATS ’15: Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, 2015

work page 2015

[9] [9]

An introduction too-minimal geometry, 1999

Coste, M. An introduction too-minimal geometry, 1999

work page 1999

[10] [10]

Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization

Dauphin, Y.N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., andBengio, Y. Identifyingandattacking the saddle point problem in high-dimensional non-convex optimization. InNIPS ’14: Proceedings of the 27th International Conference on Neural Information Processing Systems, 2014

work page 2014

[11] [11]

D.Large Deviations for Markov Chains

de Acosta, A. D.Large Deviations for Markov Chains. Cambridge Tracts in Mathematics. Cambridge University Press, Cambridge, UK, 2022

work page 2022

[12] [12]

and Zeitouni, O.Large Deviations Techniques and Applications

Dembo, A. and Zeitouni, O.Large Deviations Techniques and Applications. Springer-Verlag, Berlin, 1998

work page 1998

[13] [13]

Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

Dieuleveut, A., Durmus, A., and Bach, F. Bridging the gap between constant step size stochastic gradient descent and Markov chains.The Annals of Statistics, 48(3):1348–1382, June 2020

work page 2020

[14] [14]

Springer Series in Operations Research and Financial Engineering

Douc, R., Moulines, E., Priouret, P., and Soulier, P.Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer, 2018

work page 2018

[15] [15]

Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

Dupuis, P. Large deviations analysis of some recursive algorithms with state dependent noise.The Annals of Probability, 16(4):1509–1536, October 1988

work page 1988

[16] [16]

and Kushner, H

Dupuis, P. and Kushner, H. J. Stochastic approximations via large deviations: Asymptotic properties. SIAM Journal on Control and Optimization, 23(5):675–696, September 1985

work page 1985

[17] [17]

Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

Eberle, A. Reflection couplings and contraction rates for diffusions.Probability Theory and Related Fields, 166(3-4):851–886, December 2016

work page 2016

[18] [18]

A., Mackey, L

Erdogdu, M. A., Mackey, L. W., and Shamir, O. Global non-convex optimization with discretized diffusions. InNeurIPS ’18: Proceedings of the 32nd International Conference of Neural Information Processing Systems, 2018

work page 2018

[19] [19]

Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

Feng, Y., Gao, T., Li, L., Liu, J.-G., and Lu, Y. Uniform-in-time weak error analysis for stochastic gradient descent algorithms via diffusion approximation.Communications in Mathematical Sciences, 18 (1):163–188, 2020

work page 2020

[20] [20]

Freidlin, M. I. and Wentzell, A. D.Random Perturbations of Dynamical Systems. Springer, 1 edition, 1998

work page 1998

[21] [21]

Escaping from saddle points – Online stochastic gradient for tensor decomposition

Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points – Online stochastic gradient for tensor decomposition. InCOLT ’15: Proceedings of the 28th Annual Conference on Learning Theory, 2015

work page 2015

[22] [22]

Gulinsky, O. V. and Veretennikov, A. Y.Large Deviations for Discrete-Time Processes with Averaging. De Gruyter, 1993

work page 1993

[23] [23]

The heavy-tail phenomenon in SGD

Gürbüzbalaban, M., Şimşekli, U., and Zhu, L. The heavy-tail phenomenon in SGD. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021. THE LONG-RUN DISTRIBUTION OF STOCHASTIC GRADIENT DESCENT 69

work page 2021

[24] [24]

and Lasserre, J.-B.Markov Chains and Invariant Probabilities

Hernández-Lerma, O. and Lasserre, J.-B.Markov Chains and Invariant Probabilities. Birkhäuser, Basel, 2003

work page 2003

[25] [25]

and Mahoney, M

Hodgkinson, L. and Mahoney, M. Multiplicative noise and heavy tails in stochastic optimization. In ICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021

[26] [26]

The limits of min-max optimization algorithms: Convergence to spurious non-critical sets

Hsieh, Y.-P., Mertikopoulos, P., and Cevher, V. The limits of min-max optimization algorithms: Convergence to spurious non-critical sets. InICML ’21: Proceedings of the 38th International Conference on Machine Learning, 2021

work page 2021

[27] [27]

R., Krause, A., and Mertikopoulos, P

Hsieh, Y.-P., Karimi, M. R., Krause, A., and Mertikopoulos, P. Riemannian stochastic optimization methods avoid strict saddle points. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

work page 2023

[28] [28]

J., Li, L., and Liu, J.-G

Hu, W., Li, C. J., Li, L., and Liu, J.-G. On the diffusion approximation of nonconvex stochastic gradient descent. Annals of Mathematical Sciences and Applications, 4(1):3–32, 2019

work page 2019

[29] [29]

D., Tikhomirov, V

Ioffe, A. D., Tikhomirov, V. M., and Makowski, K.Theory of Extremal Problems. North Holland, Amsterdam, NL, 1979

work page 1979

[30] [30]

Three Factors Influencing Minima in SGD

Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., and Storkey, A. Three factors influencing minima in SGD.https://arxiv.org/abs/1711.04623, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

M., and Jordan, M

Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017

[32] [32]

Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling

Kallenberg, O. Foundations of Modern Probability, volume 99 ofProbability Theory and Stochastic Modelling. Springer, 2021

work page 2021

[33] [33]

Deep learning without poor local minima

Kawaguchi, K. Deep learning without poor local minima. In NIPS ’16: Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016

work page 2016

[34] [34]

and Wolfowitz, J

Kiefer, J. and Wolfowitz, J. Stochastic estimation of the maximum of a regression function.The Annals of Mathematical Statistics, 23(3):462–466, 1952

work page 1952

[35] [35]

Random Perturbations of Dynamical Systems

Kifer, Y. Random Perturbations of Dynamical Systems. Birkhäuser, Boston, MA, 1988

work page 1988

[36] [36]

A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

Kifer, Y. A discrete-time version of the Wentzell-Freidlin theory.Annals of Probability, 18(4):1676–1692, October 1990

work page 1990

[37] [37]

First-order and Stochastic Optimization Methods for Machine Learning

Lan, G. First-order and Stochastic Optimization Methods for Machine Learning. Springer, 2020

work page 2020

[38] [38]

Landau, L. D. and Lifshitz, E. M. Statistical physics. InCourse of Theoretical Physics, volume 5. Pergamon Press, Oxford, 1976

work page 1976

[39] [39]

and Wang, Y

Li, L. and Wang, Y. A sharp uniform-in-time error estimate for stochastic gradient Langevin dynamics. https://arxiv.org/abs/2207.09304, 2022

work page arXiv 2022

[40] [40]

and Wang, Y

Li, L. and Wang, Y. On uniform-in-time diffusion approximation for stochastic gradient descent. https://arxiv.org/abs/2207.04922, 2022

work page arXiv 2022

[41] [41]

Stochastic modified equations and adaptive stochastic gradient algorithms

Li, Q., Tai, C., and E, W. Stochastic modified equations and adaptive stochastic gradient algorithms. In ICML ’17: Proceedings of the 34th International Conference on Machine Learning, 2017

work page 2017

[42] [42]

Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

Li, Q., Tai, C., and E, W. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations.Journal of Machine Learning Research, 20(40):1–47, 2019

work page 2019

[43] [43]

Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate

Li, Z., Lyu, K., and Arora, S. Reconciling modern deep learning with traditional optimization analyses: The intrinsic learning rate. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020

[44] [44]

What happens after SGD reaches zero loss? –A mathematical framework

Li, Z., Wang, T., and Arora, S. What happens after SGD reaches zero loss? –A mathematical framework. In ICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

work page 2021

[45] [45]

Fast mixing of stochastic gradient descent with normalization and weight decay

Li, Z., Wang, T., and Yu, D. Fast mixing of stochastic gradient descent with normalization and weight decay. In NeurIPS ’22: Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022

work page 2022

[46] [46]

Bad global minima exist and SGD can reach them

Liu, S., Papailiopoulos, D., and Achlioptas, D. Bad global minima exist and SGD can reach them. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020

[47] [47]

Analysis of recursive stochastic algorithms.IEEE Trans

Ljung, L. Analysis of recursive stochastic algorithms.IEEE Trans. Autom. Control, 22(4):551–575, August 1977

work page 1977

[48] [48]

Lu, Y., Balasubramanian, K., Volgushev, S., and Erdogdu, M. A. An analysis of constant step size SGD in the non-convex regime: Asymptotic normality and bias. InNeurIPS ’21: Proceedings of the 35th International Conference on Neural Information Processing Systems, 2021. 70 W. AZIZIAN, F. IUTZELER, J. MALICK, AND P. MERTIKOPOULOS

work page 2021

[49] [49]

and Mertikopoulos, P

Lytras, I. and Mertikopoulos, P. Tamed Langevin sampling under weaker conditions.https://arxiv.org/ abs/2405.17693, 2024

work page arXiv 2024

[50] [50]

B., Mijatović, A., and Szpruch, Ł

Majka, M. B., Mijatović, A., and Szpruch, Ł. Nonasymptotic bounds for sampling algorithms without log-concavity. The Annals of Applied Probability, 30(4):1534–1581, 2020

work page 2020

[51] [51]

and Staudigl, M

Mertikopoulos, P. and Staudigl, M. On the convergence of gradient-like flows with noisy gradient input. SIAM Journal on Optimization, 28(1):163–197, January 2018

work page 2018

[52] [52]

and Staudigl, M

Mertikopoulos, P. and Staudigl, M. Stochastic mirror descent dynamics and their convergence in monotone variational inequalities.Journal of Optimization Theory and Applications, 179(3):838–867, December 2018

work page 2018

[53] [53]

On the almost sure convergence of stochastic gradient descent in non-convex problems

Mertikopoulos, P., Hallak, N., Kavis, A., and Cevher, V. On the almost sure convergence of stochastic gradient descent in non-convex problems. In NeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020

[54] [54]

A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

Mertikopoulos, P., Hsieh, Y.-P., and Cevher, V. A unified stochastic approximation framework for learning in games.Mathematical Programming, 203:559–609, January 2024

work page 2024

[55] [55]

and Urbani, P

Mignacco, F. and Urbani, P. The effective noise of stochastic gradient descent.Journal of Statistical Mechanics: Theory and Experiment, 2022(8):083405, 2022

work page 2022

[56] [56]

Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification

Mignacco, F., Krzakala, F., Urbani, P., and Zdeborová, L. Dynamical mean-field theory for stochastic gradient descentin Gaussianmixture classification. InNeurIPS ’20: Proceedings of the 34th International Conference on Neural Information Processing Systems, 2020

work page 2020

[57] [57]

Power-law escape rate of SGD

Mori, T., Ziyin, L., Liu, K., and Ueda, M. Power-law escape rate of SGD. InICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022

[58] [58]

L., Durmus, A., and Şimşekli, U

Pavasovic, K. L., Durmus, A., and Şimşekli, U. Approximate heavy tails in offline (multi-pass) stochastic gradient descent. InNeurIPS ’23: Proceedings of the 37th International Conference on Neural Information Processing Systems, 2023

work page 2023

[59] [59]

Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

Pemantle, R. Nonconvergence to unstable points in urn models and stochastic aproximations.Annals of Probability, 18(2):698–712, April 1990

work page 1990

[60] [60]

Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis

Raginsky, M., Rakhlin, A., and Telgarsky, M. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. InCOLT ’17: Proceedings of the 30th Annual Conference on Learning Theory, 2017

work page 2017

[61] [61]

and Hütter, J.-C

Rigollet, P. and Hütter, J.-C. High-dimensional statistics.https://arxiv.org/abs/2310.19244, 2023

work page arXiv 2023

[62] [62]

and Monro, S

Robbins, H. and Monro, S. A stochastic approximation method.Annals of Mathematical Statistics, 22: 400–407, 1951

work page 1951

[63] [63]

and Miller, C

van den Dries, L. and Miller, C. Geometric categories ando-minimal structures. Duke Mathematical Journal, 84(2), August 1996

work page 1996

[64] [64]

Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

Veiga, R., Remizova, A., and Macris, N. Stochastic gradient flow dynamics of test risk and its exact solution for weak features.https://arxiv.org/abs/2402.07626, 2024

work page arXiv 2024

[65] [65]

J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48

Wainwright, M. J.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019

work page 2019

[66] [66]

and Wang, Z

Wang, Y. and Wang, Z. Three-stage evolution and fast equilibrium for SGD with non-degerate critical points. In ICML ’22: Proceedings of the 39th International Conference on Machine Learning, 2022

work page 2022

[67] [67]

Stochastic gradient descent with noise of machine learning type

Wojtowytsch, S. Stochastic gradient descent with noise of machine learning type. Part I: Discrete time analysis. Journal of Nonlinear Science, 33(3):45, 2023

work page 2023

[68] [68]

A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima

Xie, Z., Sato, I., and Sugiyama, M. A diffusion theory for deep learning dynamics: Stochastic gradient descent exponentially favors flat minima. InICLR ’21: Proceedings of the 2021 International Conference on Learning Representations, 2021

work page 2021

[69] [69]

Yang, J., Hu, W., and Li, C. J. On the fast convergence of random perturbations of the gradient flow. Asymptotic Analysis, 122(3-4):371–393, 2021

work page 2021

[70] [70]

P., and Glynn, P

Zhou, Z., Mertikopoulos, P., Bambos, N., Boyd, S. P., and Glynn, P. W. On the convergence of mirror descent beyond stochastic convex programming.SIAM Journal on Optimization, 30(1):687–716, 2020

work page 2020

[71] [71]

Law of balance and stationary distribution of stochastic gradient descent

Ziyin, L., Li, H., and Ueda, M. Law of balance and stationary distribution of stochastic gradient descent. https://arxiv.org/abs/2308.06671, 2023

work page arXiv 2023