First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3
The pith
Explicit step-size bounds make discrete SGD first-exit times match those of its Levy-driven continuous limit under alpha-stable noise.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the modeling assumption that gradient noise follows an alpha-stable law, SGD is a discretization of an SDE driven by Levy motion. The paper establishes explicit step-size conditions under which the first-exit-time statistics of the discrete system remain close to those of the continuous system; the behaviors coincide for small enough steps and the discrepancy is bounded by terms that depend on the algorithm parameters and the problem data.
What carries the argument
First-exit-time comparison between discrete SGD iterates and the continuous Levy-driven SDE under alpha-stable gradient noise.
If this is right
- When the step-size satisfies the explicit bounds, the discrete and continuous systems exhibit similar metastability.
- The error between discrete and continuous exit times is controlled by the step-size, the stability index alpha, and other problem parameters.
- The similarity result applies directly to the alpha-stable noise model used for the gradient.
- Simulations confirm the predicted closeness on both synthetic problems and neural-network training.
Where Pith is reading between the lines
- The derived bounds could be used in practice to choose learning rates that justify applying continuous-time escape-time formulas to discrete SGD runs.
- The same step-size analysis might be testable on other discretization schemes for Levy-driven equations beyond standard SGD.
- The framework raises whether analogous conditions exist for other dynamical properties such as convergence speed under the same noise model.
Load-bearing premise
Gradient noise can be modeled by alpha-stable distributions so that SGD becomes a discretization of a Levy-driven stochastic differential equation.
What would settle it
Numerical computation of first-exit times from a basin showing that the discrete SGD times and the continuous Levy SDE times differ by more than the paper's error bound whenever the step-size exceeds the derived threshold.
Figures
read the original abstract
Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives explicit conditions on the discretization step-size such that the first-exit-time (metastability) behavior of discrete SGD driven by α-stable gradient noise approximates the corresponding behavior of the limiting Lévy-driven SDE; it further identifies the explicit dependence of the approximation error on algorithm and problem parameters and illustrates the results via simulations on a synthetic model and neural networks.
Significance. If the derivations hold, the work supplies the missing theoretical link between discrete SGD and continuous-time metastability analyses under heavy-tailed noise, thereby strengthening the justification for using SDE-based arguments to explain SGD's preference for wide minima. The explicit step-size conditions and error bounds constitute a concrete advance over prior informal appeals to the continuous-time limit.
minor comments (3)
- §1, paragraph following Eq. (3): the statement that the α-stable model is 'empirically demonstrated' would benefit from a brief citation to the specific empirical studies referenced, rather than leaving the claim at the level of the abstract.
- The notation for the Lévy measure and the stability parameter α is introduced without an explicit reminder of the domain (0<α≤2) in the main text; adding this once in §2 would improve readability for readers outside probability theory.
- Figure captions for the neural-network experiments should state the precise architecture, dataset, and number of independent runs so that the simulation results can be reproduced from the text alone.
Simulated Author's Rebuttal
We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. The referee's summary accurately reflects the paper's focus on deriving explicit step-size conditions for the first-exit-time behavior of discrete SGD to approximate its Lévy-driven SDE limit under α-stable noise.
Circularity Check
No significant circularity in the derivation chain
full rationale
The paper derives explicit step-size conditions for the metastability (first-exit-time) behavior of discrete SGD to approximate its continuous-time Lévy-driven SDE limit under α-stable gradient noise. The modeling premise is presented as an empirical starting point drawn from prior observations, after which the analysis proceeds by establishing discretization error bounds that depend explicitly on algorithm and problem parameters. No load-bearing step reduces a claimed prediction or result to a fitted input, self-definition, or self-citation chain within the paper's own equations; the central claim remains an independent extension of continuous-time metastability results rather than a renaming or tautological restatement of its inputs.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Gradient noise admits a non-Gaussian, heavy-tailed behavior that can be modeled by α-stable distributions
- domain assumption SGD can be viewed as a discretization of a stochastic differential equation driven by a Lévy motion
Lean theorems connected to this paper
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit... under α-stable distributions... Lévy motion
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
KL(ˆµt,µt) = 1/2ε²σ² E[∫₀ᵗ ∥b(Ŵ) + ∇f(Ŵ(s))∥² ds]
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
Adaptive Federated Optimization
Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.
Reference graph
Works this paper leans on
-
[1]
Stochastic gradient descent tricks
Léon Bottou. Stochastic gradient descent tricks. InNeural networks: Tricks of the trade, pages 421–436. Springer, 2012
work page 2012
-
[2]
Large-scale machine learning with stochastic gradient descent
Léon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMP- STAT’2010, pages 177–186. Springer, 2010
work page 2010
-
[3]
P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. InInternational Conference on Learning Representations, 2018
work page 2018
-
[4]
Entropy-SGD: Biasing Gradient Descent Into Wide Valleys
Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[5]
Deep learning.nature, 521(7553):436, 2015
Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.nature, 521(7553):436, 2015
work page 2015
-
[6]
U. “imşekli, L. Sagun, and Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. InICML, 2019
work page 2019
-
[7]
Three Factors Influencing Minima in SGD
S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD.arXiv preprint arXiv:1711.04623, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[8]
Flat minima.Neural Computation, 9(1):1–42, 1997
Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997
work page 1997
-
[9]
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima
Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.arXiv preprint arXiv:1609.04836, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
- [10]
-
[11]
Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017
work page 2017
-
[12]
W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[13]
Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects.arXiv preprint arXiv:1803.00195, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[14]
Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization
Thanh Huy Nguyen, Umut “imşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. InInternational Conference on Machine Learning, 2019
work page 2019
-
[15]
Bernt Karsten Øksendal and Agnes Sulem.Applied stochastic control of jump diffusions, volume 498. Springer, 2005. 12
work page 2005
-
[16]
Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006
work page 2006
-
[17]
P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010
work page 2010
-
[18]
Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010
work page 2010
-
[19]
S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. InInternational Conference on Learning Representations, 2019
work page 2019
-
[20]
B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. InProceedings of the 2018 Conference on Learning Theory, 2018
work page 2018
-
[21]
J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015
work page 2015
-
[22]
J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976
work page 1976
-
[23]
Peter Tankov.Financial modelling with jump processes. Chapman and Hall/CRC, 2003
work page 2003
-
[24]
Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004
work page 2004
-
[25]
Ilya Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011
work page 2011
-
[26]
Kramers' law: Validity, derivations and generalisations
Nils Berglund. Kramers’ law: Validity, derivations and generalisations.arXiv preprint arXiv:1106.5799, 2011
work page internal anchor Pith review Pith/arXiv arXiv 2011
-
[27]
Toralf Burghoff and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015
work page 2015
-
[28]
Enrico Priola et al. Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012
work page 2012
-
[29]
On weak uniqueness and distributional properties of a solution to an sde withα-stable noise
Alexei M Kulik. On weak uniqueness and distributional properties of a solution to an sde withα-stable noise. Stochastic Processes and their Applications, 129(2):473–506, 2019
work page 2019
-
[30]
Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative L\'{e}vy Noises via Coupling
Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling.arXiv preprint arXiv:1801.05936, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[31]
M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017. 13
work page 2017
-
[32]
Global convergence of Langevin dynamics based algorithms for nonconvex optimization
Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. InAdvances in Neural Information Processing Systems, pages 3125–3136, 2018
work page 2018
-
[33]
M. A. Erdogdu, L. Mackey, and O. Shamir. Global Non-convex Optimization with Discretized Diffusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018
work page 2018
-
[34]
Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018
- [35]
-
[36]
Convergence Analysis of Two-layer Neural Networks with ReLU Activation
Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017
work page 2017
-
[37]
Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond
Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[38]
The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size
Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.arXiv preprint arXiv:1811.07062, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[39]
R Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018
work page 2018
-
[40]
A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017
work page 2017
-
[41]
Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven sde’s with Hölder continuous coefficients.Journal of Functional Analysis, 264(8):1757–1778, 2013
work page 2013
-
[42]
Torgny Lindvall.Lectures on the coupling method. Courier Corporation, 2002
work page 2002
-
[43]
Weak reflection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015
Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reflection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015
work page 2015
-
[44]
Ergodicity of stochastic differential equations with jumps and singular coefficients
Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. arXiv preprint arXiv:1705.07402, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[45]
Moments and Absolute Moments of the Normal Distribution
Andreas Winkelbauer. Moments and absolute moments of the normal distribution.arXiv preprint arXiv:1209.4340, 2012. 14 7 Appendix 7.1 Proof of Theorem 2 Proof. Note that (W 1,...,W K)∈A is equivalent to¯τ0,a(ε)>K . Hence, from Lemma 4, the remaining task is to upper-boundP[(W (η),...,W (Kη))∈A]: P[(W (η),...,W (Kη))∈A]≤P[(W (η),...,W (Kη))∈A∩B] + P[(W (η),...
work page internal anchor Pith review Pith/arXiv arXiv 2012
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.