pith. sign in

arxiv: 1906.09069 · v1 · pith:YCVPJBAAnew · submitted 2019-06-21 · 📊 stat.ML · cs.LG

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords stochastic gradient descentheavy-tailed noisealpha-stable distributionsmetastabilityfirst exit timeLevy motioncontinuous-time approximationstep-size conditions
0
0 comments X

The pith

Explicit step-size bounds make discrete SGD first-exit times match those of its Levy-driven continuous limit under alpha-stable noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit conditions on the step-size so that the metastability behavior of discrete SGD, modeled with heavy-tailed alpha-stable gradient noise, stays close to the behavior of the corresponding continuous-time SDE driven by Levy motion. A sympathetic reader would care because this supplies a concrete criterion for when practical discrete SGD inherits the escape-time properties analyzed in the continuous setting, such as differential exit rates from narrow versus wide minima. The work shows that the two systems are similar for sufficiently small step-sizes and gives an explicit dependence of the approximation error on the step-size, the stability index, and other problem parameters. The results are illustrated by simulations on a synthetic model and on neural networks.

Core claim

Under the modeling assumption that gradient noise follows an alpha-stable law, SGD is a discretization of an SDE driven by Levy motion. The paper establishes explicit step-size conditions under which the first-exit-time statistics of the discrete system remain close to those of the continuous system; the behaviors coincide for small enough steps and the discrepancy is bounded by terms that depend on the algorithm parameters and the problem data.

What carries the argument

First-exit-time comparison between discrete SGD iterates and the continuous Levy-driven SDE under alpha-stable gradient noise.

If this is right

  • When the step-size satisfies the explicit bounds, the discrete and continuous systems exhibit similar metastability.
  • The error between discrete and continuous exit times is controlled by the step-size, the stability index alpha, and other problem parameters.
  • The similarity result applies directly to the alpha-stable noise model used for the gradient.
  • Simulations confirm the predicted closeness on both synthetic problems and neural-network training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The derived bounds could be used in practice to choose learning rates that justify applying continuous-time escape-time formulas to discrete SGD runs.
  • The same step-size analysis might be testable on other discretization schemes for Levy-driven equations beyond standard SGD.
  • The framework raises whether analogous conditions exist for other dynamical properties such as convergence speed under the same noise model.

Load-bearing premise

Gradient noise can be modeled by alpha-stable distributions so that SGD becomes a discretization of a Levy-driven stochastic differential equation.

What would settle it

Numerical computation of first-exit times from a basin showing that the discrete SGD times and the continuous Levy SDE times differ by more than the paper's error bound whenever the step-size exceeds the derived threshold.

Figures

Figures reproduced from arXiv: 1906.09069 by Ga\"el Richard, Mert G\"urb\"uzbalaban, Thanh Huy Nguyen, Umut \c{S}im\c{s}ekli.

Figure 1
Figure 1. Figure 1: Illustration of SαS (left), L α t (middle), wide-narrow minima (right). [3, 7, 10, 11, 12, 13]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD: dW(t) = −∇f(W(t))dt + √ ησdB(t) (4) where B(t) is the standard Brownian motion and σ is the noise variance and η is the step-size. The Gaussianity of the gradient … view at source ↗
Figure 2
Figure 2. Figure 2: Results of the synthetic experiments. 5 Numerical Illustration To illustrate our results, we first conduct experiments on a synthetic problem, where the cost function is set to f(x) = 1 2 kxk 2 . This corresponds to an Ornstein-Uhlenbeck-type process, which is commonly considered in metastability analyses [21]. This process locally satisfies the conditions A1-A5. Since we cannot directly simulate the conti… view at source ↗
Figure 3
Figure 3. Figure 3: Results of the neural network experiments. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗
read the original abstract

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript derives explicit conditions on the discretization step-size such that the first-exit-time (metastability) behavior of discrete SGD driven by α-stable gradient noise approximates the corresponding behavior of the limiting Lévy-driven SDE; it further identifies the explicit dependence of the approximation error on algorithm and problem parameters and illustrates the results via simulations on a synthetic model and neural networks.

Significance. If the derivations hold, the work supplies the missing theoretical link between discrete SGD and continuous-time metastability analyses under heavy-tailed noise, thereby strengthening the justification for using SDE-based arguments to explain SGD's preference for wide minima. The explicit step-size conditions and error bounds constitute a concrete advance over prior informal appeals to the continuous-time limit.

minor comments (3)
  1. §1, paragraph following Eq. (3): the statement that the α-stable model is 'empirically demonstrated' would benefit from a brief citation to the specific empirical studies referenced, rather than leaving the claim at the level of the abstract.
  2. The notation for the Lévy measure and the stability parameter α is introduced without an explicit reminder of the domain (0<α≤2) in the main text; adding this once in §2 would improve readability for readers outside probability theory.
  3. Figure captions for the neural-network experiments should state the precise architecture, dataset, and number of independent runs so that the simulation results can be reproduced from the text alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. The referee's summary accurately reflects the paper's focus on deriving explicit step-size conditions for the first-exit-time behavior of discrete SGD to approximate its Lévy-driven SDE limit under α-stable noise.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper derives explicit step-size conditions for the metastability (first-exit-time) behavior of discrete SGD to approximate its continuous-time Lévy-driven SDE limit under α-stable gradient noise. The modeling premise is presented as an empirical starting point drawn from prior observations, after which the analysis proceeds by establishing discretization error bounds that depend explicitly on algorithm and problem parameters. No load-bearing step reduces a claimed prediction or result to a fitted input, self-definition, or self-citation chain within the paper's own equations; the central claim remains an independent extension of continuous-time metastability results rather than a renaming or tautological restatement of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that gradient noise follows an α-stable law and that the continuous SDE driven by Lévy motion is the appropriate limiting object. No free parameters or invented entities are introduced in the abstract.

axioms (2)
  • domain assumption Gradient noise admits a non-Gaussian, heavy-tailed behavior that can be modeled by α-stable distributions
    Core modeling premise stated in the abstract that enables viewing SGD as discretization of a Lévy-driven SDE.
  • domain assumption SGD can be viewed as a discretization of a stochastic differential equation driven by a Lévy motion
    Modeling step that connects discrete algorithm to continuous metastability results.

pith-pipeline@v0.9.0 · 5855 in / 1365 out tokens · 30647 ms · 2026-05-25T18:53:16.487506+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Adaptive Federated Optimization

    cs.LG 2020-02 unverdicted novelty 6.0

    Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 11 internal anchors

  1. [1]

    Stochastic gradient descent tricks

    Léon Bottou. Stochastic gradient descent tricks. InNeural networks: Tricks of the trade, pages 421–436. Springer, 2012

  2. [2]

    Large-scale machine learning with stochastic gradient descent

    Léon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMP- STAT’2010, pages 177–186. Springer, 2010

  3. [3]

    Chaudhari and S

    P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. InInternational Conference on Learning Representations, 2018

  4. [4]

    Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

    Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016

  5. [5]

    Deep learning.nature, 521(7553):436, 2015

    Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning.nature, 521(7553):436, 2015

  6. [6]

    “imşekli, L

    U. “imşekli, L. Sagun, and Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. InICML, 2019

  7. [7]

    Three Factors Influencing Minima in SGD

    S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors influencing minima in SGD.arXiv preprint arXiv:1711.04623, 2017

  8. [8]

    Flat minima.Neural Computation, 9(1):1–42, 1997

    Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997

  9. [9]

    On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

    Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.arXiv preprint arXiv:1609.04836, 2016

  10. [10]

    Mandt, M

    S. Mandt, M. Hoffman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016

  11. [11]

    Q. Li, C. Tai, and W. E. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017

  12. [12]

    W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diffusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017

  13. [13]

    The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

    Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Effects.arXiv preprint arXiv:1803.00195, 2018

  14. [14]

    Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

    Thanh Huy Nguyen, Umut “imşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. InInternational Conference on Machine Learning, 2019

  15. [15]

    Springer, 2005

    Bernt Karsten Øksendal and Agnes Sulem.Applied stochastic control of jump diffusions, volume 498. Springer, 2005. 12

  16. [16]

    First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

    Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

  17. [17]

    Imkeller, I

    P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010

  18. [18]

    First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

    Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

  19. [19]

    S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. InInternational Conference on Learning Representations, 2019

  20. [20]

    B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. InProceedings of the 2018 Conference on Learning Theory, 2018

  21. [21]

    J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015

  22. [22]

    J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976

  23. [23]

    Chapman and Hall/CRC, 2003

    Peter Tankov.Financial modelling with jump processes. Chapman and Hall/CRC, 2003

  24. [24]

    Metastability in reversible diffu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

    Anton Bovier, Michael Eckhoff, Véronique Gayrard, and Markus Klein. Metastability in reversible diffu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

  25. [25]

    First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

    Ilya Pavlyukevich. First exit times of solutions of stochastic differential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

  26. [26]

    Kramers' law: Validity, derivations and generalisations

    Nils Berglund. Kramers’ law: Validity, derivations and generalisations.arXiv preprint arXiv:1106.5799, 2011

  27. [27]

    Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

    Toralf Burghoff and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

  28. [28]

    Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

    Enrico Priola et al. Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

  29. [29]

    On weak uniqueness and distributional properties of a solution to an sde withα-stable noise

    Alexei M Kulik. On weak uniqueness and distributional properties of a solution to an sde withα-stable noise. Stochastic Processes and their Applications, 129(2):473–506, 2019

  30. [30]

    Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative L\'{e}vy Noises via Coupling

    Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling.arXiv preprint arXiv:1801.05936, 2018

  31. [31]

    Raginsky, A

    M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017. 13

  32. [32]

    Global convergence of Langevin dynamics based algorithms for nonconvex optimization

    Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. InAdvances in Neural Information Processing Systems, pages 3125–3136, 2018

  33. [33]

    M. A. Erdogdu, L. Mackey, and O. Shamir. Global Non-convex Optimization with Discretized Diffusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018

  34. [34]

    Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

    Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

  35. [35]

    Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration.arXiv e-prints, page arXiv:1809.04618, Sep 2018

  36. [36]

    Convergence Analysis of Two-layer Neural Networks with ReLU Activation

    Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017

  37. [37]

    Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

    Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016

  38. [38]

    The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size

    Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.arXiv preprint arXiv:1811.07062, 2018

  39. [39]

    On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

    R Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

  40. [40]

    A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017

  41. [41]

    Existence of densities for stable-like driven sde’s with Hölder continuous coefficients.Journal of Functional Analysis, 264(8):1757–1778, 2013

    Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven sde’s with Hölder continuous coefficients.Journal of Functional Analysis, 264(8):1757–1778, 2013

  42. [42]

    Courier Corporation, 2002

    Torgny Lindvall.Lectures on the coupling method. Courier Corporation, 2002

  43. [43]

    Weak reflection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

    Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reflection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

  44. [44]

    Ergodicity of stochastic differential equations with jumps and singular coefficients

    Longjie Xie and Xicheng Zhang. Ergodicity of stochastic differential equations with jumps and singular coefficients. arXiv preprint arXiv:1705.07402, 2017

  45. [45]

    Moments and Absolute Moments of the Normal Distribution

    Andreas Winkelbauer. Moments and absolute moments of the normal distribution.arXiv preprint arXiv:1209.4340, 2012. 14 7 Appendix 7.1 Proof of Theorem 2 Proof. Note that (W 1,...,W K)∈A is equivalent to¯τ0,a(ε)>K . Hence, from Lemma 4, the remaining task is to upper-boundP[(W (η),...,W (Kη))∈A]: P[(W (η),...,W (Kη))∈A]≤P[(W (η),...,W (Kη))∈A∩B] + P[(W (η),...