First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Ga\"el Richard; Mert G\"urb\"uzbalaban; Thanh Huy Nguyen; Umut \c{S}im\c{s}ekli

arxiv: 1906.09069 · v1 · pith:YCVPJBAAnew · submitted 2019-06-21 · 📊 stat.ML · cs.LG

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise

Thanh Huy Nguyen , Umut \c{S}im\c{s}ekli , Mert G\"urb\"uzbalaban , Ga\"el Richard This is my paper

Pith reviewed 2026-05-25 18:53 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords stochastic gradient descentheavy-tailed noisealpha-stable distributionsmetastabilityfirst exit timeLevy motioncontinuous-time approximationstep-size conditions

0 comments

The pith

Explicit step-size bounds make discrete SGD first-exit times match those of its Levy-driven continuous limit under alpha-stable noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit conditions on the step-size so that the metastability behavior of discrete SGD, modeled with heavy-tailed alpha-stable gradient noise, stays close to the behavior of the corresponding continuous-time SDE driven by Levy motion. A sympathetic reader would care because this supplies a concrete criterion for when practical discrete SGD inherits the escape-time properties analyzed in the continuous setting, such as differential exit rates from narrow versus wide minima. The work shows that the two systems are similar for sufficiently small step-sizes and gives an explicit dependence of the approximation error on the step-size, the stability index, and other problem parameters. The results are illustrated by simulations on a synthetic model and on neural networks.

Core claim

Under the modeling assumption that gradient noise follows an alpha-stable law, SGD is a discretization of an SDE driven by Levy motion. The paper establishes explicit step-size conditions under which the first-exit-time statistics of the discrete system remain close to those of the continuous system; the behaviors coincide for small enough steps and the discrepancy is bounded by terms that depend on the algorithm parameters and the problem data.

What carries the argument

First-exit-time comparison between discrete SGD iterates and the continuous Levy-driven SDE under alpha-stable gradient noise.

If this is right

When the step-size satisfies the explicit bounds, the discrete and continuous systems exhibit similar metastability.
The error between discrete and continuous exit times is controlled by the step-size, the stability index alpha, and other problem parameters.
The similarity result applies directly to the alpha-stable noise model used for the gradient.
Simulations confirm the predicted closeness on both synthetic problems and neural-network training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The derived bounds could be used in practice to choose learning rates that justify applying continuous-time escape-time formulas to discrete SGD runs.
The same step-size analysis might be testable on other discretization schemes for Levy-driven equations beyond standard SGD.
The framework raises whether analogous conditions exist for other dynamical properties such as convergence speed under the same noise model.

Load-bearing premise

Gradient noise can be modeled by alpha-stable distributions so that SGD becomes a discretization of a Levy-driven stochastic differential equation.

What would settle it

Numerical computation of first-exit times from a basin showing that the discrete SGD times and the continuous Levy SDE times differ by more than the paper's error bound whenever the step-size exceeds the derived threshold.

Figures

Figures reproduced from arXiv: 1906.09069 by Ga\"el Richard, Mert G\"urb\"uzbalaban, Thanh Huy Nguyen, Umut \c{S}im\c{s}ekli.

**Figure 1.** Figure 1: Illustration of SαS (left), L α t (middle), wide-narrow minima (right). [3, 7, 10, 11, 12, 13]. Under the Gaussian noise assumption, the following continuous-time limit of SGD has been considered in the literature to analyze the behavior of SGD: dW(t) = −∇f(W(t))dt + √ ησdB(t) (4) where B(t) is the standard Brownian motion and σ is the noise variance and η is the step-size. The Gaussianity of the gradient … view at source ↗

**Figure 2.** Figure 2: Results of the synthetic experiments. 5 Numerical Illustration To illustrate our results, we first conduct experiments on a synthetic problem, where the cost function is set to f(x) = 1 2 kxk 2 . This corresponds to an Ornstein-Uhlenbeck-type process, which is commonly considered in metastability analyses [21]. This process locally satisfies the conditions A1-A5. Since we cannot directly simulate the conti… view at source ↗

**Figure 3.** Figure 3: Results of the neural network experiments. [PITH_FULL_IMAGE:figures/full_fig_p011_3.png] view at source ↗

read the original abstract

Stochastic gradient descent (SGD) has been widely used in machine learning due to its computational efficiency and favorable generalization properties. Recently, it has been empirically demonstrated that the gradient noise in several deep learning settings admits a non-Gaussian, heavy-tailed behavior. This suggests that the gradient noise can be modeled by using $\alpha$-stable distributions, a family of heavy-tailed distributions that appear in the generalized central limit theorem. In this context, SGD can be viewed as a discretization of a stochastic differential equation (SDE) driven by a L\'{e}vy motion, and the metastability results for this SDE can then be used for illuminating the behavior of SGD, especially in terms of `preferring wide minima'. While this approach brings a new perspective for analyzing SGD, it is limited in the sense that, due to the time discretization, SGD might admit a significantly different behavior than its continuous-time limit. Intuitively, the behaviors of these two systems are expected to be similar to each other only when the discretization step is sufficiently small; however, to the best of our knowledge, there is no theoretical understanding on how small the step-size should be chosen in order to guarantee that the discretized system inherits the properties of the continuous-time system. In this study, we provide formal theoretical analysis where we derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit. We show that the behaviors of the two systems are indeed similar for small step-sizes and we identify how the error depends on the algorithm and problem parameters. We illustrate our results with simulations on a synthetic model and neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper supplies explicit step-size conditions so discrete SGD under alpha-stable noise has first-exit times close to its continuous Levy SDE limit, with error scaling made explicit.

read the letter

The main thing to know is that the authors derive concrete step-size conditions under which the first-exit-time behavior of discrete SGD with heavy-tailed alpha-stable noise stays close to the continuous-time Levy-driven SDE, and they track how the approximation error depends on the step size plus other algorithm and problem parameters. They also confirm similarity holds once the step size is small enough. This closes a gap that the continuous metastability literature left open for the discrete case. The work is new in that explicit discretization control for this noise model was not available before. It does a straightforward job of starting from the empirical heavy-tail premise, moving to the SDE limit, and then quantifying when the discrete version inherits the exit-time properties without adding extra assumptions that would break the link. The simulations on a synthetic model and on neural nets are there to check the theory rather than to stand alone. The soft spots are limited. The central modeling choice that gradient noise really is alpha-stable comes from earlier empirical papers and is taken as given here; if that premise is weak the comparison loses relevance, but the paper does not claim to prove the noise model. The error bounds are stated to be explicit, yet without the full derivations visible it is hard to judge their tightness or whether any hidden regularity conditions on the Levy process are needed. No circularity or internal contradiction shows up in the abstract or stress-test notes. This paper is for people already working on continuous-time limits for SGD generalization and metastability arguments. A reader who knows the Levy-process metastability results will see the value immediately in the discretization step. It deserves a serious referee because it supplies a missing quantitative bridge with parameter dependence rather than just another existence argument.

Referee Report

0 major / 3 minor

Summary. The manuscript derives explicit conditions on the discretization step-size such that the first-exit-time (metastability) behavior of discrete SGD driven by α-stable gradient noise approximates the corresponding behavior of the limiting Lévy-driven SDE; it further identifies the explicit dependence of the approximation error on algorithm and problem parameters and illustrates the results via simulations on a synthetic model and neural networks.

Significance. If the derivations hold, the work supplies the missing theoretical link between discrete SGD and continuous-time metastability analyses under heavy-tailed noise, thereby strengthening the justification for using SDE-based arguments to explain SGD's preference for wide minima. The explicit step-size conditions and error bounds constitute a concrete advance over prior informal appeals to the continuous-time limit.

minor comments (3)

§1, paragraph following Eq. (3): the statement that the α-stable model is 'empirically demonstrated' would benefit from a brief citation to the specific empirical studies referenced, rather than leaving the claim at the level of the abstract.
The notation for the Lévy measure and the stability parameter α is introduced without an explicit reminder of the domain (0<α≤2) in the main text; adding this once in §2 would improve readability for readers outside probability theory.
Figure captions for the neural-network experiments should state the precise architecture, dataset, and number of independent runs so that the simulation results can be reproduced from the text alone.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript and the recommendation for minor revision. The referee's summary accurately reflects the paper's focus on deriving explicit step-size conditions for the first-exit-time behavior of discrete SGD to approximate its Lévy-driven SDE limit under α-stable noise.

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper derives explicit step-size conditions for the metastability (first-exit-time) behavior of discrete SGD to approximate its continuous-time Lévy-driven SDE limit under α-stable gradient noise. The modeling premise is presented as an empirical starting point drawn from prior observations, after which the analysis proceeds by establishing discretization error bounds that depend explicitly on algorithm and problem parameters. No load-bearing step reduces a claimed prediction or result to a fitted input, self-definition, or self-citation chain within the paper's own equations; the central claim remains an independent extension of continuous-time metastability results rather than a renaming or tautological restatement of its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions: that gradient noise follows an α-stable law and that the continuous SDE driven by Lévy motion is the appropriate limiting object. No free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption Gradient noise admits a non-Gaussian, heavy-tailed behavior that can be modeled by α-stable distributions
Core modeling premise stated in the abstract that enables viewing SGD as discretization of a Lévy-driven SDE.
domain assumption SGD can be viewed as a discretization of a stochastic differential equation driven by a Lévy motion
Modeling step that connects discrete algorithm to continuous metastability results.

pith-pipeline@v0.9.0 · 5855 in / 1365 out tokens · 30647 ms · 2026-05-25T18:53:16.487506+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We derive explicit conditions for the step-size such that the metastability behavior of the discrete-time system is similar to its continuous-time limit... under α-stable distributions... Lévy motion
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

KL(ˆµt,µt) = 1/2ε²σ² E[∫₀ᵗ ∥b(Ŵ) + ∇f(Ŵ(s))∥² ds]

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Adaptive Federated Optimization
cs.LG 2020-02 unverdicted novelty 6.0

Proposes federated adaptive optimizers (FedAdagrad, FedAdam, FedYogi) with convergence analysis for non-convex objectives under data heterogeneity and reports empirical gains over FedAvg.

Reference graph

Works this paper leans on

45 extracted references · 45 canonical work pages · cited by 1 Pith paper · 11 internal anchors

[1]

Stochastic gradient descent tricks

Léon Bottou. Stochastic gradient descent tricks. InNeural networks: Tricks of the trade, pages 421–436. Springer, 2012

work page 2012
[2]

Large-scale machine learning with stochastic gradient descent

Léon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMP- STAT’2010, pages 177–186. Springer, 2010

work page 2010
[3]

Chaudhari and S

P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. InInternational Conference on Learning Representations, 2018

work page 2018
[4]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[5]

Deep learning.nature, 521(7553):436, 2015

Yann LeCun, Yoshua Bengio, and Geoﬀrey Hinton. Deep learning.nature, 521(7553):436, 2015

work page 2015
[6]

“imşekli, L

U. “imşekli, L. Sagun, and Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. InICML, 2019

work page 2019
[7]

Three Factors Influencing Minima in SGD

S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors inﬂuencing minima in SGD.arXiv preprint arXiv:1711.04623, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[8]

Flat minima.Neural Computation, 9(1):1–42, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997

work page 1997
[9]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.arXiv preprint arXiv:1609.04836, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[10]

Mandt, M

S. Mandt, M. Hoﬀman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016

work page 2016
[11]

Q. Li, C. Tai, and W. E. Stochastic Modiﬁed Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017

work page 2017
[12]

W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diﬀusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[13]

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Eﬀects.arXiv preprint arXiv:1803.00195, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[14]

Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

Thanh Huy Nguyen, Umut “imşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. InInternational Conference on Machine Learning, 2019

work page 2019
[15]

Springer, 2005

Bernt Karsten Øksendal and Agnes Sulem.Applied stochastic control of jump diﬀusions, volume 498. Springer, 2005. 12

work page 2005
[16]

First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

work page 2006
[17]

Imkeller, I

P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010

work page 2010
[18]

First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

work page 2010
[19]

S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. InInternational Conference on Learning Representations, 2019

work page 2019
[20]

B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. InProceedings of the 2018 Conference on Learning Theory, 2018

work page 2018
[21]

J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015

work page 2015
[22]

J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976

work page 1976
[23]

Chapman and Hall/CRC, 2003

Peter Tankov.Financial modelling with jump processes. Chapman and Hall/CRC, 2003

work page 2003
[24]

Metastability in reversible diﬀu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

Anton Bovier, Michael Eckhoﬀ, Véronique Gayrard, and Markus Klein. Metastability in reversible diﬀu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

work page 2004
[25]

First exit times of solutions of stochastic diﬀerential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

Ilya Pavlyukevich. First exit times of solutions of stochastic diﬀerential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

work page 2011
[26]

Kramers' law: Validity, derivations and generalisations

Nils Berglund. Kramers’ law: Validity, derivations and generalisations.arXiv preprint arXiv:1106.5799, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011
[27]

Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

Toralf Burghoﬀ and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

work page 2015
[28]

Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

Enrico Priola et al. Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

work page 2012
[29]

On weak uniqueness and distributional properties of a solution to an sde withα-stable noise

Alexei M Kulik. On weak uniqueness and distributional properties of a solution to an sde withα-stable noise. Stochastic Processes and their Applications, 129(2):473–506, 2019

work page 2019
[30]

Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative L\'{e}vy Noises via Coupling

Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling.arXiv preprint arXiv:1801.05936, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[31]

Raginsky, A

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017. 13

work page 2017
[32]

Global convergence of Langevin dynamics based algorithms for nonconvex optimization

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. InAdvances in Neural Information Processing Systems, pages 3125–3136, 2018

work page 2018
[33]

M. A. Erdogdu, L. Mackey, and O. Shamir. Global Non-convex Optimization with Discretized Diﬀusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018

work page 2018
[34]

Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

work page arXiv 2018
[35]

Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration.arXiv e-prints, page arXiv:1809.04618, Sep 2018

work page arXiv 2018
[36]

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017

work page 2017
[37]

Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[38]

The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size

Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.arXiv preprint arXiv:1811.07062, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[39]

On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

R Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

work page 2018
[40]

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017

work page 2017
[41]

Existence of densities for stable-like driven sde’s with Hölder continuous coeﬃcients.Journal of Functional Analysis, 264(8):1757–1778, 2013

Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven sde’s with Hölder continuous coeﬃcients.Journal of Functional Analysis, 264(8):1757–1778, 2013

work page 2013
[42]

Courier Corporation, 2002

Torgny Lindvall.Lectures on the coupling method. Courier Corporation, 2002

work page 2002
[43]

Weak reﬂection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reﬂection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

work page 2015
[44]

Ergodicity of stochastic differential equations with jumps and singular coefficients

Longjie Xie and Xicheng Zhang. Ergodicity of stochastic diﬀerential equations with jumps and singular coeﬃcients. arXiv preprint arXiv:1705.07402, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[45]

Moments and Absolute Moments of the Normal Distribution

Andreas Winkelbauer. Moments and absolute moments of the normal distribution.arXiv preprint arXiv:1209.4340, 2012. 14 7 Appendix 7.1 Proof of Theorem 2 Proof. Note that (W 1,...,W K)∈A is equivalent to¯τ0,a(ε)>K . Hence, from Lemma 4, the remaining task is to upper-boundP[(W (η),...,W (Kη))∈A]: P[(W (η),...,W (Kη))∈A]≤P[(W (η),...,W (Kη))∈A∩B] + P[(W (η),...

work page internal anchor Pith review Pith/arXiv arXiv 2012

[1] [1]

Stochastic gradient descent tricks

Léon Bottou. Stochastic gradient descent tricks. InNeural networks: Tricks of the trade, pages 421–436. Springer, 2012

work page 2012

[2] [2]

Large-scale machine learning with stochastic gradient descent

Léon Bottou. Large-scale machine learning with stochastic gradient descent. InProceedings of COMP- STAT’2010, pages 177–186. Springer, 2010

work page 2010

[3] [3]

Chaudhari and S

P. Chaudhari and S. Soatto. Stochastic gradient descent performs variational inference, converges to limit cycles for deep networks. InInternational Conference on Learning Representations, 2018

work page 2018

[4] [4]

Entropy-SGD: Biasing Gradient Descent Into Wide Valleys

Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[5] [5]

Deep learning.nature, 521(7553):436, 2015

Yann LeCun, Yoshua Bengio, and Geoﬀrey Hinton. Deep learning.nature, 521(7553):436, 2015

work page 2015

[6] [6]

“imşekli, L

U. “imşekli, L. Sagun, and Gürbüzbalaban. A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks. InICML, 2019

work page 2019

[7] [7]

Three Factors Influencing Minima in SGD

S. Jastrzebski, Z. Kenton, D. Arpit, N. Ballas, A. Fischer, Y. Bengio, and A. Storkey. Three factors inﬂuencing minima in SGD.arXiv preprint arXiv:1711.04623, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[8] [8]

Flat minima.Neural Computation, 9(1):1–42, 1997

Sepp Hochreiter and Jürgen Schmidhuber. Flat minima.Neural Computation, 9(1):1–42, 1997

work page 1997

[9] [9]

On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima

Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.arXiv preprint arXiv:1609.04836, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[10] [10]

Mandt, M

S. Mandt, M. Hoﬀman, and D. Blei. A variational analysis of stochastic gradient algorithms. In International Conference on Machine Learning, pages 354–363, 2016

work page 2016

[11] [11]

Q. Li, C. Tai, and W. E. Stochastic Modiﬁed Equations and Adaptive Stochastic Gradient Algorithms. In Proceedings of the 34th International Conference on Machine Learning, pages 2101–2110, 06–11 Aug 2017

work page 2017

[12] [12]

W. Hu, C. J. Li, L. Li, and J.-G. Liu. On the diﬀusion approximation of nonconvex stochastic gradient descent. arXiv preprint arXiv:1705.07562, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[13] [13]

The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects

Zhanxing Zhu, Jingfeng Wu, Bing Yu, Lei Wu, and Jinwen Ma. The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Minima and Regularization Eﬀects.arXiv preprint arXiv:1803.00195, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[14] [14]

Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization

Thanh Huy Nguyen, Umut “imşekli, and Gaël Richard. Non-Asymptotic Analysis of Fractional Langevin Monte Carlo for Non-Convex Optimization. InInternational Conference on Machine Learning, 2019

work page 2019

[15] [15]

Springer, 2005

Bernt Karsten Øksendal and Agnes Sulem.Applied stochastic control of jump diﬀusions, volume 498. Springer, 2005. 12

work page 2005

[16] [16]

First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

Peter Imkeller and Ilya Pavlyukevich. First exit times of sdes driven by stable Lévy processes.Stochastic Processes and their Applications, 116(4):611–642, 2006

work page 2006

[17] [17]

Imkeller, I

P. Imkeller, I. Pavlyukevich, and T. Wetzel. The hierarchy of exit times of Lévy-driven Langevin equations. The European Physical Journal Special Topics, 191(1):211–222, 2010

work page 2010

[18] [18]

First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

Peter Imkeller, Ilya Pavlyukevich, and Michael Stauch. First exit times of non-linear dynamical systems in rd perturbed by multifractal Lévy noise.Journal of Statistical Physics, 141(1):94–119, 2010

work page 2010

[19] [19]

S. Yaida. Fluctuation-dissipation relations for stochastic gradient descent. InInternational Conference on Learning Representations, 2019

work page 2019

[20] [20]

B. Tzen, T. Liang, and M. Raginsky. Local Optimality and Generalization Guarantees for the Langevin Algorithm via Empirical Metastability. InProceedings of the 2018 Conference on Learning Theory, 2018

work page 2018

[21] [21]

J. Duan. An Introduction to Stochastic Dynamics. Cambridge University Press, New York, 2015

work page 2015

[22] [22]

J. M. Chambers, C. L. Mallows, and B. W. Stuck. A method for simulating stable random variables. Journal of the american statistical association, 71(354):340–344, 1976

work page 1976

[23] [23]

Chapman and Hall/CRC, 2003

Peter Tankov.Financial modelling with jump processes. Chapman and Hall/CRC, 2003

work page 2003

[24] [24]

Metastability in reversible diﬀu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

Anton Bovier, Michael Eckhoﬀ, Véronique Gayrard, and Markus Klein. Metastability in reversible diﬀu- sion processes I: Sharp asymptotics for capacities and exit times.Journal of the European Mathematical Society, 6(4):399–424, 2004

work page 2004

[25] [25]

First exit times of solutions of stochastic diﬀerential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

Ilya Pavlyukevich. First exit times of solutions of stochastic diﬀerential equations driven by multiplicative Lévy noise with heavy tails.Stochastics and Dynamics, 11(02n03):495–519, 2011

work page 2011

[26] [26]

Kramers' law: Validity, derivations and generalisations

Nils Berglund. Kramers’ law: Validity, derivations and generalisations.arXiv preprint arXiv:1106.5799, 2011

work page internal anchor Pith review Pith/arXiv arXiv 2011

[27] [27]

Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

Toralf Burghoﬀ and Ilya Pavlyukevich. Spectral Analysis for a Discrete Metastable System Driven by Lévy Flights.Journal of Statistical Physics, 161(1):171–196, 2015

work page 2015

[28] [28]

Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

Enrico Priola et al. Pathwise uniqueness for singular sdes driven by stable processes.Osaka Journal of Mathematics, 49(2):421–447, 2012

work page 2012

[29] [29]

On weak uniqueness and distributional properties of a solution to an sde withα-stable noise

Alexei M Kulik. On weak uniqueness and distributional properties of a solution to an sde withα-stable noise. Stochastic Processes and their Applications, 129(2):473–506, 2019

work page 2019

[30] [30]

Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative L\'{e}vy Noises via Coupling

Mingjie Liang and Jian Wang. Gradient Estimates and Ergodicity for SDEs Driven by Multiplicative Lévy Noises via Coupling.arXiv preprint arXiv:1801.05936, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[31] [31]

Raginsky, A

M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Proceedings of the 2017 Conference on Learning Theory, volume 65, pages 1674–1703, 2017. 13

work page 2017

[32] [32]

Global convergence of Langevin dynamics based algorithms for nonconvex optimization

Pan Xu, Jinghui Chen, Difan Zou, and Quanquan Gu. Global convergence of Langevin dynamics based algorithms for nonconvex optimization. InAdvances in Neural Information Processing Systems, pages 3125–3136, 2018

work page 2018

[33] [33]

M. A. Erdogdu, L. Mackey, and O. Shamir. Global Non-convex Optimization with Discretized Diﬀusions. In Advances in Neural Information Processing Systems, pages 9693–9702, 2018

work page 2018

[34] [34]

Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

Xuefeng Gao, Mert Gurbuzbalaban, and Lingjiong Zhu. Breaking Reversibility Accelerates Langevin Dynamics for Global Non-Convex Optimization.arXiv e-prints, page arXiv:1812.07725, Dec 2018

work page arXiv 2018

[35] [35]

Xuefeng Gao, Mert Gürbüzbalaban, and Lingjiong Zhu. Global Convergence of Stochastic Gradient Hamiltonian Monte Carlo for Non-Convex Stochastic Optimization: Non-Asymptotic Performance Bounds and Momentum-Based Acceleration.arXiv e-prints, page arXiv:1809.04618, Sep 2018

work page arXiv 2018

[36] [36]

Convergence Analysis of Two-layer Neural Networks with ReLU Activation

Yuanzhi Li and Yang Yuan. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 597–607. Curran Associates, Inc., 2017

work page 2017

[37] [37]

Eigenvalues of the Hessian in Deep Learning: Singularity and Beyond

Levent Sagun, Leon Bottou, and Yann LeCun. Eigenvalues of the hessian in deep learning: Singularity and beyond. arXiv preprint arXiv:1611.07476, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016

[38] [38]

The Full Spectrum of Deepnet Hessians at Scale: Dynamics with SGD Training and Sample Size

Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size.arXiv preprint arXiv:1811.07062, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[39] [39]

On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

R Mikulevičius and Fanhui Xu. On the rate of convergence of strong Euler approximation for SDEs driven by lévy processes.Stochastics, 90(4):569–604, 2018

work page 2018

[40] [40]

A. S. Dalalyan. Theoretical guarantees for approximate sampling from smooth and log-concave densities. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):651–676, 2017

work page 2017

[41] [41]

Existence of densities for stable-like driven sde’s with Hölder continuous coeﬃcients.Journal of Functional Analysis, 264(8):1757–1778, 2013

Arnaud Debussche and Nicolas Fournier. Existence of densities for stable-like driven sde’s with Hölder continuous coeﬃcients.Journal of Functional Analysis, 264(8):1757–1778, 2013

work page 2013

[42] [42]

Courier Corporation, 2002

Torgny Lindvall.Lectures on the coupling method. Courier Corporation, 2002

work page 2002

[43] [43]

Weak reﬂection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

Erhan Bayraktar, Sergey Nadtochiy, et al. Weak reﬂection principle for Lévy processes.The Annals of Applied Probability, 25(6):3251–3294, 2015

work page 2015

[44] [44]

Ergodicity of stochastic differential equations with jumps and singular coefficients

Longjie Xie and Xicheng Zhang. Ergodicity of stochastic diﬀerential equations with jumps and singular coeﬃcients. arXiv preprint arXiv:1705.07402, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[45] [45]

Moments and Absolute Moments of the Normal Distribution

Andreas Winkelbauer. Moments and absolute moments of the normal distribution.arXiv preprint arXiv:1209.4340, 2012. 14 7 Appendix 7.1 Proof of Theorem 2 Proof. Note that (W 1,...,W K)∈A is equivalent to¯τ0,a(ε)>K . Hence, from Lemma 4, the remaining task is to upper-boundP[(W (η),...,W (Kη))∈A]: P[(W (η),...,W (Kη))∈A]≤P[(W (η),...,W (Kη))∈A∩B] + P[(W (η),...

work page internal anchor Pith review Pith/arXiv arXiv 2012