On the diffusion approximation of nonconvex stochastic gradient descent

Chris Junchi Li; Jian-Guo Liu; Lei Li; Wenqing Hu

arxiv: 1705.07562 · v2 · pith:ZLHBMU3Gnew · submitted 2017-05-22 · stat.ML · cs.LG

keywords: diffusion, size, theory, batch, minimizers, approximating, descent, gradient

Abstract

We study the Stochastic Gradient Descent (SGD) method in nonconvex optimization problems from the point of view of approximating diffusion processes. We prove rigorously that the diffusion process can approximate the SGD algorithm weakly using the weak form of the master equation for probability evolution. In the small step size regime and in the presence of omnidirectional noise, our weak approximating diffusion process suggests the following dynamics for the SGD iteration starting from a local minimizer (resp. saddle point): it escapes in a number of iterations exponentially (resp. almost linearly) dependent on the inverse step size. The results are obtained using the theory for random perturbations of dynamical systems (theory of large deviations for local minimizers and theory of exiting for unstable stationary points). In addition, we discuss the effects of batch size for deep neural networks, and we find that a small batch size helps SGD algorithms escape unstable stationary points and sharp minimizers. Our theory indicates that one should increase the batch size at a later stage so that SGD is trapped in flat minimizers for better generalization.
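The escape behavior near an unstable stationary point can be illustrated with a small simulation. The sketch below is not taken from the paper: the one-dimensional double-well objective, the additive Gaussian gradient noise, the exit radius, and the helper names `grad_f`, `sgd_exit_iterations`, and `mean_exit_iterations` are all assumptions chosen for illustration. It counts how many noisy gradient steps are needed to leave a neighborhood of the unstable point x = 0 for several step sizes.

```python
import numpy as np


def grad_f(x):
    # Gradient of the toy objective f(x) = x**4 / 4 - x**2 / 2 (an assumption,
    # not the paper's setting). It has an unstable stationary point at x = 0
    # and minimizers at x = +1 and x = -1.
    return x**3 - x


def sgd_exit_iterations(eta, sigma=1.0, radius=0.5, max_iter=100_000, rng=None):
    """Count noisy gradient steps until |x| >= radius, starting from x = 0."""
    rng = np.random.default_rng() if rng is None else rng
    x = 0.0
    for k in range(1, max_iter + 1):
        # Additive Gaussian noise stands in for minibatch gradient noise.
        noisy_grad = grad_f(x) + sigma * rng.standard_normal()
        x -= eta * noisy_grad
        if abs(x) >= radius:
            return k
    return max_iter  # did not exit within the iteration budget


def mean_exit_iterations(eta, trials=200, seed=0):
    """Average the exit iteration count over independent runs."""
    rng = np.random.default_rng(seed)
    return float(np.mean([sgd_exit_iterations(eta, rng=rng) for _ in range(trials)]))


if __name__ == "__main__":
    for eta in (0.1, 0.05, 0.025):
        print(f"eta={eta}: mean exit iterations = {mean_exit_iterations(eta):.1f}")
```

In this toy setting the average exit count grows as the step size shrinks, which is qualitatively in line with the dependence on the inverse step size described in the abstract. The sketch does not attempt to reproduce the exponential escape time from a local minimizer, which would require far longer runs.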

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

First Exit Time Analysis of Stochastic Gradient Descent Under Heavy-Tailed Gradient Noise
stat.ML 2019-06 unverdicted novelty 7.0

Derives explicit step-size conditions ensuring the metastability behavior of discrete SGD under heavy-tailed noise approximates its continuous SDE limit.

Why SGD is not Brownian Motion: A New Perspective on Stochastic Dynamics
cs.LG 2026-05 unverdicted novelty 6.0

SGD is reformulated via a master equation from discrete updates, producing a discrete Fokker-Planck equation that predicts non-stationary variance growth proportional to learning rate in flat Hessian directions.