A Unified Analysis of Stochastic Momentum Methods for Deep Learning

Qihang Lin; Tianbao Yang; Yan Yan; Yi Yang; Zhe Li

arxiv: 1808.10396 · v1 · pith:YB2SCWL6new · submitted 2018-08-30 · 💻 cs.LG · stat.ML



keywords: stochastic, analysis, momentum, convergence, deep, generalization, gradient, method


Stochastic momentum methods have been widely adopted in training deep neural networks. However, the theoretical analysis of their convergence on the training objective and of their generalization error for prediction remains under-explored. This paper aims to bridge the gap between practice and theory by analyzing the stochastic gradient (SG) method and stochastic momentum methods, including two famous variants: the stochastic heavy-ball (SHB) method and the stochastic variant of Nesterov's accelerated gradient (SNAG) method. We propose a framework that unifies the three variants. We then derive convergence rates for the norm of the gradient in non-convex optimization, and analyze generalization performance through the uniform stability approach. In particular, the convergence analysis of the training objective shows that SHB and SNAG have no advantage over SG. However, the stability analysis shows that the momentum term can improve the stability of the learned model and hence its generalization performance. These theoretical insights verify the common wisdom and are also corroborated by our empirical analysis on deep learning.
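The abstract does not spell out the unifying framework, so the following is only a minimal sketch of one common way to write SG, SHB, and SNAG as a single update rule. The function name `unified_momentum`, the interpolation parameter `s`, and the toy objective are assumptions made for illustration, not taken from this page. In this parameterization, `s = 0` gives the heavy-ball update, `s = 1` gives the Nesterov update, and `s = 1 / (1 - beta)` reduces to plain SG with step size `alpha / (1 - beta)`.

```python
import random


def unified_momentum(grad, x0, alpha, beta, s, steps, noise=0.0, seed=0):
    """Run a unified momentum update on a scalar problem (illustrative only).

    Assumed update rule, with g a (possibly noisy) gradient at x_k:
        y_{k+1}   = x_k - alpha * g
        ys_{k+1}  = x_k - s * alpha * g
        x_{k+1}   = y_{k+1} + beta * (ys_{k+1} - ys_k)
    with ys_0 = x_0.
    """
    rng = random.Random(seed)
    x = x0
    ys_prev = x0  # auxiliary sequence starts at the initial point
    for _ in range(steps):
        # Stochastic gradient: true gradient plus optional Gaussian noise.
        g = grad(x) + noise * rng.gauss(0.0, 1.0)
        y = x - alpha * g
        ys = x - s * alpha * g
        x = y + beta * (ys - ys_prev)
        ys_prev = ys
    return x


def grad(x):
    # Gradient of the toy objective f(x) = 0.5 * x**2 (hypothetical example).
    return x


alpha, beta = 0.1, 0.5
x_shb = unified_momentum(grad, 1.0, alpha, beta, s=0.0, steps=50)
x_snag = unified_momentum(grad, 1.0, alpha, beta, s=1.0, steps=50)
x_sg = unified_momentum(grad, 1.0, alpha, beta, s=1.0 / (1.0 - beta), steps=50)
```

With `noise=0.0` the runs are deterministic, which makes it easy to check that the `s = 1 / (1 - beta)` case matches ordinary gradient descent with the rescaled step size; setting `noise` to a positive value gives a rough stochastic variant for experimentation.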


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Perfect Parallelization in Mini-Batch SGD with Classical Momentum Acceleration
cs.LG 2026-05 unverdicted novelty 6.0

Classical momentum acceleration in mini-batch SGD for quadratics is proportional to batch size up to saturation, enabling perfect parallelization under minimal noise assumptions.

Adaptive Memory Momentum via a Model-Based Framework for Deep Learning Optimization
cs.LG 2025-10 unverdicted novelty 6.0

Presents a model-based proximal framework for adaptive momentum in first-order optimizers by using a two-plane approximation of the objective to dynamically set the memory coefficient online.