A Walk with SGD
read the original abstract
We present novel empirical observations regarding how stochastic gradient descent (SGD) navigates the loss landscape of over-parametrized deep neural networks (DNNs). These observations expose the qualitatively different roles of learning rate and batch size in DNN optimization and generalization. Specifically, we study the DNN loss surface along the trajectory of SGD by interpolating the loss between parameters from consecutive iterations and tracking various metrics during training. We find that the loss interpolation between the parameters before and after each training iteration's update is roughly convex, with a minimum (the valley floor) in between, for most of the training. Based on this and other metrics, we deduce that for most of the training update steps SGD moves in valley-like regions of the loss surface, jumping from one valley wall to another at a height above the valley floor. This 'bouncing between walls at a height' mechanism helps SGD traverse larger distances for small batch sizes and large learning rates, which we find play qualitatively different roles in the dynamics. While a large learning rate maintains a large height above the valley floor, a small batch size injects noise that facilitates exploration. We find this mechanism is crucial for generalization because the valley floor has barriers, and this exploration above the valley floor allows SGD to quickly travel far from the initialization point (without being affected by the barriers) and find flatter regions, which correspond to better generalization.
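The interpolation probe described in the abstract is straightforward to reproduce. The sketch below is illustrative rather than the authors' code: it assumes PyTorch, and the toy model, random data, and hyperparameters are placeholders. It linearly interpolates between the parameter vectors before and after each SGD step, evaluates the mini-batch loss along that segment, and tracks the distance from the initialization point.

```python
# Minimal sketch (not the authors' code) of the loss-interpolation probe.
# Assumes PyTorch; the model, data, and hyperparameters are illustrative placeholders.
import copy
import torch
import torch.nn as nn

def flatten(params):
    # Concatenate all parameters into one detached vector (torch.cat copies the data).
    return torch.cat([p.detach().reshape(-1) for p in params])

def interpolated_losses(model, theta_a, theta_b, loss_fn, batch, n_points=11):
    """Evaluate the loss at theta(alpha) = (1 - alpha) * theta_a + alpha * theta_b."""
    xs, ys = batch
    losses = []
    probe = copy.deepcopy(model)
    for alpha in torch.linspace(0.0, 1.0, n_points):
        vec = (1 - alpha) * theta_a + alpha * theta_b
        offset = 0
        with torch.no_grad():
            # Write the interpolated vector back into the probe model's parameters.
            for p in probe.parameters():
                n = p.numel()
                p.copy_(vec[offset:offset + n].view_as(p))
                offset += n
            losses.append(loss_fn(probe(xs), ys).item())
    return losses

# Illustrative training loop on random data (placeholders, not the paper's setup).
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
theta_init = flatten(model.parameters())

for step in range(5):
    xs, ys = torch.randn(32, 10), torch.randn(32, 1)
    theta_before = flatten(model.parameters())
    opt.zero_grad()
    loss_fn(model(xs), ys).backward()
    opt.step()
    theta_after = flatten(model.parameters())

    line = interpolated_losses(model, theta_before, theta_after, loss_fn, (xs, ys))
    dist_from_init = (theta_after - theta_init).norm().item()
    # A roughly convex profile with an interior minimum is the "valley floor" signature
    # the abstract describes; dist_from_init tracks how far SGD has traveled.
    print(f"step {step}: interior min at alpha={line.index(min(line)) / (len(line) - 1):.1f}, "
          f"dist from init={dist_from_init:.3f}")
```

On a real network one would run this probe over many iterations and plot how the height of the interior minimum and the distance from initialization evolve with learning rate and batch size.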
This paper has not been read by Pith yet.
Forward citations
Cited by 6 Pith papers
- The Origin of Edge of Stability
Full-batch gradient descent forces the largest Hessian eigenvalue to exactly 2/η via the edge coupling functional, its criticality condition, and the mean value theorem, with no gap. (The classical stability calculation behind the 2/η threshold is sketched after this list.)
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
- SGD at the Edge of Stability: The Stochastic Sharpness Gap
SGD stabilizes sharpness below 2/η with equilibrium gap ΔS = η β σ_u²/(4α) due to noise-enhanced stochastic self-stabilization.
- Generalization at the Edge of Stability
Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error.
- (How) Learning Rates Regulate Catastrophic Overtraining
Learning rate decay during SFT increases pretrained model sharpness, which exacerbates catastrophic forgetting and causes overtraining in LLMs.
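For context on the 2/η and 2(1+β)/η thresholds quoted in several of the summaries above, here is the textbook stability calculation for gradient descent (plain and heavy-ball) on a one-dimensional quadratic; this is standard material, not a result taken from the cited papers.

```latex
% Gradient descent on f(x) = (lambda/2) x^2 with step size eta:
% the update contracts iff |1 - eta*lambda| <= 1, i.e. lambda <= 2/eta.
\[
x_{t+1} = x_t - \eta\lambda x_t = (1 - \eta\lambda)\,x_t
\quad\Longrightarrow\quad
\text{stable} \iff |1 - \eta\lambda| \le 1 \iff \lambda \le \frac{2}{\eta}.
\]
% With heavy-ball momentum beta, the update
% x_{t+1} = x_t - eta*lambda*x_t + beta (x_t - x_{t-1})
% has characteristic polynomial z^2 - (1 + beta - eta*lambda) z + beta, whose roots lie
% inside the unit disk iff 0 < eta*lambda < 2(1 + beta), giving the large-batch threshold above:
\[
\lambda < \frac{2(1+\beta)}{\eta}.
\]
```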