
arxiv: 2209.15594 · v2 · pith:HUZWNRWWnew · submitted 2022-09-30 · cs.LG · cs.IT · math.IT · math.OC · stat.ML

Self-Stabilization: The Implicit Bias of Gradient Descent at the Edge of Stability

classification cs.LG · cs.IT · math.IT · math.OC · stat.ML
keywords descent · gradient · stability · training · edge · loss · sharpness · self-stabilization

Traditional analyses of gradient descent show that when the largest eigenvalue of the Hessian, also known as the sharpness $S(\theta)$, is bounded by $2/\eta$, training is "stable" and the training loss decreases monotonically. Recent works, however, have observed that this assumption does not hold when training modern neural networks with full-batch or large-batch gradient descent. Most recently, Cohen et al. (2021) observed two important phenomena. The first, dubbed progressive sharpening, is that the sharpness steadily increases throughout training until it reaches the instability cutoff $2/\eta$. The second, dubbed edge of stability, is that the sharpness hovers at $2/\eta$ for the remainder of training while the loss continues decreasing, albeit non-monotonically. We demonstrate that, far from being chaotic, the dynamics of gradient descent at the edge of stability can be captured by a cubic Taylor expansion: as the iterates diverge in the direction of the top eigenvector of the Hessian due to instability, the cubic term in the local Taylor expansion of the loss function causes the curvature to decrease until stability is restored. This property, which we call self-stabilization, is a general property of gradient descent and explains its behavior at the edge of stability. A key consequence of self-stabilization is that gradient descent at the edge of stability implicitly follows projected gradient descent (PGD) under the constraint $S(\theta) \le 2/\eta$. Our analysis provides precise predictions for the loss, sharpness, and deviation from the PGD trajectory throughout training, which we verify both empirically in a number of standard settings and theoretically under mild conditions. Our analysis uncovers the mechanism for gradient descent's implicit bias towards stability.
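
The classical $2/\eta$ threshold mentioned at the start of the abstract can be illustrated with a minimal sketch. It assumes a one-dimensional quadratic loss $L(x) = \tfrac{1}{2} S x^2$ with constant sharpness $S$, which is an illustrative toy and not the paper's setting or analysis; the function name `run_gd` and all numeric values are hypothetical. Each gradient step multiplies $x$ by $(1 - \eta S)$, so the iterates shrink when $S < 2/\eta$ and grow when $S > 2/\eta$.

```python
def run_gd(sharpness, lr, steps=50, x0=1.0):
    """Gradient descent on L(x) = 0.5 * sharpness * x**2 (toy assumption).

    Returns the list of |x_t| for t = 0..steps.
    """
    x = x0
    trajectory = [abs(x)]
    for _ in range(steps):
        grad = sharpness * x   # dL/dx for the quadratic
        x = x - lr * grad      # equivalent to x *= (1 - lr * sharpness)
        trajectory.append(abs(x))
    return trajectory


lr = 0.1
threshold = 2.0 / lr  # instability cutoff 2/eta

# Sharpness just below the cutoff: |x| contracts every step.
stable = run_gd(0.9 * threshold, lr)
# Sharpness just above the cutoff: |x| expands every step.
unstable = run_gd(1.1 * threshold, lr)

print(f"below 2/eta: |x| goes {stable[0]:.3g} -> {stable[-1]:.3g}")
print(f"above 2/eta: |x| goes {unstable[0]:.3g} -> {unstable[-1]:.3g}")
```

On a pure quadratic the unstable run diverges without bound because the curvature never changes. The self-stabilization described in the abstract relies on the cubic term of the loss reducing the curvature as the iterates move along the top eigenvector, which this toy deliberately omits.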



Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Non-normal spectral signatures of instability in neural network training dynamics

    cs.LG 2026-05 unverdicted novelty 7.0

    Non-normality in linearized optimizer update operators yields a pseudospectral bound where κ(V) warns of transient amplification before spectral radius indicates instability.

  2. AMUSE: Anytime Muon with Stable Gradient Evaluation

    cs.LG 2026-05 unverdicted novelty 7.0

    AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

  3. Momentum Further Constrains Sharpness at the Edge of Stochastic Stability

    cs.LG 2026-04 unverdicted novelty 7.0

    Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.

  4. Large Spikes in Stochastic Gradient Descent: A Large-Deviations View

    cs.LG 2026-03 unverdicted novelty 7.0

    Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.

  5. Does Weight Decay Enhance Training Stability?

    cs.LG 2026-05 conditional novelty 6.0

    Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding m...

  6. Generalization at the Edge of Stability

    cs.LG 2026-04 unverdicted novelty 6.0

    Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization e...

  7. Mapping Whisper Representations to Human ECoG Responses with Interpretable Time-Resolved Neural Encoding

    q-bio.NC 2026-06 unverdicted novelty 5.0

    The paper introduces a time-resolved neural encoder combining Whisper embeddings with recurrent temporal modeling and soft attention to predict ECoG responses, finding strongest alignment in intermediate layers and an...

  8. There Will Be a Scientific Theory of Deep Learning

    stat.ML 2026-04 unverdicted novelty 2.0

    A mechanics of the learning process is emerging in deep learning theory, characterized by dynamics, coarse statistics, and falsifiable predictions across idealized settings, limits, laws, hyperparameters, and universa...