Edge of Stochastic Stability: Revisiting the Edge of Stability for SGD
5 Pith papers cite this work. Polarity classification is still indexing.
Fields: cs.LG (5)
Years: 2026 (5)
Roles: extension (1)
Polarities: extend (1)
citing papers explorer
- Zeroth-Order Optimization at the Edge of Stability
  Zeroth-order methods achieve mean-square stability when the step size satisfies a condition involving the entire Hessian spectrum, with full-batch ZO optimizers operating at the edge of stability and large steps regularizing the Hessian trace.
- Momentum Further Constrains Sharpness at the Edge of Stochastic Stability
  Momentum SGD exhibits two distinct EoSS regimes for batch sharpness, stabilizing at 2(1-β)/η for small batches and 2(1+β)/η for large batches, aligning with linear stability thresholds.
- Large Spikes in Stochastic Gradient Descent: A Large-Deviations View
  Large loss spikes in SGD are polynomially likely and serve as the dominant mechanism for escaping sharp minima toward flatter solutions in the NTK regime.
- Does Weight Decay Enhance Training Stability?
  Weight decay slows progressive sharpening at the edge of stability, inducing damped oscillations in CNNs and a phase transition to sub-2/η sharpness in MLPs driven by parameter-sharpness gradient alignment, yielding more stable NTK dynamics.
- Generalization at the Edge of Stability
  Training at the edge of stability causes neural network optimizers to converge on fractal attractors whose effective dimension, measured via a new sharpness dimension from the Hessian spectrum, bounds generalization error in a way not captured by prior trace or norm measures.
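
Several of the summaries above refer to the 2/η and 2(1+β)/η sharpness thresholds. As a rough illustration, not taken from any of the listed papers, the sketch below checks the classical linear stability limits on a one-dimensional quadratic f(x) = 0.5·λ·x²: plain gradient descent should diverge once the curvature λ exceeds 2/η, and heavy-ball momentum should diverge once λ exceeds 2(1+β)/η. The function names, the heavy-ball update form, and the specific values of η and β are assumptions chosen for the example. It covers only the deterministic full-batch case and does not reproduce the small-batch 2(1-β)/η regime.

```python
def gd_final_abs(curvature, lr, steps=200, x0=1.0):
    """Run gradient descent on f(x) = 0.5 * curvature * x**2; return |x| after `steps`."""
    x = x0
    for _ in range(steps):
        x -= lr * curvature * x  # gradient of the quadratic is curvature * x
    return abs(x)


def heavy_ball_final_abs(curvature, lr, beta, steps=500, x0=1.0):
    """Run heavy-ball momentum on the same quadratic; return |x| after `steps`.

    Assumed update: x_{t+1} = x_t - lr * grad(x_t) + beta * (x_t - x_{t-1}).
    """
    x_prev, x = x0, x0
    for _ in range(steps):
        grad = curvature * x
        x, x_prev = x - lr * grad + beta * (x - x_prev), x
    return abs(x)


lr, beta = 0.1, 0.9
gd_threshold = 2.0 / lr                 # 20 for these illustrative values
hb_threshold = 2.0 * (1.0 + beta) / lr  # 38 for these illustrative values

# Just below each threshold the iterates shrink; just above, they blow up.
gd_stable = gd_final_abs(gd_threshold - 1.0, lr)
gd_unstable = gd_final_abs(gd_threshold + 1.0, lr)
hb_stable = heavy_ball_final_abs(hb_threshold - 1.0, lr, beta)
hb_unstable = heavy_ball_final_abs(hb_threshold + 1.0, lr, beta)

# A curvature between the two thresholds: unstable for plain GD, stable with momentum.
mid_gd = gd_final_abs(30.0, lr)
mid_hb = heavy_ball_final_abs(30.0, lr, beta)
```

With these values, a curvature of 30 lies between the two limits, so the same step size that diverges under plain gradient descent stays stable once momentum is added. That is the sense in which the full-batch momentum threshold is looser than 2/η.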