SGD at the Edge of Stability: The Stochastic Sharpness Gap
Pith reviewed 2026-05-10 00:55 UTC · model grok-4.3
The pith
Gradient noise in SGD stabilizes sharpness below 2/η by strengthening the cubic self-stabilization force.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For SGD the sharpness stabilizes below 2/η because gradient noise strengthens the cubic sharpness-reducing force. Following the projected-gradient view of the deterministic case, the authors define stochastic predicted dynamics and prove a coupling theorem that bounds the deviation of actual SGD trajectories. They obtain the explicit equilibrium gap ΔS = η β σ_u²/(4α), where α is the progressive sharpening rate, β is the self-stabilization strength, and σ_u² is the gradient noise variance projected onto the top eigenvector. The formula predicts flatter solutions for smaller batches and recovers the deterministic edge of stability when the batch size equals the full dataset.
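The closed-form gap can be treated as a small calculator. The sketch below (hypothetical helper names, not from the paper) evaluates ΔS = η β σ_u²/(4α) and the resulting equilibrium sharpness, assuming only the formula as stated:

```python
def sharpness_gap(eta, alpha, beta, sigma_u2):
    """Equilibrium sharpness gap Delta S = eta * beta * sigma_u^2 / (4 * alpha)."""
    return eta * beta * sigma_u2 / (4.0 * alpha)

def equilibrium_sharpness(eta, alpha, beta, sigma_u2):
    """Predicted SGD equilibrium: the deterministic threshold 2/eta minus the gap."""
    return 2.0 / eta - sharpness_gap(eta, alpha, beta, sigma_u2)

# Under the usual 1/B scaling of gradient noise, halving the batch size
# doubles sigma_u^2 and hence doubles the predicted gap; at sigma_u2 = 0
# (full batch) the deterministic edge of stability 2/eta is recovered.
```

Because σ_u² typically scales like 1/B, the formula predicts a gap roughly inversely proportional to batch size that vanishes at full batch.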
What carries the argument
Stochastic self-stabilization, in which gradient noise injects variance into the oscillatory motion along the top Hessian eigenvector and thereby amplifies the cubic term that reduces sharpness.
If this is right
- The sharpness gap increases linearly with the projected gradient noise variance, which grows as batch size decreases.
- When the batch size equals the full dataset size the gap vanishes and the model recovers the deterministic edge of stability at 2/η.
- The mechanism supplies an implicit regularization effect that favors flatter minima precisely when noise is present.
- The cubic term in the loss is the load-bearing ingredient that converts added variance into a downward shift of the equilibrium sharpness.
Where Pith is reading between the lines
- The gap formula offers a direct way to predict how changes in batch size or noise level will affect the flatness of the solution reached by SGD.
- The same stochastic-coupling approach could be applied to other noisy first-order methods to derive analogous sharpness gaps.
- Low-dimensional quadratic models with controlled noise could serve as a minimal testbed for the predicted dynamics and the coupling theorem.
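The minimal testbed suggested in the last bullet can be sketched in a few lines. This is a hypothetical reduced model of our own construction, not the paper's dynamics: progressive sharpening at rate `alpha` competes with a cubic restoring force `beta * x**2` along the top eigenvector, and Gaussian gradient noise of scale `sigma` is injected into the oscillatory coordinate:

```python
import random

def simulate(eta, alpha, beta, sigma, steps=20000, seed=0):
    """Iterate a hypothetical reduced edge-of-stability model.

    x : component of the parameters along the top Hessian eigenvector u
    S : sharpness (top Hessian eigenvalue); 2/eta is the stability threshold
    Returns the time-averaged sharpness over the second half of the run.
    """
    rng = random.Random(seed)
    x = 0.01
    S = 2.0 / eta - 1.0                   # start just below the threshold
    tail = []
    for t in range(steps):
        # oscillatory step along u, plus projected gradient noise of scale sigma
        x = (1.0 - eta * S) * x + eta * sigma * rng.gauss(0.0, 1.0)
        # progressive sharpening (alpha) vs cubic self-stabilization (beta * x^2)
        S = S + alpha - beta * x * x
        if t >= steps // 2:
            tail.append(S)
    return sum(tail) / len(tail)
```

With eta = 0.05 (threshold 2/η = 40), alpha = 0.01, and beta = 1.0, the time-averaged sharpness settles below 40 and the gap widens as sigma grows — the qualitative signature of the stochastic sharpness gap, though the constant in this toy model need not match the paper's 1/(4α).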
Load-bearing premise
Gradient noise injects variance specifically into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force.
What would settle it
Measure the equilibrium sharpness for several batch sizes, estimate α, β, and σ_u² from the training trajectory, and check whether the observed gap equals η β σ_u²/(4α).
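The check above needs an estimator for σ_u². A minimal sketch (hypothetical function name; it assumes the minibatch gradients and the top eigenvector u have already been computed, e.g. by power iteration on Hessian-vector products):

```python
import numpy as np

def projected_noise_variance(minibatch_grads, full_grad, u):
    """sigma_u^2: variance of minibatch gradient noise projected onto u.

    minibatch_grads : (k, d) array of k minibatch gradients at one parameter point
    full_grad       : (d,) full-batch gradient at the same point
    u               : (d,) top Hessian eigenvector (need not be normalized)
    """
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    noise = np.asarray(minibatch_grads, dtype=float) - np.asarray(full_grad, dtype=float)
    # The noise is mean-zero in expectation, so average the squared projections.
    return float(np.mean((noise @ u) ** 2))
```

With α and β estimated separately from the trajectory, the observed gap can then be compared against η β σ_u²/(4α) across several batch sizes.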
Original abstract
When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $ S(\boldsymbol{\theta})\leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{ \boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript extends the self-stabilization framework of Damian et al. (2023) from full-batch GD to mini-batch SGD. It argues that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium sharpness below 2/η. A stochastic coupling theorem is proved to bound SGD deviation from a moving projected gradient descent trajectory, and a closed-form equilibrium gap is derived: ΔS = η β σ_u²/(4α), where α is the progressive sharpening rate, β the self-stabilization strength, and σ_u² the projected noise variance. The formula is claimed to predict flatter solutions for smaller batch sizes and to recover the GD limit as the batch size approaches the full dataset.
Significance. If the central derivation holds, the work supplies the first mechanistic account of the batch-size dependence of sharpness at the edge of stability. The closed-form expression and the explicit recovery of the deterministic limit constitute clear strengths. The result would help explain why SGD tends to converge to flatter minima than GD and could inform analyses of implicit regularization in stochastic optimization.
Major comments (3)
- [stochastic coupling theorem and equilibrium derivation] The stochastic coupling theorem is invoked to justify replacing SGD trajectories by the predicted dynamics when averaging the cubic term, yet the manuscript provides no explicit scaling of the deviation bound with batch size or η. If the accumulated deviation along the top eigenvector u is O(η) or larger, the effective cubic coefficient deviates from β and the equilibrium calculation does not close exactly at the claimed ΔS.
- [assumption on fixed eigenvector] Derivation of the gap formula assumes the top eigenvector u remains fixed over each oscillation period so that the strengthened cubic force can be averaged. No bound is given on the timescale of eigenvector drift relative to the oscillation period; drift on a comparable scale would invalidate the averaging step and the closed-form expression.
- [parameter definitions] The quantities α, β, and σ_u² are defined inside the same dynamical model used to obtain ΔS. The gap formula is therefore a re-expression of model parameters rather than an independent prediction; the manuscript should supply a procedure for estimating these quantities from the loss landscape to render the claim falsifiable.
Minor comments (2)
- Ensure consistent notation for the projected noise variance between the abstract (σ_u²) and the main text; define the projection operator explicitly when first introduced.
- The claim that the formula recovers GD when the batch size equals the full dataset should be accompanied by an explicit limiting argument showing how σ_u² or β vanishes in that regime.
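A limiting argument of the requested kind is available under a standard sampling assumption: if minibatches of size B are drawn without replacement from N examples, the projected noise variance carries the finite-population correction (τ_u² here denotes the per-example gradient variance along u — our notation, not the paper's):

```latex
\sigma_{u}^{2}(B) \;=\; \frac{1}{B}\cdot\frac{N-B}{N-1}\,\tau_{u}^{2},
\qquad
\Delta S(B) \;=\; \frac{\eta\beta}{4\alpha}\,\sigma_{u}^{2}(B)
\;\longrightarrow\; 0 \quad \text{as } B \to N .
```

The gap thus closes linearly in (N − B) near full batch, and the deterministic 2/η threshold is recovered exactly at B = N.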
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The comments identify important points: making the stochastic coupling rigorous, justifying the timescale separation, and establishing the empirical applicability of our results. We address each major comment below and will revise the manuscript to incorporate the requested bounds, clarifications, and estimation procedures.
Point-by-point responses
Referee: The stochastic coupling theorem is invoked to justify replacing SGD trajectories by the predicted dynamics when averaging the cubic term, yet the manuscript provides no explicit scaling of the deviation bound with batch size or η. If the accumulated deviation along the top eigenvector u is O(η) or larger, the effective cubic coefficient deviates from β and the equilibrium calculation does not close exactly at the claimed ΔS.
Authors: We agree that the scaling must be stated explicitly. The proof of the stochastic coupling theorem (Appendix B) yields a per-step deviation bound of order O(η √(1/B)) in the projected direction, where B denotes batch size; over an oscillation period of length O(1), the accumulated deviation is therefore o(1) for small η and moderate B. This perturbation affects only higher-order terms and leaves the leading cubic coefficient β unchanged, so the equilibrium derivation for ΔS remains valid. We will move this scaling analysis into the main text and add a short lemma confirming the o(1) accumulation. revision: yes
Referee: Derivation of the gap formula assumes the top eigenvector u remains fixed over each oscillation period so that the strengthened cubic force can be averaged. No bound is given on the timescale of eigenvector drift relative to the oscillation period; drift on a comparable scale would invalidate the averaging step and the closed-form expression.
Authors: The concern is well-taken. The oscillation period scales as O(η) while eigenvector drift is controlled by the progressive-sharpening rate α. We will insert a new lemma (Section 3.2) showing that the change in u over one period is bounded by O(α η), which vanishes as η → 0 and is negligible compared with the O(1) oscillation amplitude for the step sizes used in practice. This justifies the fixed-u averaging to the order required for the closed-form ΔS. revision: yes
Referee: The quantities α, β, and σ_u² are defined inside the same dynamical model used to obtain ΔS. The gap formula is therefore a re-expression of model parameters rather than an independent prediction; the manuscript should supply a procedure for estimating these quantities from the loss landscape to render the claim falsifiable.
Authors: We accept that the formula is a re-expression within the model and that explicit estimation procedures are needed for falsifiability. We will add a new subsection (Section 4.3) describing concrete procedures: α is recovered from the slope of sharpness growth during the progressive-sharpening phase; β is obtained via third-order finite differences along the top eigenvector (computed by Hessian-vector products); and σ_u² is the empirical variance of mini-batch gradients projected onto u. We will also include a small-scale experiment on a quadratic-plus-cubic toy model demonstrating that the predicted ΔS matches the observed gap when these quantities are estimated from the landscape. revision: yes
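The promised estimator for β can be made concrete. A sketch under our own assumptions (the paper may normalize β differently): the third directional derivative of the loss along u, obtained by central differences of Hessian-vector products, where `hvp(theta, v)` is any user-supplied routine returning H(theta) @ v:

```python
import numpy as np

def third_directional_derivative(hvp, theta, u, eps=1e-3):
    """Estimate D^3 L(theta)[u, u, u] = u^T (nabla_u H) u by central
    differences of Hessian-vector products along the unit direction u."""
    u = np.asarray(u, dtype=float)
    u = u / np.linalg.norm(u)
    plus = u @ hvp(theta + eps * u, u)   # u^T H(theta + eps*u) u
    minus = u @ hvp(theta - eps * u, u)  # u^T H(theta - eps*u) u
    return (plus - minus) / (2.0 * eps)
```

As a sanity check, for the one-dimensional loss L(x) = x³ the Hessian is 6x and the third derivative is exactly 6, which the central difference recovers.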
Circularity Check
No significant circularity; derivation introduces new stochastic analysis
Full rationale
The paper extends the deterministic self-stabilization framework of the cited prior work by defining stochastic predicted dynamics, proving a coupling theorem that bounds SGD deviation from a moving PGD trajectory, and deriving the equilibrium gap from the noise-strengthened cubic term under stated assumptions on eigenvector fixation and projected variance. The closed-form expression uses model parameters α, β, and σ_u² to state the result of that averaging step, but does not reduce the central claim to a re-expression or fit of its own inputs by construction. The coupling bound and the explicit effect of gradient noise on the sharpness-reducing force constitute independent theoretical content not present in the inputs. No load-bearing step matches any enumerated circularity pattern.
Axiom & Free-Parameter Ledger
Free parameters (3)
- α
- β
- σ_u²
Axioms (1)
- Domain assumption: the self-stabilization mechanism driven by third-order loss structure extends to SGD when gradient noise is present.
Invented entities (1)
- stochastic self-stabilization (no independent evidence)
Reference graph
Works this paper leans on
- [1] A. Agarwala, F. Pedregosa, and J. Pennington. Second-order regression models exhibit progressive sharpening to the edge of stability. In International Conference on Machine Learning, 2023.
- [2] K. Ahn, J. Zhang, and S. Sra. In Proceedings of the International Conference on Machine Learning (ICML).
- [3] A. Andreyev and P. Beneventano. Edge of stochastic stability. Preprint, 2025.
- [4] S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning, 2022.
- [5] G. Blanc, N. Gupta, G. Valiant, and P. Valiant. In Proceedings of the Conference on Learning Theory (COLT).
- [6] L. Chen and J. Bruna. Beyond the edge of stability via two-step gradient updates. In International Conference on Machine Learning, 2023.
- [7] J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations, 2021.
- [8] A. Damian, T. Ma, and J. D. Lee. Label noise SGD provably prefers flat global minimizers. In Advances in Neural Information Processing Systems, 2021.
- [9] A. Damian, J. D. Lee, and M. Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory, 2022.
- [10] A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. In International Conference on Learning Representations, 2023.
- [11] P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. In Proceedings of the International Conference on Learning Representations (ICLR).
- [12] J. Gilmer, B. Ghorbani, A. Garg, et al. In Proceedings of the International Conference on Learning Representations (ICLR).
- [13] J. Z. HaoChen, C. Wei, J. D. Lee, and T. Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory, 2021.
- [14] S. Jastrzebski, Z. Kenton, N. Ballas, et al. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations, 2019.
- [15] S. Jastrzebski, M. Szymczak, S. Fort, et al. The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations, 2020.
- [16] Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. In Proceedings of the International Conference on Learning Representations (ICLR).
- [17] D. S. Kalra, T. He, and M. Barkeshli. Universal sharpness dynamics in neural network training: Fixed point analysis, edge of stability, and route to chaos. In International Conference on Learning Representations, 2025.
- [18] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
- [19] I. Kreisler, M. S. Nacson, D. Soudry, and Y. Carmon. Gradient descent monotonically decreases the sharpness of gradient flow solutions in scalar networks and beyond. In International Conference on Machine Learning, 2023.
- [20] Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect. In International Conference on Learning Representations.
- [21] S. Lee and C. Jang. A new characterization of the edge of stability based on a sharpness measure aware of batch gradient distribution. In International Conference on Learning Representations, 2023.
- [22] A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv preprint arXiv:2003.02218, 2020.
- [23] Q. Li, C. Tai, and W. E. Journal of Machine Learning Research (JMLR).
- [24] Z. Li, T. Wang, and S. Arora. In Proceedings of the International Conference on Learning Representations (ICLR).
- [25] Z. Li, T. Wang, and S. Arora. Advances in Neural Information Processing Systems (NeurIPS).
- [26] Z. Li, Z. Wang, and J. Li. Analyzing sharpness along GD trajectory: Progressive sharpening and edge of stability. In Advances in Neural Information Processing Systems, 2022.
- [27] K. Lyu, Z. Li, and S. Arora. Advances in Neural Information Processing Systems (NeurIPS).
- [28] S. Ma and L. Ying. Advances in Neural Information Processing Systems (NeurIPS).
- [29] S. Mandt, M. D. Hoffman, and D. M. Blei. Journal of Machine Learning Research (JMLR).
- [30] S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Three factors influencing minima in SGD. arXiv preprint arXiv:1711.04623, 2017.
- [31] S. L. Smith, B. Dherin, D. G. T. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations, 2021.
- [32] Y. Wang, Z. Xu, T. Zhao, and M. Tao. Good regularity creates large learning rate implicit biases: Edge of stability, balancing, and catapult. arXiv preprint arXiv:2310.17087, 2023.
- [33] L. Wu, C. Ma, and W. E. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems, 2018.
- [34] L. Wu, M. Wang, and W. Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. In Advances in Neural Information Processing Systems, 2022.
- [35] X. Zhu, Z. Wang, X. Wang, M. Zhou, and R. Ge. Understanding edge-of-stability training dynamics with a minimalist example. In International Conference on Learning Representations, 2023.