pith. machine review for the scientific record.

arxiv: 2604.21016 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.AI · math.OC

Recognition: unknown

SGD at the Edge of Stability: The Stochastic Sharpness Gap

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 00:55 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · math.OC
keywords edge of stability · stochastic gradient descent · sharpness · self-stabilization · gradient noise · implicit regularization · flat minima

The pith

Gradient noise in SGD stabilizes sharpness below 2/η by strengthening the cubic self-stabilization force.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper explains why mini-batch SGD reaches flatter minima than full-batch gradient descent. It extends the deterministic self-stabilization theory by showing that gradient noise adds variance to the oscillations along the top Hessian eigenvector. This extra variance amplifies the third-order term that reduces sharpness, shifting the equilibrium below the 2/η threshold. The authors prove a stochastic coupling theorem and derive a closed-form gap formula that grows as batch size shrinks and vanishes when noise disappears.

Core claim

For SGD the sharpness stabilizes below 2/η because gradient noise strengthens the cubic sharpness-reducing force. Following the projected-gradient view of the deterministic case, the authors define stochastic predicted dynamics and prove a coupling theorem that bounds the deviation of actual SGD trajectories. They obtain the explicit equilibrium gap ΔS = η β σ_u²/(4α), where α is the progressive sharpening rate, β is the self-stabilization strength, and σ_u² is the gradient noise variance projected onto the top eigenvector. The formula predicts flatter solutions for smaller batches and recovers the deterministic edge of stability when the batch size equals the full dataset.
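The gap formula is simple enough to read off numerically. Below is a minimal sketch; the parameter values are illustrative assumptions, not estimates taken from the paper.

```python
def sharpness_gap(eta, alpha, beta, sigma_u_sq):
    """Equilibrium sharpness gap Delta S = eta * beta * sigma_u^2 / (4 * alpha).

    eta: step size; alpha: progressive sharpening rate; beta: self-stabilization
    strength; sigma_u_sq: gradient noise variance projected onto the top
    Hessian eigenvector.
    """
    return eta * beta * sigma_u_sq / (4.0 * alpha)

# Illustrative (assumed) values: the equilibrium sharpness sits at 2/eta
# minus the gap, and the gap vanishes as the noise does.
eta, alpha, beta = 0.01, 1.0, 1.0
for sigma_u_sq in (100.0, 10.0, 0.0):   # shrinking noise ~ growing batch
    gap = sharpness_gap(eta, alpha, beta, sigma_u_sq)
    print(f"sigma_u^2 = {sigma_u_sq:6.1f}  S_eq = 2/eta - {gap:.3f} = {2/eta - gap:.3f}")
```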

What carries the argument

Stochastic self-stabilization, in which gradient noise injects variance into the oscillatory motion along the top Hessian eigenvector and thereby amplifies the cubic term that reduces sharpness.

If this is right

  • The sharpness gap increases linearly with the projected gradient noise variance, which grows as batch size decreases (a scaling sketch follows this list).
  • When the batch size equals the full dataset size the gap vanishes and the model recovers the deterministic edge of stability at 2/η.
  • The mechanism supplies an implicit regularization effect that favors flatter minima precisely when noise is present.
  • The cubic term in the loss is the load-bearing ingredient that converts added variance into a downward shift of the equilibrium sharpness.
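To make the first two bullets concrete, a small sketch under an assumed sampling model: if minibatches of size b are drawn without replacement from N examples, the projected variance scales as σ_u²(b) = σ₁²(1/b − 1/N). This functional form is our assumption, chosen so the gap vanishes exactly at b = N; the paper's log-log fits report slopes near the −1 implied by the leading 1/b term.

```python
def projected_noise_variance(b, N, sigma1_sq):
    """Assumed without-replacement scaling for the projected minibatch
    gradient variance; it vanishes at full batch b = N."""
    return sigma1_sq * (1.0 / b - 1.0 / N)

# Hypothetical constants: sigma1_sq is an assumed per-example projected
# gradient variance; eta, alpha, beta as in the gap formula above.
eta, alpha, beta, N, sigma1_sq = 0.01, 1.0, 1.0, 5000, 5.0e4
for b in (50, 200, 1000, 5000):
    var_u = projected_noise_variance(b, N, sigma1_sq)
    gap = eta * beta * var_u / (4.0 * alpha)
    print(f"b = {b:5d}  sigma_u^2 = {var_u:8.2f}  Delta S = {gap:6.3f}")
```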

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The gap formula offers a direct way to predict how changes in batch size or noise level will affect the flatness of the solution reached by SGD.
  • The same stochastic-coupling approach could be applied to other noisy first-order methods to derive analogous sharpness gaps.
  • Low-dimensional quadratic models with controlled noise could serve as a minimal testbed for the predicted dynamics and the coupling theorem (see the sketch after this list).
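One version of that minimal testbed, sketched below. The two-variable dynamics are our caricature of the mechanism (oscillation along u plus progressive sharpening), not the paper's Lemma 4.6, and all constants are assumptions.

```python
import numpy as np

def equilibrium_sharpness(eta=0.01, alpha=1.0, beta=1.0, sigma_u=0.0,
                          steps=100_000, seed=0):
    """Caricature of stochastic self-stabilization (assumed dynamics):
    x_t is the deviation along the top Hessian eigenvector, S_t the
    sharpness. Noise sustains E[x^2], so the sharpening/stabilization
    balance alpha = beta * E[x^2] is reached at S below 2/eta."""
    rng = np.random.default_rng(seed)
    x, S = 1e-3, 1.0
    tail = []
    for t in range(steps):
        x = (1.0 - eta * S) * x + eta * rng.normal(0.0, sigma_u)
        S += eta * (alpha - beta * x * x)   # sharpening vs. cubic force
        if t > steps // 2:
            tail.append(S)
    return float(np.mean(tail))

for sigma_u in (0.0, 10.0, 30.0, 50.0):
    s_eq = equilibrium_sharpness(sigma_u=sigma_u)
    print(f"sigma_u = {sigma_u:5.1f}  S_eq = {s_eq:7.2f}  (2/eta = 200)")
```

Without noise the toy hovers near 2/η; raising σ_u pushes the equilibrium visibly below it, which is the qualitative signature the gap formula predicts.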

Load-bearing premise

Gradient noise injects variance specifically into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force.

What would settle it

Measure the equilibrium sharpness for several batch sizes, estimate α, β, and σ_u² from the training trajectory, and check whether the observed gap equals η β σ_u²/(4α).
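As a sketch of the measurement half of that protocol, here is one way to estimate σ_u² by resampling minibatches. The inputs are stand-ins: per_example_grads would be collected via backprop at the current iterate, and u via a power-iteration or Lanczos pass; the synthetic data below only exercises the estimator.

```python
import numpy as np

def estimate_sigma_u_sq(per_example_grads, u, batch_size, n_draws=500, seed=0):
    """Estimate sigma_u^2 = Var_B[<g_B, u>] by resampling minibatches.

    per_example_grads: (n, d) array of per-example gradients (hypothetical
    input). u: top Hessian eigenvector, unit norm."""
    rng = np.random.default_rng(seed)
    n = per_example_grads.shape[0]
    proj = per_example_grads @ u                  # per-example <g_i, u>
    draws = [proj[rng.choice(n, batch_size, replace=False)].mean()
             for _ in range(n_draws)]
    return float(np.var(draws))

# Synthetic stand-in data; in the real check, compare
# eta * beta * sigma_u_sq / (4 * alpha) against 2/eta minus the
# measured equilibrium sharpness.
G = np.random.default_rng(1).normal(size=(2000, 10))
u = np.eye(10)[0]
for b in (20, 100, 500):
    print(f"b = {b:4d}  sigma_u^2 ~ {estimate_sigma_u_sq(G, u, b):.4f}")
```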

Figures

Figures reproduced from arXiv: 2604.21016 by Afroditi Kolomvaki, Anastasios Kyrillidis, Fangshuo Liao.

Figure 1. Sharpness λmax(∇²L) during training for different batch sizes. (A) FC-Tanh + MSE: smaller batch sizes produce markedly lower equilibrium sharpness; the dashed black curve shows full-batch GD. (B) CNN + MSE: the same pattern holds with a convolutional architecture, with an even larger gap (S_eq ≈ 131 at b = 50 vs. S_GD ≈ 202). Shaded regions indicate ±1 standard deviation across 5 seeds. view at source ↗
Figure 2. Log-log plot of the sharpness gap ΔS versus batch size b for FC-Tanh. Fitted slope = −1.27 (theory: −1), R² = 0.98. The dotted line shows the theoretical ΔS ∝ 1/b reference slope; the shaded band indicates the 95% confidence interval on the fit. Error bars indicate ±1 standard deviation across 3–5 seeds. The inset shows raw equilibrium sharpness S_eq versus b; the lower panel shows residuals from the log-log fit. view at source ↗
Figure 3. Projected gradient noise variance σ_u² = Var_B[⟨g_B, u⟩] versus batch size on a log-log scale. Fitted slope = −1.21 (theory: −1), R² = 0.979. The CNN architecture with MSE loss exhibits an even stronger effect: the sharpness gap at b = 50 is ΔS ≈ 71 (versus ΔS ≈ 17 for FC-Tanh), with GD reaching S_GD ≈ 202. The larger gap likely reflects the CNN's richer parameter space and stronger gradient noise along the top eigenvector. view at source ↗
Figure 4. Anatomy of the sharpness gap. (A) Measured ΔS across batch sizes with the fitted power law ΔS ∝ b^(−1.27). (B) Comparison of measured ΔS (blue) with the rescaled product βσ_u² (teal) for each batch size. The close tracking confirms that the gap is driven by the product of gradient noise and self-stabilization strength. view at source ↗
Figure 5. Simulation of the (x̂_t², ŷ_t) dynamics given in Lemma 4.6. We choose β_{s→t} = β = 1 and δ_t² = δ² = 0.5; also η = 0.01, κ_t = 0, γ_{s→t} = 0, and ζ_t ∼ N(0, σ_u²) for different choices of σ_u. Initial values are x̂_t = 1 and ŷ_t = 0. Trajectories are averaged over 100 runs. view at source ↗
Figure 6. Batch sharpness versus full-batch sharpness during training across batch sizes. view at source ↗
Figure 7. Landscape quantities during training (b = 200, η = 0.01). (A) Sharpness with the 2/η threshold. (B) Projected noise variance σ_u². (C) Self-stabilization strength β = ‖∇S⊥‖². (D) Progressive sharpening rate α; negative values reflect measurement at the oscillating iterate rather than the PGD reference trajectory (see text). Raw measurements are shown as semi-transparent dots; solid lines show Savitzky–Golay smoothing. view at source ↗
Figure 8. (A) Equilibrium sharpness versus batch size for FC-Tanh + MSE (η = 0.01). The dashed line marks the GD baseline S_GD = 207.8. (B) Equilibrium sharpness for GD (gray) and SGD with b = 200 (blue) across learning rates. SGD consistently equilibrates below GD. view at source ↗
Figure 9. Dynamics of the two quantities E[(1 + ŷ_t)² x̂_t²] and (1 + E[ŷ_t])² E[x̂_t²]. view at source ↗
Figure 10. The role of the loss function. (A) CNN + CE: sharpness peaks at ≈ 95 and collapses to ≈ 15, far below 2/η = 200, with no batch-size dependence; edge-of-stability dynamics never emerge. (B) CNN + MSE (same architecture): clean EoS behavior with a pronounced batch-size-dependent sharpness gap. Shaded regions indicate ±1 standard deviation across 3 seeds. view at source ↗
read the original abstract

When training neural networks with full-batch gradient descent (GD) and step size $\eta$, the largest eigenvalue of the Hessian -- the sharpness $S(\boldsymbol{\theta})$ -- rises to $2/\eta$ and hovers there, a phenomenon termed the Edge of Stability (EoS). \citet{damian2023selfstab} showed that this behavior is explained by a self-stabilization mechanism driven by third-order structure of the loss, and that GD implicitly follows projected gradient descent (PGD) on the constraint $ S(\boldsymbol{\theta})\leq 2/\eta$. For mini-batch stochastic gradient descent (SGD), the sharpness stabilizes below $2/\eta$, with the gap widening as the batch size decreases; yet no theoretical explanation exists for this suppression. We introduce stochastic self-stabilization, extending the self-stabilization framework to SGD. Our key insight is that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium below $2/\eta$. Following the approach of \citet{damian2023selfstab}, we define stochastic predicted dynamics relative to a moving projected gradient descent trajectory and prove a stochastic coupling theorem that bounds the deviation of SGD from these predictions. We derive a closed-form equilibrium sharpness gap: $\Delta S = \eta \beta \sigma_{\boldsymbol{u}}^{2}/(4\alpha)$, where $\alpha$ is the progressive sharpening rate, $\beta$ is the self-stabilization strength, and $\sigma_{ \boldsymbol{u}}^{2}$ is the gradient noise variance projected onto the top eigenvector. This formula predicts that smaller batch sizes yield flatter solutions and recovers GD when the batch equals the full dataset.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript extends the self-stabilization framework of Damian et al. (2023) from full-batch GD to mini-batch SGD. It argues that gradient noise injects variance into the oscillatory dynamics along the top Hessian eigenvector, strengthening the cubic sharpness-reducing force and shifting the equilibrium sharpness below 2/η. A stochastic coupling theorem is proved to bound SGD deviation from a moving projected gradient descent trajectory, and a closed-form equilibrium gap is derived: ΔS = η β σ_u²/(4α), where α is the progressive sharpening rate, β the self-stabilization strength, and σ_u² the projected noise variance. The formula is claimed to predict flatter solutions for smaller batch sizes and to recover the GD limit as the batch size approaches the full dataset.

Significance. If the central derivation holds, the work supplies the first mechanistic account of the batch-size dependence of sharpness at the edge of stability. The closed-form expression and the explicit recovery of the deterministic limit constitute clear strengths. The result would help explain why SGD tends to converge to flatter minima than GD and could inform analyses of implicit regularization in stochastic optimization.

major comments (3)
  1. [stochastic coupling theorem and equilibrium derivation] The stochastic coupling theorem is invoked to justify replacing SGD trajectories by the predicted dynamics when averaging the cubic term, yet the manuscript provides no explicit scaling of the deviation bound with batch size or η. If the accumulated deviation along the top eigenvector u is O(η) or larger, the effective cubic coefficient deviates from β and the equilibrium calculation does not close exactly at the claimed ΔS.
  2. [assumption on fixed eigenvector] Derivation of the gap formula assumes the top eigenvector u remains fixed over each oscillation period so that the strengthened cubic force can be averaged. No bound is given on the timescale of eigenvector drift relative to the oscillation period; drift on a comparable scale would invalidate the averaging step and the closed-form expression.
  3. [parameter definitions] The quantities α, β, and σ_u² are defined inside the same dynamical model used to obtain ΔS. The gap formula is therefore a re-expression of model parameters rather than an independent prediction; the manuscript should supply a procedure for estimating these quantities from the loss landscape to render the claim falsifiable.
minor comments (2)
  1. Ensure consistent notation for the projected noise variance between the abstract (σ_u²) and the main text; define the projection operator explicitly when first introduced.
  2. The claim that the formula recovers GD when the batch size equals the full dataset should be accompanied by an explicit limiting argument showing how σ_u² or β vanishes in that regime.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify important points: making the stochastic coupling rigorous, justifying the timescale separation, and establishing the empirical applicability of our results. We address each major comment below and will revise the manuscript to incorporate the requested clarifications, bounds, and estimation procedures.

read point-by-point responses
  1. Referee: The stochastic coupling theorem is invoked to justify replacing SGD trajectories by the predicted dynamics when averaging the cubic term, yet the manuscript provides no explicit scaling of the deviation bound with batch size or η. If the accumulated deviation along the top eigenvector u is O(η) or larger, the effective cubic coefficient deviates from β and the equilibrium calculation does not close exactly at the claimed ΔS.

    Authors: We agree that the scaling must be stated explicitly. The proof of the stochastic coupling theorem (Appendix B) yields a per-step deviation bound of order O(η √(1/B)) in the projected direction, where B denotes batch size; over an oscillation period of length O(1), the accumulated deviation is therefore o(1) for small η and moderate B. This perturbation affects only higher-order terms and leaves the leading cubic coefficient β unchanged, so the equilibrium derivation for ΔS remains valid. We will move this scaling analysis into the main text and add a short lemma confirming the o(1) accumulation. revision: yes

  2. Referee: Derivation of the gap formula assumes the top eigenvector u remains fixed over each oscillation period so that the strengthened cubic force can be averaged. No bound is given on the timescale of eigenvector drift relative to the oscillation period; drift on a comparable scale would invalidate the averaging step and the closed-form expression.

    Authors: The concern is well-taken. The oscillation period scales as O(η) while eigenvector drift is controlled by the progressive-sharpening rate α. We will insert a new lemma (Section 3.2) showing that the change in u over one period is bounded by O(α η), which vanishes as η → 0 and is negligible compared with the O(1) oscillation amplitude for the step sizes used in practice. This justifies the fixed-u averaging to the order required for the closed-form ΔS. revision: yes

  3. Referee: The quantities α, β, and σ_u² are defined inside the same dynamical model used to obtain ΔS. The gap formula is therefore a re-expression of model parameters rather than an independent prediction; the manuscript should supply a procedure for estimating these quantities from the loss landscape to render the claim falsifiable.

    Authors: We accept that the formula is a re-expression within the model and that explicit estimation procedures are needed for falsifiability. We will add a new subsection (Section 4.3) describing concrete procedures: α is recovered from the slope of sharpness growth during the progressive-sharpening phase; β is obtained via third-order finite differences along the top eigenvector (computed by Hessian-vector products); and σ_u² is the empirical variance of mini-batch gradients projected onto u. We will also include a small-scale experiment on a quadratic-plus-cubic toy model demonstrating that the predicted ΔS matches the observed gap when these quantities are estimated from the landscape. revision: yes
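A sketch of the proposed β step on a toy analytic loss whose sharpness gradient is known in closed form. The loss, finite-difference step sizes, and the ‖∇S‖² reading of β are our assumptions (Figure 7 suggests the paper projects out the component along u; we omit that for brevity).

```python
import numpy as np

# Hypothetical toy loss whose sharpness along x is (1 + y), so the
# sharpness gradient is known exactly: grad S = (0, 1), beta = 1.
def loss(theta):
    x, y = theta
    return 0.5 * (1.0 + y) * x**2 + 0.05 * y**2

def grad(theta, eps=1e-5):
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        g[i] = (loss(theta + e) - loss(theta - e)) / (2 * eps)
    return g

def hvp(theta, v, eps=1e-4):
    # Hessian-vector product via central differences of the gradient;
    # in practice this would be two backprop passes.
    return (grad(theta + eps * v) - grad(theta - eps * v)) / (2 * eps)

def sharpness_grad(theta, u, eps=1e-3):
    # grad_theta S(theta) with S(theta) ~ u^T H(theta) u and u held fixed:
    # a third-order quantity obtained by differencing directional curvature.
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta); e[i] = eps
        s_plus = u @ hvp(theta + e, u)
        s_minus = u @ hvp(theta - e, u)
        g[i] = (s_plus - s_minus) / (2 * eps)
    return g

theta = np.array([0.1, 0.0])
u = np.array([1.0, 0.0])                 # top Hessian eigenvector here
gS = sharpness_grad(theta, u)
print("estimated grad S:", gS, " beta ~", gS @ gS)   # expect ~(0, 1), ~1
```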

Circularity Check

0 steps flagged

No significant circularity; derivation introduces new stochastic analysis

full rationale

The paper extends the deterministic self-stabilization framework of the cited prior work by defining stochastic predicted dynamics, proving a coupling theorem that bounds SGD deviation from a moving PGD trajectory, and deriving the equilibrium gap from the noise-strengthened cubic term under stated assumptions on eigenvector fixation and projected variance. The closed-form expression uses model parameters α, β, and σ_u² to state the result of that averaging step, but does not reduce the central claim to a re-expression or fit of its own inputs by construction. The coupling bound and the explicit effect of gradient noise on the sharpness-reducing force constitute independent theoretical content not present in the inputs. No load-bearing step matches any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

3 free parameters · 1 axiom · 1 invented entity

The central claim rests on extending the deterministic self-stabilization mechanism (third-order loss structure) to the stochastic case via noise injection, together with the assumption that the moving projected gradient descent trajectory remains a valid reference for coupling.

free parameters (3)
  • α
    Progressive sharpening rate appearing in the gap formula; defined from the underlying dynamics.
  • β
    Self-stabilization strength appearing in the gap formula; defined from the underlying dynamics.
  • σ_u²
    Projected gradient noise variance appearing in the gap formula; measured from the stochastic process.
axioms (1)
  • domain assumption The self-stabilization mechanism driven by third-order loss structure extends to SGD when gradient noise is present.
    Invoked when defining stochastic self-stabilization and the coupling to the moving PGD trajectory.
invented entities (1)
  • stochastic self-stabilization no independent evidence
    purpose: Mechanism by which gradient noise strengthens the cubic sharpness-reducing force.
    New concept introduced to account for the observed suppression of sharpness below 2/η.

pith-pipeline@v0.9.0 · 5628 in / 1408 out tokens · 48040 ms · 2026-05-10T00:55:47.152904+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

62 extracted references · 5 canonical work pages

  1. A. Agarwala, F. Pedregosa, and J. Pennington. Second-order regression models exhibit progressive sharpening to the edge of stability. In International Conference on Machine Learning (ICML), 2023.
  2. K. Ahn, J. Zhang, and S. Sra. In International Conference on Machine Learning (ICML).
  3. A. Andreyev and P. Beneventano. Edge of stochastic stability. Preprint, 2025.
  4. S. Arora, Z. Li, and A. Panigrahi. Understanding gradient descent on the edge of stability in deep learning. In International Conference on Machine Learning (ICML), 2022.
  5. G. Blanc, N. Gupta, G. Valiant, and P. Valiant. In Conference on Learning Theory (COLT).
  6. L. Chen, J. Bruna, and K. Lyu. arXiv:2206.04172.
  7. L. Chen and J. Bruna. Beyond the edge of stability via two-step gradient updates. In International Conference on Machine Learning (ICML), 2023.
  8. J. M. Cohen, S. Kaur, Y. Li, J. Z. Kolter, and A. Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), 2021.
  9. A. Damian, T. Ma, and J. D. Lee. Label noise SGD provably prefers flat global minimizers. In Advances in Neural Information Processing Systems (NeurIPS), 2021.
  10. A. Damian, J. D. Lee, and M. Soltanolkotabi. Neural networks can learn representations with gradient descent. In Conference on Learning Theory (COLT), 2022.
  11. A. Damian, E. Nichani, and J. D. Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. In International Conference on Learning Representations (ICLR), 2023.
  12. P. Foret, A. Kleiner, H. Mobahi, and B. Neyshabur. In International Conference on Learning Representations (ICLR).
  13. J. Gilmer, B. Ghorbani, A. Garg, et al. In International Conference on Learning Representations (ICLR).
  14. J. Z. HaoChen, C. Wei, J. D. Lee, and T. Ma. Shape matters: Understanding the implicit bias of the noise covariance. In Conference on Learning Theory (COLT), 2021.
  15. S. Jastrzebski, Z. Kenton, N. Ballas, et al. On the relation between the sharpest directions of DNN loss and the SGD step length. In International Conference on Learning Representations (ICLR), 2019.
  16. S. Jastrzebski, M. Szymczak, S. Fort, et al. The break-even point on optimization trajectories of deep neural networks. In International Conference on Learning Representations (ICLR), 2020.
  17. Y. Jiang, B. Neyshabur, H. Mobahi, D. Krishnan, and S. Bengio. In International Conference on Learning Representations (ICLR).
  18. D. S. Kalra, T. He, and M. Barkeshli. Universal sharpness dynamics in neural network training: Fixed point analysis, edge of stability, and route to chaos. In International Conference on Learning Representations (ICLR), 2025.
  19. N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations (ICLR), 2017.
  20. I. Kreisler, M. S. Nacson, D. Soudry, and Y. Carmon. Gradient descent monotonically decreases the sharpness of gradient flow solutions in scalar networks and beyond. In International Conference on Machine Learning (ICML), 2023.
  21. Large learning rate tames homogeneity: Convergence and balancing effect. In International Conference on Learning Representations (ICLR).
  22. S. Lee and C. Jang. A new characterization of the edge of stability based on a sharpness measure aware of batch gradient distribution. In International Conference on Learning Representations (ICLR), 2023.
  23. A. Lewkowycz, Y. Bahri, E. Dyer, J. Sohl-Dickstein, and G. Gur-Ari. The large learning rate phase of deep learning: the catapult mechanism. arXiv:2003.02218, 2020.
  24. Q. Li, C. Tai, and W. E. Journal of Machine Learning Research (JMLR).
  25. Z. Li, T. Wang, and S. Arora. In International Conference on Learning Representations (ICLR).
  26. Z. Li, T. Wang, and S. Arora. In Advances in Neural Information Processing Systems (NeurIPS).
  27. Z. Li, Z. Wang, and J. Li. Analyzing sharpness along GD trajectory: Progressive sharpening and edge of stability. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  28. K. Lyu, Z. Li, and S. Arora. In Advances in Neural Information Processing Systems (NeurIPS).
  29. S. Ma and L. Ying. In Advances in Neural Information Processing Systems (NeurIPS).
  30. S. Mandt, M. D. Hoffman, and D. M. Blei. Journal of Machine Learning Research (JMLR).
  31. S. L. Smith, P.-J. Kindermans, C. Ying, and Q. V. Le. Three factors influencing minima in SGD. arXiv:1711.04623, 2017.
  32. S. L. Smith, B. Dherin, D. G. T. Barrett, and S. De. On the origin of implicit regularization in stochastic gradient descent. In International Conference on Learning Representations (ICLR), 2021.
  33. Y. Wang, Z. Xu, T. Zhao, and M. Tao. Good regularity creates large learning rate implicit biases: Edge of stability, balancing, and catapult. arXiv:2310.17087, 2023.
  34. L. Wu, C. Ma, et al. How SGD selects the global minima in over-parameterized learning: A dynamical stability perspective. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
  35. L. Wu, M. Wang, and W. Su. The alignment property of SGD noise and how it helps select flat minima: A stability analysis. In Advances in Neural Information Processing Systems (NeurIPS), 2022.
  36. C. Xing, D. Arpit, C. Tsirigotis, and Y. Bengio. A walk with SGD. arXiv:1802.08770, 2018.
  37. X. Zhu, Z. Wang, X. Wang, M. Zhou, and R. Ge. Understanding edge-of-stability training dynamics with a minimalist example. In International Conference on Learning Representations (ICLR), 2023.