The Origin of Edge of Stability
Pith reviewed 2026-05-10 01:20 UTC · model grok-4.3
The pith
Gradient descent on neural networks is forced exactly to the edge of stability at curvature 2/η from arbitrary initialization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The edge coupling is a functional on consecutive iterate pairs whose coefficient is fixed uniquely by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary 2/η. A second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward 2/η. The mean value theorem localizes each expression to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
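The 2/η boundary in the core claim can be seen in its simplest instance, which is not the paper's construction but the textbook one: on a quadratic f(x) = λx²/2, the GD update multiplier is (1 − ηλ), which contracts exactly when λ < 2/η. A minimal numeric check:

```python
eta = 0.1  # stability threshold for plain GD: 2 / eta = 20.0

def gd_final(lam, x0=1.0, steps=50):
    """Run GD on f(x) = lam * x**2 / 2 and return |x| after `steps` steps."""
    x = x0
    for _ in range(steps):
        x -= eta * lam * x  # update multiplier is (1 - eta * lam)
    return abs(x)

print(gd_final(19.0))  # curvature below 2/eta: contracts toward 0
print(gd_final(21.0))  # curvature above 2/eta: diverges
```

The claim of the paper is precisely that full training trajectories are forced onto this boundary, not merely that the boundary exists.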
What carries the argument
The edge coupling functional on consecutive parameter pairs, with coefficient fixed by the gradient descent update; differencing its criticality condition and second-order expanding the loss change, then localizing both via the mean value theorem to the exact Hessian.
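In our notation (the abstract does not fix symbols), the second-order expansion with mean value localization that carries the argument takes the standard Taylor-with-Lagrange-remainder form:

```latex
\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t), \qquad
L(\theta_{t+1}) - L(\theta_t)
  = -\eta\,\|\nabla L(\theta_t)\|^2
  + \frac{\eta^2}{2}\,
    \nabla L(\theta_t)^{\top}\, \nabla^2 L(\xi_t)\, \nabla L(\theta_t),
```

where ξ_t lies in the interior of the segment [θ_t, θ_{t+1}]. The per-step loss change is negative exactly when the Rayleigh quotient of ∇²L(ξ_t) along the gradient direction is below 2/η, which is the sense in which telescoping this formula forces curvature toward the threshold.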
If this is right
- The same coupling classifies fixed points and period-two orbits by setting its gradients to zero.
- Near a fixed point the dynamics reduce to a function of half-amplitude that decides which directions support period-two orbits and on which side of the critical learning rate they appear.
- The forcing holds for the exact Hessian eigenvalue rather than any averaged quantity.
- Both the recurrence and the loss-change formula involve different Hessian averages yet localize to the same interior point via the mean value theorem.
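The period-two claim has a trivial quadratic instance worth keeping in mind: when the curvature sits exactly at 2/η, the update multiplier is 1 − ηλ = -1 and every initialization is an exact period-two orbit. This is only the degenerate linear case, not the paper's general classification:

```python
eta = 0.1
lam = 2 / eta  # curvature exactly at the edge: update multiplier 1 - eta*lam = -1

x, traj = 0.7, []
for _ in range(6):
    traj.append(x)
    x -= eta * lam * x  # x -> -x exactly
print(traj)  # [0.7, -0.7, 0.7, -0.7, 0.7, -0.7]
```

The nontrivial content of the paper's analysis is deciding, for non-quadratic losses, which directions support such orbits and on which side of the critical learning rate they appear.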
Where Pith is reading between the lines
- Similar coupling constructions could be written for other first-order methods to predict their stability thresholds.
- Adding momentum or adaptive steps would likely modify the edge coupling and shift the observed edge.
- The period-two orbit analysis offers a concrete way to predict when training begins oscillating in particular directions.
- The derivation suggests the edge of stability is a structural consequence of the gradient descent update rule itself.
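The momentum extrapolation can at least be sanity-checked on a quadratic. For heavy-ball momentum with coefficient β, the standard stability analysis (not a claim of this paper) puts the quadratic threshold at 2(1+β)/η rather than 2/η, so a modified edge coupling would indeed have to track a shifted boundary:

```python
eta, beta = 0.1, 0.9  # heavy-ball threshold on a quadratic: 2 * (1 + beta) / eta = 38.0

def run(lam, steps=100):
    """Heavy-ball GD on f(x) = lam * x**2 / 2; return |x| after `steps` steps."""
    x_prev, x = 1.0, 1.0
    for _ in range(steps):
        x, x_prev = x - eta * lam * x + beta * (x - x_prev), x
    return abs(x)

print(run(36.0), run(40.0))  # curvature below vs above the shifted edge
```

Plain GD with η = 0.1 would already diverge at both of these curvatures; momentum moves the edge, which is exactly the kind of shift the speculation anticipates.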
Load-bearing premise
The edge coupling functional, whose coefficient is fixed by the gradient descent update, has a criticality condition whose differencing and second-order expansion apply directly to the neural network loss from arbitrary initialization.
What would settle it
A stable training run on any loss where the largest Hessian eigenvalue settles away from exactly 2/η for the chosen learning rate would falsify the exact-forcing claim.
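Running this falsification test requires tracking the largest Hessian eigenvalue along a trajectory. One standard way, needing only a gradient oracle, is power iteration on finite-difference Hessian-vector products; a minimal sketch (function names are ours, not the paper's):

```python
import numpy as np

def hvp(grad_fn, x, v, eps=1e-5):
    """Central-difference Hessian-vector product H(x) @ v from gradients alone."""
    return (grad_fn(x + eps * v) - grad_fn(x - eps * v)) / (2 * eps)

def lambda_max(grad_fn, x, iters=200, seed=0):
    """Estimate the largest-magnitude Hessian eigenvalue at x by power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(x.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(grad_fn, x, v)
        v = hv / np.linalg.norm(hv)
    return float(v @ hvp(grad_fn, x, v))
```

Logging this estimate every few steps of a full-batch GD run and comparing it to 2/η is the experiment the falsification criterion calls for.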
Original abstract
Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/\eta$, where $\eta$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/\eta$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/\eta$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/\eta$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that full-batch gradient descent on neural networks is driven to the Edge of Stability (largest Hessian eigenvalue exactly equal to 2/η) from arbitrary initialization by the introduction of an 'edge coupling' functional on consecutive iterate pairs. The coefficient of this functional is fixed by the GD update rule; differencing its criticality condition produces a step recurrence whose stability boundary is 2/η, while a second-order expansion produces a loss-change formula. Telescoping the latter and applying the mean-value theorem to both resulting Hessian averages is asserted to localize them to the true Hessian at an interior point of each step segment, thereby forcing the eigenvalue exactly to the threshold with no gap. The same functional is used to classify fixed points and period-two orbits, reducing near fixed points to a function of half-amplitude that determines which directions support period-two behavior and on which side of the critical learning rate they appear.
Significance. If the central derivation is correct, the work supplies the missing 'origin' explanation for why trajectories are forced onto the edge rather than merely self-regulated near it, using only a parameter-free functional and standard calculus. This would be a substantive advance over prior accounts that establish stability but not the forcing mechanism from arbitrary initialization. The reduction of the period-two analysis to a scalar function of half-amplitude is a clean technical contribution that could be useful for further study of oscillatory behavior.
Major comments (2)
- [section deriving loss-change formula and step recurrence] The central claim of 'exact forcing … with no gap' rests on the assertion that the mean-value theorem applied to the two distinct Hessian averages (one from the differenced criticality condition, one from the second-order loss-change expansion) localizes both to the current Hessian eigenvalue. However, the MVT supplies an interior point for each average separately; nothing in the derivation forces these interior points to coincide or pins the eigenvalue exactly at the current iterate rather than at some nearby ξ. This step is load-bearing for the 'from arbitrary initialization' and 'no gap' claims and is not obvious for non-quadratic losses (see the paragraph following the definition of the edge coupling and the subsequent telescoping-sum argument).
- [introduction of the edge coupling functional] The weakest assumption—that the criticality condition of the edge coupling applies directly to the neural-network loss from arbitrary initialization—is used without explicit verification that the functional remains well-defined and that its gradient vanishes in a manner compatible with the GD trajectory for non-convex, non-quadratic losses. This needs to be stated as an assumption or proved for the class of losses considered.
Minor comments (2)
- [definition of edge coupling] Notation for the edge coupling functional and its two gradients should be introduced with a single displayed equation rather than scattered across paragraphs; this would improve readability when the differencing and expansion steps are later referenced.
- [related work] The manuscript would benefit from a short table or diagram contrasting the new edge-coupling argument with prior self-regulation accounts (e.g., which step each explains and which assumptions each relaxes).
Simulated Author's Rebuttal
We thank the referee for the careful reading and constructive comments, which help clarify the presentation of our central claims. We address each major comment below.
Point-by-point responses
Referee: [section deriving loss-change formula and step recurrence] The central claim of 'exact forcing … with no gap' rests on the assertion that the mean-value theorem applied to the two distinct Hessian averages (one from the differenced criticality condition, one from the second-order loss-change expansion) localizes both to the current Hessian eigenvalue. However, the MVT supplies an interior point for each average separately; nothing in the derivation forces these interior points to coincide or pins the eigenvalue exactly at the current iterate rather than at some nearby ξ. This step is load-bearing for the 'from arbitrary initialization' and 'no gap' claims and is not obvious for non-quadratic losses (see the paragraph following the definition of the edge coupling and the subsequent telescoping-sum argument).
Authors: We appreciate the referee's identification of this subtlety in the MVT applications. The two Hessian averages are defined over the identical step interval between consecutive iterates. While the MVT yields (potentially distinct) interior points for each average, the step recurrence obtained by differencing the criticality condition and the telescoping loss-change formula together enforce a consistency requirement on the curvature. Any sustained gap below 2/η would violate the stability boundary of the recurrence, forcing the largest eigenvalue to the threshold at the iterates. For non-quadratic losses the localization remains to interior points, yet the iterative nature of the trajectory propagates the forcing from arbitrary initialization. We will revise the relevant section to explicitly distinguish the interior points from the iterate locations and add a short discussion of continuity of the Hessian along the path to clarify why the eigenvalue at the current iterate is pinned. This constitutes a partial revision for improved rigor and clarity.
Revision: partial
Referee: [introduction of the edge coupling functional] The weakest assumption—that the criticality condition of the edge coupling applies directly to the neural-network loss from arbitrary initialization—is used without explicit verification that the functional remains well-defined and that its gradient vanishes in a manner compatible with the GD trajectory for non-convex, non-quadratic losses. This needs to be stated as an assumption or proved for the class of losses considered.
Authors: We agree that the applicability of the edge coupling should be stated explicitly rather than left implicit. The functional is defined on consecutive iterate pairs with its coefficient fixed by the GD update rule, after which the criticality condition is imposed formally. In the revised manuscript we will insert a dedicated paragraph immediately after the definition of the edge coupling, stating that we assume the functional is well-defined for twice-differentiable losses and that its gradient vanishes in a manner compatible with the GD trajectory. This renders the foundational assumption transparent without claiming a general proof for arbitrary non-convex losses; all subsequent derivations follow from this assumption.
Revision: yes
Circularity Check
No circularity: the derivation introduces a new functional fixed by the GD rule, then applies standard calculus and the MVT.
Full rationale
The paper defines the edge coupling functional with coefficient fixed by the gradient-descent update, then derives the step recurrence via differencing its criticality condition and the loss-change formula via second-order expansion. These steps use ordinary calculus on the newly introduced object. The mean-value theorem is invoked to localize the two distinct Hessian averages to interior points, but this is a standard theorem application rather than a reduction of the target result to the inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the abstract or outline. The chain remains independent of the claimed 2/η forcing.
Axiom & Free-Parameter Ledger
Axioms (1)
- [standard math] The mean value theorem applies to the loss and gradient functions along each gradient-descent step segment.
Invented entities (1)
- edge coupling (no independent evidence)