pith. machine review for the scientific record.

arxiv: 2604.20446 · v1 · submitted 2026-04-22 · 💻 cs.LG · stat.ML

Recognition: unknown

The Origin of Edge of Stability

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 01:20 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: edge of stability · gradient descent · Hessian eigenvalue · neural network loss · full-batch optimization · period-two orbits · mean value theorem

The pith

Gradient descent on neural networks is forced exactly to the edge of stability at curvature 2/η from arbitrary initialization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Full-batch gradient descent on neural networks consistently drives the largest Hessian eigenvalue of the loss to exactly 2/η, where η is the learning rate. Earlier accounts showed how the system self-regulates once near this boundary but did not explain the attraction from arbitrary starting points. The paper introduces an edge coupling functional on consecutive parameter pairs whose coefficient is fixed by the gradient descent update itself. Differencing the criticality condition of this functional produces a recurrence whose stability boundary is 2/η. A second-order expansion of the loss change along the same functional produces a telescoping sum that forces integrated curvature to the same threshold. The mean value theorem then maps both expressions back to the true Hessian at an interior point of each step, enforcing the eigenvalue exactly with no averaging gap.
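
A minimal way to see the phenomenon this paper explains (an illustration of ours, not the paper's edge-coupling construction): run full-batch gradient descent on a small network and estimate the largest-magnitude Hessian eigenvalue by power iteration on Hessian-vector products. The architecture, synthetic data, and η = 0.5 (loosely mirroring Figure 1, where 2/η = 4) are arbitrary choices, and whether a given run sharpens all the way to the edge depends on the problem.

    import torch

    torch.manual_seed(0)
    X = torch.randn(64, 8)
    y = torch.randn(64, 1)
    net = torch.nn.Sequential(torch.nn.Linear(8, 16), torch.nn.GELU(),
                              torch.nn.Linear(16, 1))
    eta = 0.5  # EoS predicts sharpness should hover near 2/eta = 4

    def loss_fn():
        return torch.nn.functional.mse_loss(net(X), y)

    def sharpness(n_iter=50):
        # Power iteration on Hessian-vector products; estimates the
        # largest-magnitude Hessian eigenvalue of the loss.
        params = list(net.parameters())
        v = [torch.randn_like(p) for p in params]
        for _ in range(n_iter):
            grads = torch.autograd.grad(loss_fn(), params, create_graph=True)
            gv = sum((g * u).sum() for g, u in zip(grads, v))
            hv = torch.autograd.grad(gv, params)
            lam = torch.sqrt(sum((h ** 2).sum() for h in hv))  # ||Hv|| ~ lambda once v is unit
            v = [h / lam for h in hv]
        return lam.item()

    for step in range(1001):
        grads = torch.autograd.grad(loss_fn(), list(net.parameters()))
        with torch.no_grad():
            for p, g in zip(net.parameters(), grads):
                p -= eta * g  # plain full-batch GD step
        if step % 100 == 0:
            print(f"step {step:4d}  lambda_max ~ {sharpness():.3f}  (2/eta = {2/eta:.1f})")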

Core claim

The edge coupling is a functional on consecutive iterate pairs whose coefficient is fixed uniquely by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary 2/η. A second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward 2/η. The mean value theorem localizes each expression to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.
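
As a sanity anchor for the "stability boundary 2/η" (a standard quadratic fact, not the paper's derivation): on L(w) = ½λw², gradient descent iterates w_{k+1} = (1 − ηλ)w_k, which contracts exactly when λ < 2/η, holds |w| constant at λ = 2/η, and diverges beyond it.

    eta = 0.1
    for lam in (2 / eta - 1, 2 / eta, 2 / eta + 1):  # below, at, above the boundary
        w = 1.0
        for _ in range(200):
            w = (1 - eta * lam) * w                  # exact GD step on L(w) = 0.5 * lam * w**2
        print(f"lambda = {lam:4.0f}  |w_200| = {abs(w):.3e}  (2/eta = {2 / eta:.0f})")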

What carries the argument

The edge coupling functional on consecutive parameter pairs, with coefficient fixed by the gradient descent update; differencing its criticality condition and second-order expanding the loss change, then localizing both via the mean value theorem to the exact Hessian.
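
The localization step can be pictured in one dimension (our illustration, with an arbitrary non-quadratic loss and an arbitrary step segment): the Hessian averaged over a segment equals the true Hessian at some strictly interior point, by the integral mean value theorem.

    import numpy as np
    from scipy.optimize import brentq

    # L(w) = cosh(w): L'(w) = sinh(w), L''(w) = cosh(w), so curvature varies along a step.
    a, b = 0.3, 1.1                                # endpoints of one hypothetical step segment
    avg = (np.sinh(b) - np.sinh(a)) / (b - a)      # segment-averaged Hessian = mean of L'' over [a, b]
    xi = brentq(lambda x: np.cosh(x) - avg, a, b)  # interior point promised by the mean value theorem
    print(f"averaged Hessian = {avg:.6f} = L''({xi:.4f}), with {a} < {xi:.4f} < {b}")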

If this is right

  • The same coupling classifies fixed points and period-two orbits by setting its gradients to zero.
  • Near a fixed point the dynamics reduce to a function of half-amplitude that decides which directions support period-two orbits and on which side of the critical learning rate they appear (checked in the one-dimensional sketch after this list).
  • The forcing holds for the exact Hessian eigenvalue rather than any averaged quantity.
  • Both the recurrence and the loss-change formula involve different Hessian averages yet localize to the same interior point via the mean value theorem.
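
The one-dimensional sketch promised above (a toy of our choosing, not the paper's setting): for L(w) = log cosh(w) we have L''(0) = 1, so the critical learning rate is η_c = 2. Just above it, gradient descent settles onto a period-two orbit whose half-amplitude grows like √(η − η_c); expanding the orbit condition 2a = η·tanh(a) near a = 0 gives a ≈ √(3(η − η_c)/η), proportional to the square-root column below.

    import numpy as np

    def final_amplitude(eta, steps=20000, w0=0.1):
        w = w0
        for _ in range(steps):
            w -= eta * np.tanh(w)  # GD on L(w) = log(cosh(w)), whose gradient is tanh(w)
        return abs(w)              # ~0 below eta_c = 2; half-amplitude of the orbit above it

    for eta in (1.90, 2.01, 2.05, 2.20):
        print(f"eta = {eta:.2f}  |w_final| = {final_amplitude(eta):.4f}  "
              f"sqrt(eta - 2) = {np.sqrt(max(eta - 2.0, 0.0)):.4f}")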

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar coupling constructions could be written for other first-order methods to predict their stability thresholds.
  • Adding momentum or adaptive steps would likely modify the edge coupling and shift the observed edge (see the quadratic heavy-ball sketch after this list).
  • The period-two orbit analysis offers a concrete way to predict when training begins oscillating in particular directions.
  • The derivation suggests the edge of stability is a structural consequence of the gradient descent update rule itself.
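
The momentum extension can be made concrete on a quadratic (a standard calculation of ours, not a result from the paper): heavy-ball GD with momentum β is stable on curvature λ exactly when λ < 2(1 + β)/η, so if the coupling picture carries over, the observed edge should sit at 2(1 + β)/η rather than 2/η.

    # Heavy-ball GD on L(w) = 0.5 * lam * w**2; stability boundary is lam = 2 * (1 + beta) / eta.
    eta, beta = 0.1, 0.5
    boundary = 2 * (1 + beta) / eta           # = 30 here, versus 2/eta = 20 without momentum
    for lam in (boundary - 1, boundary + 1):  # just below and just above the momentum boundary
        w_prev, w = 1.0, 1.0
        for _ in range(500):
            w, w_prev = w - eta * lam * w + beta * (w - w_prev), w
        print(f"lambda = {lam:4.1f}  |w_500| = {abs(w):.3e}")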

Load-bearing premise

The edge coupling functional, whose coefficient is fixed by the gradient descent update, has a criticality condition whose differencing and second-order expansion apply directly to the neural network loss from arbitrary initialization.

What would settle it

A stable training run on any loss where the largest Hessian eigenvalue settles away from exactly 2/η for the chosen learning rate would falsify the exact-forcing claim.
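
Short of running that experiment, the shape of the exact-forcing claim can be exercised on the one-dimensional toy from above (our construction, not the paper's): on the period-two orbit of L(w) = log cosh(w), the curvature at the iterates themselves sits strictly below 2/η, yet an interior point of the step segment carries L'' = 2/η exactly, which is precisely where the mean value theorem is supposed to localize it.

    import numpy as np
    from scipy.optimize import brentq

    eta = 2.05                                     # just above eta_c = 2 for L(w) = log(cosh(w))
    sech2 = lambda w: 1 / np.cosh(w) ** 2          # L''(w)
    a = brentq(lambda w: 2 * w - eta * np.tanh(w), 1e-6, 3.0)  # orbit: a - eta * L'(a) = -a
    xi = brentq(lambda w: sech2(w) - 2 / eta, 0.0, a)          # interior point with L'' = 2/eta
    print(f"orbit endpoints: +/-{a:.4f};  L''(a) = {sech2(a):.4f} < 2/eta = {2/eta:.4f}")
    print(f"interior point xi = {xi:.4f} lies in (-{a:.4f}, {a:.4f}) with L''(xi) = {sech2(xi):.4f}")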

Figures

Figures reproduced from arXiv: 2604.20446 by Elon Litman.

Figure 1. Edge of Stability on a 3-layer MLP (CIFAR-10). Full-batch GD, η = 0.5, GELU activations, MSE loss. Dotted line marks t_c, the first step at which r̃_k ≈ 2/η. (a) Effective curvature r̃_k (blue) and sharpness λ_max (green); both saturate near 2/η = 4 (dashed). (b) Training loss (solid) and 5-step running mean (dashed). Inset: detrended loss L_k − L̄_k, showing oscillation.

Figure 2. Continuous onset of period-doubling in a two-layer linear network (Proposition 2.6). p = 5, h = 3, d = 10, rank-3 target, n = 200 samples. (a) Period-two amplitude 2‖a(η)‖ vs. η − η_c (log-log). The observed amplitude (dots) tracks the √(η − η_c) scaling predicted by Corollary 2.5 (dashed); Q_⊥(u_c) < 0 shows that the branch appears for η > η_c. (b) Pitchfork diagram: projection ⟨w_k − w̄, u_c⟩ vs. η. Branches emerg…

Figure 3. Validation of Theorems 2.2 and 4.1. Four learning rates, 4,000 steps, shared initialization. (a) Weighted average curvature converges to 2/η (dashed). (b) Actual ΔL_k vs. the proxy −(1/(2η)) d_k⊤(w_{k+2} − w_k); its tightness confirms that r̄_k and r̃_k nearly coincide on this run.

Figure 4. Two-step return ratio ‖w_{k+2} − w_k‖/‖d_k‖ vs. training step for two learning rates (solid: rolling median; faint: raw). Before EoS onset the ratio exceeds 1, reflecting the progressive sharpening phase in which consecutive steps reinforce rather than reverse. At EoS onset, the ratio drops from O(1) toward ≈ 0.15, indicating approximate (but not exact) period-two behavior; by Corollary 4.4, this directly bo…
Original abstract

Full-batch gradient descent on neural networks drives the largest Hessian eigenvalue to the threshold $2/\eta$, where $\eta$ is the learning rate. This phenomenon, the Edge of Stability, has resisted a unified explanation: existing accounts establish self-regulation near the edge but do not explain why the trajectory is forced toward $2/\eta$ from arbitrary initialization. We introduce the edge coupling, a functional on consecutive iterate pairs whose coefficient is uniquely fixed by the gradient-descent update. Differencing its criticality condition yields a step recurrence with stability boundary $2/\eta$, and a second-order expansion yields a loss-change formula whose telescoping sum forces curvature toward $2/\eta$. The two formulas involve different Hessian averages, but the mean value theorem localizes each to the true Hessian at an interior point of the step segment, yielding exact forcing of the Hessian eigenvalue with no gap. Setting both gradients of the edge coupling to zero classifies fixed points and period-two orbits; near a fixed point, the problem reduces to a function of the half-amplitude alone, which determines which directions support period-two orbits and on which side of the critical learning rate they appear.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that full-batch gradient descent on neural networks is driven to the Edge of Stability (largest Hessian eigenvalue exactly equal to 2/η) from arbitrary initialization via an 'edge coupling' functional on consecutive iterate pairs. The coefficient of this functional is fixed by the GD update rule; differencing its criticality condition produces a step recurrence whose stability boundary is 2/η, while a second-order expansion produces a loss-change formula. Telescoping the latter and applying the mean value theorem to both resulting Hessian averages is asserted to localize them to the true Hessian at an interior point of each step segment, thereby forcing the eigenvalue exactly to the threshold with no gap. The same functional is used to classify fixed points and period-two orbits; near fixed points the analysis reduces to a function of half-amplitude that determines which directions support period-two behavior and on which side of the critical learning rate they appear.

Significance. If the central derivation is correct, the work supplies the missing 'origin' explanation for why trajectories are forced onto the edge rather than merely self-regulated near it, using only a parameter-free functional and standard calculus. This would be a substantive advance over prior accounts that establish stability but not the forcing mechanism from arbitrary initialization. The reduction of the period-two analysis to a scalar function of half-amplitude is a clean technical contribution that could be useful for further study of oscillatory behavior.

major comments (2)
  1. [section deriving loss-change formula and step recurrence] The central claim of 'exact forcing … with no gap' rests on the assertion that the mean-value theorem applied to the two distinct Hessian averages (one from the differenced criticality condition, one from the second-order loss-change expansion) localizes both to the current Hessian eigenvalue. However, the MVT supplies an interior point for each average separately; nothing in the derivation forces these interior points to coincide or pins the eigenvalue exactly at the current iterate rather than at some nearby ξ. This step is load-bearing for the 'from arbitrary initialization' and 'no gap' claims and is not obvious for non-quadratic losses (see the paragraph following the definition of the edge coupling and the subsequent telescoping-sum argument). [A one-dimensional illustration of the distinct-interior-points issue follows the minor comments.]
  2. [introduction of the edge coupling functional] The weakest assumption—that the criticality condition of the edge coupling applies directly to the neural-network loss from arbitrary initialization—is used without explicit verification that the functional remains well-defined and that its gradient vanishes in a manner compatible with the GD trajectory for non-convex, non-quadratic losses. This needs to be stated as an assumption or proved for the class of losses considered.
minor comments (2)
  1. [definition of edge coupling] Notation for the edge coupling functional and its two gradients should be introduced with a single displayed equation rather than scattered across paragraphs; this would improve readability when the differencing and expansion steps are later referenced.
  2. [related work] The manuscript would benefit from a short table or diagram contrasting the new edge-coupling argument with prior self-regulation accounts (e.g., which step each explains and which assumptions each relaxes).
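
Major comment 1 is easy to exhibit in one dimension (our illustration, independent of the paper's actual proof): for a non-quadratic loss, the interior point supplied by the integral mean value theorem for the segment-averaged Hessian and the one supplied by the Lagrange remainder of the second-order expansion generally differ.

    import numpy as np
    from scipy.optimize import brentq

    f, f1, f2 = np.cosh, np.sinh, np.cosh  # L = cosh, L' = sinh, L'' = cosh
    a, b = 0.3, 1.1                        # one hypothetical step segment
    # Interior point for the segment-averaged Hessian: L''(xi1) = (L'(b) - L'(a)) / (b - a).
    xi1 = brentq(lambda x: f2(x) - (f1(b) - f1(a)) / (b - a), a, b)
    # Interior point from the Lagrange remainder: L(b) = L(a) + L'(a)(b-a) + 0.5 L''(xi2)(b-a)^2.
    rem = (f(b) - f(a) - f1(a) * (b - a)) / (0.5 * (b - a) ** 2)
    xi2 = brentq(lambda x: f2(x) - rem, a, b)
    print(f"xi1 = {xi1:.4f}, xi2 = {xi2:.4f}: two distinct interior points on the same segment")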

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments, which help clarify the presentation of our central claims. We address each major comment below.

Point-by-point responses
  1. Referee: [section deriving loss-change formula and step recurrence] The central claim of 'exact forcing … with no gap' rests on the assertion that the mean-value theorem applied to the two distinct Hessian averages (one from the differenced criticality condition, one from the second-order loss-change expansion) localizes both to the current Hessian eigenvalue. However, the MVT supplies an interior point for each average separately; nothing in the derivation forces these interior points to coincide or pins the eigenvalue exactly at the current iterate rather than at some nearby ξ. This step is load-bearing for the 'from arbitrary initialization' and 'no gap' claims and is not obvious for non-quadratic losses (see the paragraph following the definition of the edge coupling and the subsequent telescoping-sum argument).

    Authors: We appreciate the referee's identification of this subtlety in the MVT applications. The two Hessian averages are defined over the identical step interval between consecutive iterates. While the MVT yields (potentially distinct) interior points for each average, the step recurrence obtained by differencing the criticality condition and the telescoping loss-change formula together enforce a consistency requirement on the curvature. Any sustained gap below 2/η would violate the stability boundary of the recurrence, forcing the largest eigenvalue to the threshold at the iterates. For non-quadratic losses the localization remains to interior points, yet the iterative nature of the trajectory propagates the forcing from arbitrary initialization. We will revise the relevant section to explicitly distinguish the interior points from the iterate locations and add a short discussion of continuity of the Hessian along the path to clarify why the eigenvalue at the current iterate is pinned. This constitutes a partial revision for improved rigor and clarity. revision: partial

  2. Referee: [introduction of the edge coupling functional] The weakest assumption—that the criticality condition of the edge coupling applies directly to the neural-network loss from arbitrary initialization—is used without explicit verification that the functional remains well-defined and that its gradient vanishes in a manner compatible with the GD trajectory for non-convex, non-quadratic losses. This needs to be stated as an assumption or proved for the class of losses considered.

    Authors: We agree that the applicability of the edge coupling should be stated explicitly rather than left implicit. The functional is defined on consecutive iterate pairs with its coefficient fixed by the GD update rule, after which the criticality condition is imposed formally. In the revised manuscript we will insert a dedicated paragraph immediately after the definition of the edge coupling, stating that we assume the functional is well-defined for twice-differentiable losses and that its gradient vanishes in a manner compatible with the GD trajectory. This renders the foundational assumption transparent without claiming a general proof for arbitrary non-convex losses; all subsequent derivations follow from this assumption. revision: yes

Circularity Check

0 steps flagged

No circularity: the derivation introduces a new functional fixed by the GD rule, then applies standard calculus and the MVT.

Full rationale

The paper defines the edge coupling functional with coefficient fixed by the gradient-descent update, then derives the step recurrence via differencing its criticality condition and the loss-change formula via second-order expansion. These steps use ordinary calculus on the newly introduced object. The mean-value theorem is invoked to localize the two distinct Hessian averages to interior points, but this is a standard theorem application rather than a reduction of the target result to the inputs by construction. No self-citations, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the abstract or outline. The chain remains independent of the claimed 2/η forcing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The argument rests on the standard mean value theorem applied to the loss and gradient along line segments, plus the newly introduced edge coupling whose form is dictated by the gradient-descent step.

axioms (1)
  • standard math: The mean value theorem applies to the loss and gradient functions along each gradient-descent step segment.
    Invoked to localize the averaged Hessians appearing in the recurrence and loss-change formulas to the true Hessian at an interior point.
invented entities (1)
  • edge coupling: no independent evidence
    purpose: Functional on consecutive iterate pairs whose criticality condition encodes the stability boundary.
    Newly defined in the paper; its coefficient is fixed by the gradient-descent update rule.

pith-pipeline@v0.9.0 · 5487 in / 1322 out tokens · 41480 ms · 2026-05-10T01:20:55.549336+00:00 · methodology

