
arXiv: 2605.16622 · v1 · pith:E4ZNFZQVnew · submitted 2026-05-15 · 💻 cs.LG · math.OC · stat.ML

Does Weight Decay Enhance Training Stability?

Pith reviewed 2026-05-20 19:45 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords weight decay · edge of stability · progressive sharpening · training dynamics · phase transition · CNN · MLP · NTK

The pith

Weight decay slows progressive sharpening and triggers architecture-dependent phase transitions at the edge of stability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether weight decay stabilizes training dynamics in deep neural networks beyond its classical regularization effect. It finds that weight decay consistently slows progressive sharpening of the loss landscape during optimization. The effect at the edge of stability differs by architecture: in convolutional networks weight decay reduces the oscillations, while in multilayer perceptrons increasing it triggers a phase transition that holds sharpness well below the usual 2/η threshold, where η is the learning rate. A mathematical framework connects these behaviors to the alignment between the parameter vector and the sharpness gradient. The parameter-space effects also translate into measurable stability of the search in function space, as tracked by the neural tangent kernel (NTK).

Core claim

Weight decay robustly slows progressive sharpening. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical 2/η boundary. The global alignment of the parameter vector and the sharpness gradient is identified as the mechanistic driver of the phase transition. These phenomena translate into stability in terms of search in function-space as measured by the NTK, showing that curvature thresholds obtained from convex or quadratic heuristics may not be reliable stability diagnostics under regularization.
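
For intuition about where the 2/η boundary and its shift under weight decay come from, here is a minimal sketch of the quadratic heuristic that the paper argues may be unreliable for real networks. It is not the paper's toy model: the one-dimensional loss, the function name `gd_quadratic`, and the convention that sharpness λ excludes the L2 term are assumptions made for illustration. Gradient descent on L(x) = ½(λ + γ)x² multiplies x by 1 − η(λ + γ) at every step, so it contracts only when λ < 2/η − γ.

```python
def gd_quadratic(curvature, weight_decay, lr, steps=200, x0=1.0):
    """Run GD on L(x) = 0.5*(curvature + weight_decay)*x**2 and return |x| at the end.

    `curvature` plays the role of the sharpness of the unregularized loss
    (an assumed convention); `weight_decay` is the L2 coefficient gamma.
    """
    x = x0
    for _ in range(steps):
        grad = (curvature + weight_decay) * x   # loss gradient plus L2 term
        x = x - lr * grad                       # x <- (1 - lr*(curvature + gamma)) * x
    return abs(x)

lr = 0.01      # eta, matching the value quoted in the Figure 2 caption
gamma = 0.1    # weight decay, also taken from the Figure 2 caption
threshold = 2.0 / lr - gamma   # quadratic-model stability boundary, 2/eta - gamma

below = gd_quadratic(threshold - 1.0, gamma, lr)   # just inside the boundary
above = gd_quadratic(threshold + 1.0, gamma, lr)   # just outside the boundary

print(f"curvature just below threshold: |x| = {below:.3f} (shrinks)")
print(f"curvature just above threshold: |x| = {above:.3f} (grows)")
```

Curvature is constant in this model, so it cannot express the reported MLP behavior in which sharpness settles well below the boundary; it only shows the baseline that the claim is measured against.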

What carries the argument

The global alignment of the parameter vector and the sharpness gradient, which serves as the driver of the MLP phase transition that keeps sharpness below the 2/η boundary.

If this is right

  • Weight decay provides a controllable way to reduce progressive sharpening across different neural network trainings.
  • CNNs and MLPs require different weight decay settings to achieve stable behavior at the edge of stability.
  • In MLPs, sufficiently large weight decay keeps sharpness stably below the conventional stability limit.
  • Stability gains appear not only in parameter space but also in function-space dynamics tracked by the NTK.
  • Curvature-based rules for detecting instability need revision when weight decay or similar regularization is active.
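
To make the function-space bullet concrete, the sketch below computes the top eigenvalue of an empirical NTK for a tiny network. Everything specific here is an assumption for illustration: the one-hidden-layer tanh model, the random inputs, and the function names `jacobian` and `ntk_top_eigenvalue`. The paper's own NTK measurements (one figure caption mentions a normalized top eigenvalue) may be defined differently.

```python
import numpy as np

def jacobian(W, a, X):
    """Jacobian of f(x) = a . tanh(W x) with respect to (W, a); one row per input."""
    H = np.tanh(X @ W.T)                                  # (n, h) hidden activations
    dW = (a * (1.0 - H**2))[:, :, None] * X[:, None, :]   # (n, h, d): df/dW[j, k]
    return np.concatenate([dW.reshape(X.shape[0], -1), H], axis=1)

def ntk_top_eigenvalue(W, a, X):
    """Largest eigenvalue of the empirical NTK Gram matrix K = J J^T."""
    J = jacobian(W, a, X)
    return float(np.linalg.eigvalsh(J @ J.T)[-1])

rng = np.random.default_rng(0)
n, d, h = 8, 3, 16                             # hypothetical sizes
X = rng.standard_normal((n, d))                # random stand-in inputs
W = rng.standard_normal((h, d)) / np.sqrt(d)   # hidden-layer weights
a = rng.standard_normal(h) / np.sqrt(h)        # output weights

J = jacobian(W, a, X)
top = ntk_top_eigenvalue(W, a, X)
print(f"empirical NTK top eigenvalue: {top:.4f}")
```

Tracking this quantity along a training run, with and without weight decay, is one way such a stability comparison could be set up.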

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Tuning weight decay separately for convolutional versus fully connected layers could improve overall training reliability.
  • The alignment mechanism may extend to other regularizers or adaptive optimizers and could be monitored as a practical stability signal.
  • Similar phase transitions might appear in newer architectures such as transformers when weight decay is varied.
  • The framework offers a route to test whether disrupting alignment experimentally removes the observed MLP transition.

Load-bearing premise

That the observed alignment between the parameter vector and the sharpness gradient is the causal driver of the MLP phase transition rather than a side effect of other dynamics.

What would settle it

An experiment that artificially reduces or breaks the alignment between the parameter vector and sharpness gradient in an MLP while keeping weight decay fixed, then checks whether the sharpness phase transition below 2/η still occurs.
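
A rough sketch of how the alignment in such an experiment might be measured, under loud assumptions: alignment is taken to be the cosine similarity between the parameter vector and the gradient of the sharpness, the sharpness gradient is estimated by central finite differences (the simulated rebuttal further down mentions a finite-difference approximation), and the two-parameter loss L(u, v) = ½(uv − y)² is a stand-in toy, not a model from the paper. All function names are hypothetical.

```python
import numpy as np

def hessian(theta, y=1.0):
    """Hessian of the toy loss L(u, v) = 0.5 * (u*v - y)**2 at theta = (u, v)."""
    u, v = theta
    off = 2.0 * u * v - y                     # mixed second derivative d2L/du dv
    return np.array([[v * v, off], [off, u * u]])

def sharpness(theta):
    """Sharpness = largest eigenvalue of the loss Hessian."""
    return float(np.linalg.eigvalsh(hessian(theta))[-1])

def sharpness_grad_fd(theta, eps=1e-5):
    """Central finite-difference estimate of the gradient of the sharpness."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        grad[i] = (sharpness(theta + step) - sharpness(theta - step)) / (2.0 * eps)
    return grad

def alignment(theta):
    """Assumed definition: cosine between theta and the sharpness gradient."""
    g = sharpness_grad_fd(theta)
    return float(theta @ g / (np.linalg.norm(theta) * np.linalg.norm(g)))

theta = np.array([2.0, 0.5])                  # a point on the minimum manifold u*v = y
print(f"sharpness = {sharpness(theta):.4f}")
print(f"alignment = {alignment(theta):.4f}")
```

An intervention of the kind described above would then modify the update so that this cosine is driven down while the weight decay coefficient stays fixed, and compare the resulting sharpness trajectory with the unmodified run.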

Figures

Figures reproduced from arXiv: 2605.16622 by Amir Kolic, Marius Saether, Pierfrancesco Beneventano, Tomaso Poggio.

Figure 1. The evolution of the sharpness for varying weight decay values
Figure 2. Weight decay dampens the oscillations at the EoS. On the left, the dampening for a toy loss model along with a visible γ = 0.1 shift in the stabilization threshold. On the right, a CNN trained on a cifar10-5k subset, showing the behavior of a dampened harmonic oscillator for γ = 0.01. Both had a learning rate of η = 0.01, and the dashed line is at 2/η. The chaotic behavior of γ = 0 after step ≈ 1000 for t…
Figure 3. A diagram illustrating three intertwined notions of training stability and weight decay's …
Figure 4. Empirical evaluation of weight decay's effect on Edge of Stability (EoS) dynamics for …
Figure 5. The model predicts the dynamics of x_t and y_t on the left and in the middle column, respectively. The (x_t, y_t) phase space is shown on the right. Increasing the weight decay introduces a stronger dampening factor, and we see that the sharpness oscillations decay faster. The y_t resting threshold is also shifted by −γ. This formulation exactly predicts two of our empirical observations. Firstly, weight decay da…
Figure 6. An illustration of a simplified mental model of EoS introduces optimization along a …
Figure 7. The sharpness dynamics (top) and the evolution of the …
Figure 8. The top eigenvalue of the empirical NTK for an MLP, …
Figure 9. Training dynamics for an MLP with ReLU activations trained with full batch gradient …
Figure 10. Sharpness trajectories for an MLP (left, …
Figure 11. Stabilizing sharpness as a function of γ (reproduced from …
Figure 12. Evolution of eigenvalues of the loss Hessian. CNN with ReLU trained on a 5k subset of …
Figure 13. Sharpness (top) and c_y(t), c_y^crit (bottom) during training of an MLP with ReLU on a 5k subset of cifar10, with η = 0.02 and full batch gradient descent. For small γ (left), c_y^crit is larger than c_y until the sharpness reaches 2/η − γ. For large γ (right), the 1/γ scaling keeps c_y^crit small, allowing c_y(t) to cross it before the sharpness reaches 2/η − γ.
Figure 14. Sharpness (top) and c_y(t), c_y^crit (bottom) during training of a CNN with ReLU on a 5k subset of cifar10, with η = 0.02 and full batch gradient descent. For both values of γ, c_y stays below c_y^crit until the sharpness reaches 2/η
Figure 15. Convergence of the normalized NTK top eigenvalue
Figure 16. MLP with MSE loss trained with full batch gradient descent, …

Original abstract

In modern deep learning, weight decay is often credited with "stabilizing" training dynamics, diverging from its classical role as a static regularization penalty. We investigate a fundamental question: *does weight decay stabilize training dynamics, and if so, through which mechanism?* Indeed, training stability is understood through different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the *Edge of Stability* (EoS). We show that weight decay robustly slows *progressive sharpening*. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical $\frac{2}{\eta}$ boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Last, this shows that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper investigates the effects of weight decay on training stability at the Edge of Stability (EoS). It claims that weight decay robustly slows progressive sharpening, reveals an architecture-dependent phase transition (dampening oscillations in CNNs but causing sharpness to stabilize below the 2/η threshold in MLPs), develops a mathematical framework that models these phenomena, and identifies the global alignment of the parameter vector with the sharpness gradient as the mechanistic driver of the MLP transition. The work further links these dynamics to improved stability in function space via the NTK and argues that curvature thresholds from convex heuristics are unreliable under regularization.

Significance. If the framework holds and the alignment mechanism is shown to be causal rather than correlative, the results would refine understanding of weight decay beyond static regularization, offering mechanistic explanations for its stabilizing role in non-convex optimization. The architecture-specific phase transitions and NTK implications provide concrete, testable predictions that could guide regularization choices in practice and highlight limitations of quadratic stability diagnostics.

major comments (2)
  1. [§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.
  2. [§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.
minor comments (2)
  1. [Figure 4] Figure 4 (CNN oscillation damping): The y-axis scaling and oscillation amplitude comparison across weight decay values would benefit from explicit normalization to the no-decay baseline for clearer visual assessment of the dampening effect.
  2. [§2.2] Notation in §2.2: The definition of 'progressive sharpening' is introduced without a precise mathematical expression linking it to the maximum eigenvalue trajectory; a short equation would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help refine our analysis of weight decay's role in stabilizing training at the Edge of Stability. We respond point-by-point to the major comments below, offering clarifications based on the manuscript's framework and indicating where revisions will strengthen the presentation.

  1. Referee: [§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.

    Authors: Our continuous-time framework in §4 derives the sharpness evolution equation under weight decay, where the alignment term between the parameter vector and sharpness gradient appears explicitly as the factor that induces the sub-2/η stabilization in MLPs. This derivation accounts for the L2 penalty's direct contribution to the loss and Hessian while showing that the phase transition arises specifically from the alignment-driven modification to the sharpness flow, rather than norm modulation in isolation. Empirical matches between the model predictions and observed dynamics across architectures support alignment as the mechanistic driver. We acknowledge that an explicit interventional study (e.g., constrained optimization holding alignment fixed while varying decay) would provide stronger causal separation. In revision we will add a dedicated paragraph in §5.2 discussing confounding effects of weight decay and clarifying the framework's isolation of the alignment mechanism, while noting interventional validation as future work. revision: partial

  2. Referee: [§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.

    Authors: We agree that quantifying robustness of the alignment estimate is valuable given its central role. The alignment is obtained via finite-difference approximation of the sharpness gradient; while multi-seed consistency is shown empirically, formal bounds and sensitivity checks were omitted. In the revised manuscript we will include analytic error bounds on the finite-difference approximation and add a sensitivity study that perturbs the sharpness gradient estimate with controlled noise levels (e.g., additive Gaussian perturbations of varying magnitude). These additions will demonstrate that the high alignment values and the predicted sub-2/η stabilization remain stable, thereby reinforcing the framework's modeling accuracy. revision: yes
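
A minimal sketch of what the promised sensitivity study could look like, assuming alignment is a cosine similarity and the perturbation is additive Gaussian noise scaled relative to the norm of the gradient estimate. The vectors below are synthetic stand-ins, not measurements from the paper, and the names `cosine` and `alignment_under_noise` are hypothetical.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def alignment_under_noise(theta, grad_est, rel_sigma, trials=200, seed=0):
    """Mean and std of cos(theta, grad_est + noise) over random Gaussian perturbations.

    The noise has expected norm of about rel_sigma * ||grad_est||.
    """
    rng = np.random.default_rng(seed)
    scale = rel_sigma * np.linalg.norm(grad_est) / np.sqrt(grad_est.size)
    values = [
        cosine(theta, grad_est + scale * rng.standard_normal(grad_est.size))
        for _ in range(trials)
    ]
    return float(np.mean(values)), float(np.std(values))

rng = np.random.default_rng(1)
theta = rng.standard_normal(1000)                          # synthetic parameter vector
grad_est = 0.9 * theta + 0.1 * rng.standard_normal(1000)   # synthetic, highly aligned estimate

results = {}
for rel_sigma in (0.0, 0.1, 0.5, 1.0):
    results[rel_sigma] = alignment_under_noise(theta, grad_est, rel_sigma)
    mean, std = results[rel_sigma]
    print(f"relative noise {rel_sigma:.1f}: alignment = {mean:.3f} +/- {std:.3f}")
```

How quickly the measured alignment falls as the noise level grows is the kind of evidence that would show whether the reported high alignment is robust to errors in the sharpness gradient estimate.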

Circularity Check

0 steps flagged

No significant circularity detected


The paper grounds its claims in direct empirical measurements of training dynamics at the Edge of Stability across architectures, then introduces a separate mathematical framework to reproduce the observed sharpening slowdown and phase transition. The alignment between parameter vector and sharpness gradient is derived as an explanatory variable inside that framework rather than being presupposed by the input data or by any self-referential definition. No equations reduce a prediction to a fitted quantity by construction, no load-bearing result rests solely on self-citation, and no ansatz is imported without independent justification. The derivation therefore remains self-contained against external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis relies on standard assumptions from optimization theory about loss landscapes and edge-of-stability behavior. No new free parameters or invented entities are explicitly introduced in the abstract; the mathematical framework appears to be a derivation rather than a fitted model.

axioms (2)
  • domain assumption: Training dynamics can be analyzed via progressive sharpening and the edge-of-stability threshold of 2/η.
    Invoked throughout the abstract as the baseline for observing effects of weight decay.
  • domain assumption: The neural tangent kernel provides a valid lens for function-space stability.
    Used to translate parameter-space findings into stability claims.


