Does Weight Decay Enhance Training Stability?
Pith reviewed 2026-05-20 19:45 UTC · model grok-4.3
The pith
Weight decay slows progressive sharpening and triggers architecture-dependent phase transitions at the edge of stability.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Weight decay robustly slows progressive sharpening. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical 2/η boundary. The global alignment of the parameter vector and the sharpness gradient is identified as the mechanistic driver of the phase transition. These phenomena translate into stability in terms of search in function-space as measured by the NTK, showing that curvature thresholds obtained from convex or quadratic heuristics may not be reliable stability diagnostics under regularization.
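The 2/η boundary in the core claim comes from the classical stability condition for gradient descent on a quadratic. The minimal Python sketch below uses a one-dimensional quadratic as a stand-in for the sharpest Hessian direction. It illustrates that heuristic and shows how a coupled L2 penalty of strength γ lowers the limit on the unregularized curvature to 2/η − γ. The function name, the curvature values, and the deliberately exaggerated γ = 10 are illustrative assumptions, not settings from the paper.

```python
def gd_on_quadratic(curvature, lr, weight_decay=0.0, steps=200, x0=1.0):
    """Run gradient descent on L(x) = 0.5 * (curvature + weight_decay) * x**2.

    Returns |x| after the given number of steps.
    """
    x = x0
    for _ in range(steps):
        grad = (curvature + weight_decay) * x   # loss gradient plus L2 term
        x = x - lr * grad                       # update factor: 1 - lr * (curvature + wd)
    return abs(x)


lr = 0.02                    # step size eta, so 2 / eta = 100
threshold = 2.0 / lr

# Curvature just below / just above 2 / eta, no weight decay.
below = gd_on_quadratic(curvature=0.95 * threshold, lr=lr)
above = gd_on_quadratic(curvature=1.05 * threshold, lr=lr)

# Same sub-threshold curvature, but the L2 term pushes the total curvature past 2 / eta.
shifted = gd_on_quadratic(curvature=0.95 * threshold, lr=lr, weight_decay=10.0)

print(f"below threshold: {below:.3e}")
print(f"above threshold: {above:.3e}")
print(f"below threshold with weight decay: {shifted:.3e}")
```

This is the kind of quadratic heuristic the paper argues can mislead under regularization: it places the limit near 2/η − γ, whereas the MLP runs are reported to stabilize significantly below 2/η.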
What carries the argument
The global alignment of the parameter vector and the sharpness gradient, which serves as the driver of the MLP phase transition that keeps sharpness below the 2/η boundary.
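To make the alignment quantity concrete, the sketch below computes it on a two-parameter toy model. It assumes that alignment is measured as the cosine between the parameter vector θ and the sharpness gradient ∇S(θ); the paper's exact definition is not reproduced on this page. The toy loss L(a, b) = 0.5 (ab − y)², the chosen point, and all function names are hypothetical. By a first-order Taylor expansion, a decay step θ ← (1 − ηγ)θ changes the sharpness by about −ηγ⟨θ, ∇S(θ)⟩, so positive alignment means the decay step lowers sharpness.

```python
import math


def sharpness(a, b, y=1.0):
    """Largest Hessian eigenvalue of L(a, b) = 0.5 * (a * b - y) ** 2."""
    # The Hessian is [[b^2, 2ab - y], [2ab - y, a^2]].
    p, q, r = b * b, 2.0 * a * b - y, a * a
    return 0.5 * (p + r) + math.sqrt((0.5 * (p - r)) ** 2 + q * q)


def sharpness_gradient(a, b, h=1e-5):
    """Central finite-difference estimate of the gradient of the sharpness."""
    da = (sharpness(a + h, b) - sharpness(a - h, b)) / (2.0 * h)
    db = (sharpness(a, b + h) - sharpness(a, b - h)) / (2.0 * h)
    return (da, db)


def cosine(u, v):
    """Cosine of the angle between two 2-vectors."""
    dot = sum(x * z for x, z in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


theta = (2.0, 0.5)                      # hypothetical parameters with a * b = y
grad_s = sharpness_gradient(*theta)
alignment = cosine(theta, grad_s)

# First-order change in sharpness from one decay step theta <- (1 - lr * wd) * theta.
lr, wd = 0.02, 0.02
predicted_change = -lr * wd * sum(t * g for t, g in zip(theta, grad_s))

print(f"alignment = {alignment:.4f}, predicted sharpness change = {predicted_change:.6f}")
```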
If this is right
- Weight decay provides a controllable way to reduce progressive sharpening across different neural network trainings.
- CNNs and MLPs require different weight decay settings to achieve stable behavior at the edge of stability.
- In MLPs, sufficiently large weight decay keeps sharpness stably below the conventional stability limit.
- Stability gains appear not only in parameter space but also in function-space dynamics tracked by the NTK.
- Curvature-based rules for detecting instability need revision when weight decay or similar regularization is active.
Where Pith is reading between the lines
- Tuning weight decay separately for convolutional versus fully connected layers could improve overall training reliability.
- The alignment mechanism may extend to other regularizers or adaptive optimizers and could be monitored as a practical stability signal.
- Similar phase transitions might appear in newer architectures such as transformers when weight decay is varied.
- The framework offers a route to test whether disrupting alignment experimentally removes the observed MLP transition.
Load-bearing premise
That the observed alignment between the parameter vector and the sharpness gradient is the causal driver of the MLP phase transition rather than a side effect of other dynamics.
What would settle it
An experiment that artificially reduces or breaks the alignment between the parameter vector and sharpness gradient in an MLP while keeping weight decay fixed, then checks whether the sharpness phase transition below 2/η still occurs.
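One hypothetical way to set up such an intervention is to keep the decay coefficient γ fixed but apply the decay only to the component of θ orthogonal to the sharpness gradient, so that the decay step has no first-order effect on sharpness. The Python sketch below illustrates that projection with made-up vectors. It is an assumption about how the experiment could be designed, not a procedure described in the paper, and it does alter the regularizer, which a real study would need to control for.

```python
def decay_step_without_alignment(theta, sharpness_grad, lr, weight_decay):
    """Apply weight decay only along directions orthogonal to the sharpness gradient."""
    g_norm_sq = sum(g * g for g in sharpness_grad)
    coeff = sum(t * g for t, g in zip(theta, sharpness_grad)) / g_norm_sq
    # Component of theta that is orthogonal to the sharpness gradient.
    theta_perp = [t - coeff * g for t, g in zip(theta, sharpness_grad)]
    return [t - lr * weight_decay * p for t, p in zip(theta, theta_perp)]


theta = [2.0, 0.5, -1.0]             # hypothetical parameter vector
sharpness_grad = [4.0, 2.0, 0.5]     # hypothetical sharpness gradient at theta

new_theta = decay_step_without_alignment(theta, sharpness_grad, lr=0.02, weight_decay=0.02)

# First-order change in sharpness caused by the modified decay step.
delta = [n - t for n, t in zip(new_theta, theta)]
first_order_change = sum(d * g for d, g in zip(delta, sharpness_grad))
print(f"first-order sharpness change: {first_order_change:.2e}")
```

If the sub-2/η stabilization persisted under a step like this, alignment would be harder to defend as the driver; if it disappeared, the causal reading would gain support.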
Original abstract
In modern deep learning, weight decay is often credited with "stabilizing" training dynamics, diverging from its classical role as a static regularization penalty. We investigate a fundamental question: *does weight decay stabilize training dynamics, and if so, through which mechanism?* Indeed, training stability is understood through different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the *Edge of Stability* (EoS). We show that weight decay robustly slows *progressive sharpening*. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical $\frac{2}{\eta}$ boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Last, this shows that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates the effects of weight decay on training stability at the Edge of Stability (EoS). It claims that weight decay robustly slows progressive sharpening, reveals an architecture-dependent phase transition (dampening oscillations in CNNs but causing sharpness to stabilize below the 2/η threshold in MLPs), develops a mathematical framework that models these phenomena, and identifies the global alignment of the parameter vector with the sharpness gradient as the mechanistic driver of the MLP transition. The work further links these dynamics to improved stability in function space via the NTK and argues that curvature thresholds from convex heuristics are unreliable under regularization.
Significance. If the framework holds and the alignment mechanism is shown to be causal rather than correlative, the results would refine understanding of weight decay beyond static regularization, offering mechanistic explanations for its stabilizing role in non-convex optimization. The architecture-specific phase transitions and NTK implications provide concrete, testable predictions that could guide regularization choices in practice and highlight limitations of quadratic stability diagnostics.
major comments (2)
- [§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.
- [§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.
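The first major comment lists direct effects of the L2 term that could confound the alignment explanation. One of them is easy to make concrete: adding (γ/2)‖θ‖² to the loss adds γI to the Hessian, so every eigenvalue, including the sharpness, shifts up by exactly γ regardless of alignment. The Python check below uses a hypothetical 2×2 Hessian; the matrix entries and the helper name are illustrative and not taken from the paper.

```python
import math


def eigvals_sym2(p, q, r):
    """Eigenvalues (ascending) of the symmetric matrix [[p, q], [q, r]]."""
    mean = 0.5 * (p + r)
    radius = math.sqrt((0.5 * (p - r)) ** 2 + q * q)
    return (mean - radius, mean + radius)


gamma = 0.02                         # weight decay coefficient
p, q, r = 3.0, 1.0, 80.0             # hypothetical Hessian entries of the unregularized loss

plain = eigvals_sym2(p, q, r)
# Hessian of the regularized loss: add gamma to the diagonal only.
regularized = eigvals_sym2(p + gamma, q, r + gamma)

shifts = [reg - raw for raw, reg in zip(plain, regularized)]
print("eigenvalue shifts:", shifts)
```

An analysis that attributes the transition to alignment would need to show an effect beyond this uniform shift.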
minor comments (2)
- [Figure 4] Figure 4 (CNN oscillation damping): The y-axis scaling and oscillation amplitude comparison across weight decay values would benefit from explicit normalization to the no-decay baseline for clearer visual assessment of the dampening effect.
- [§2.2] Notation in §2.2: The definition of 'progressive sharpening' is introduced without a precise mathematical expression linking it to the maximum eigenvalue trajectory; a short equation would improve clarity.
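The minor comment on §2.2 asks for progressive sharpening to be tied to the trajectory of the maximum Hessian eigenvalue. One common way to obtain that quantity without forming the Hessian is power iteration on Hessian-vector products. The Python sketch below does this on a toy quadratic whose Hessian is known; the matrix, the finite-difference Hessian-vector product, and the function names are assumptions for illustration and are not the paper's measurement code. Power iteration converges to the eigenvalue of largest magnitude, which equals the sharpness only when that eigenvalue is positive.

```python
import math

# Hessian of the toy loss L(theta) = 0.5 * theta^T A theta (hypothetical values).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 0.0],
     [0.0, 0.0, 1.0]]


def grad(theta):
    """Gradient of the toy quadratic; stands in for a real backward pass."""
    return [sum(a_ij * t for a_ij, t in zip(row, theta)) for row in A]


def hvp(theta, v, eps=1e-3):
    """Hessian-vector product from a central difference of gradients."""
    plus = grad([t + eps * x for t, x in zip(theta, v)])
    minus = grad([t - eps * x for t, x in zip(theta, v)])
    return [(p - m) / (2.0 * eps) for p, m in zip(plus, minus)]


def estimate_sharpness(theta, iters=100):
    """Power iteration on the Hessian; returns the final Rayleigh quotient."""
    v = [1.0 / math.sqrt(len(theta))] * len(theta)
    for _ in range(iters):
        hv = hvp(theta, v)
        norm = math.sqrt(sum(x * x for x in hv))
        v = [x / norm for x in hv]
    return sum(x * y for x, y in zip(v, hvp(theta, v)))


theta = [0.3, -0.2, 0.5]
sharp = estimate_sharpness(theta)
print(f"estimated sharpness: {sharp:.6f}")
```

Logging this estimate at every step t gives a trajectory S_t = λ_max(∇²L(θ_t)). Progressive sharpening would then correspond to S_t tending to rise during training while it remains below 2/η.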
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help refine our analysis of weight decay's role in stabilizing training at the Edge of Stability. We respond point-by-point to the major comments below, offering clarifications based on the manuscript's framework and indicating where revisions will strengthen the presentation.
Point-by-point responses
- Referee: [§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.
  Authors: Our continuous-time framework in §4 derives the sharpness evolution equation under weight decay, where the alignment term between the parameter vector and sharpness gradient appears explicitly as the factor that induces the sub-2/η stabilization in MLPs. This derivation accounts for the L2 penalty's direct contribution to the loss and Hessian while showing that the phase transition arises specifically from the alignment-driven modification to the sharpness flow, rather than norm modulation in isolation. Empirical matches between the model predictions and observed dynamics across architectures support alignment as the mechanistic driver. We acknowledge that an explicit interventional study (e.g., constrained optimization holding alignment fixed while varying decay) would provide stronger causal separation. In revision we will add a dedicated paragraph in §5.2 discussing confounding effects of weight decay and clarifying the framework's isolation of the alignment mechanism, while noting interventional validation as future work.
  Revision: partial
- Referee: [§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.
  Authors: We agree that quantifying robustness of the alignment estimate is valuable given its central role. The alignment is obtained via finite-difference approximation of the sharpness gradient; while multi-seed consistency is shown empirically, formal bounds and sensitivity checks were omitted. In the revised manuscript we will include analytic error bounds on the finite-difference approximation and add a sensitivity study that perturbs the sharpness gradient estimate with controlled noise levels (e.g., additive Gaussian perturbations of varying magnitude). These additions are intended to show that the high alignment values and the predicted sub-2/η stabilization are robust to such perturbations, thereby reinforcing the framework's modeling accuracy.
  Revision: yes
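The sensitivity study proposed in the second response can be sketched in a few lines. The Python example below adds Gaussian noise of increasing relative magnitude to a sharpness-gradient estimate and reports how the cosine alignment with the parameter vector moves. The vectors are synthetic stand-ins, alignment is again assumed to be a cosine, and the noise scaling, function names, and trial count are hypothetical choices rather than the authors' protocol.

```python
import math
import random
import statistics


def cosine(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def alignment_under_noise(theta, grad_estimate, rel_sigma, trials=200, seed=0):
    """Mean and std of cos(theta, g + noise) over Gaussian perturbations of g."""
    rng = random.Random(seed)
    # Noise std per coordinate, relative to the RMS entry of the gradient estimate.
    rms = math.sqrt(sum(g * g for g in grad_estimate) / len(grad_estimate))
    scale = rel_sigma * rms
    samples = []
    for _ in range(trials):
        noisy = [g + rng.gauss(0.0, scale) for g in grad_estimate]
        samples.append(cosine(theta, noisy))
    return statistics.mean(samples), statistics.pstdev(samples)


# Synthetic stand-ins: a parameter vector and a gradient estimate mostly aligned with it.
rng = random.Random(1)
theta = [rng.gauss(0.0, 1.0) for _ in range(50)]
grad_estimate = [0.9 * t + 0.1 * rng.gauss(0.0, 1.0) for t in theta]

baseline = cosine(theta, grad_estimate)
report = {s: alignment_under_noise(theta, grad_estimate, s) for s in (0.0, 0.01, 0.1, 1.0)}

for rel_sigma, (mean, std) in report.items():
    print(f"rel_sigma={rel_sigma:<5} mean alignment={mean:.4f} std={std:.4f}")
```

A flat mean with small spread across noise levels comparable to the finite-difference error would support the robustness claim; a rapid drop would indicate that the reported alignment depends on the accuracy of the gradient estimate.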
Circularity Check
No significant circularity detected
The paper grounds its claims in direct empirical measurements of training dynamics at the Edge of Stability across architectures, then introduces a separate mathematical framework to reproduce the observed sharpening slowdown and phase transition. The alignment between parameter vector and sharpness gradient is derived as an explanatory variable inside that framework rather than being presupposed by the input data or by any self-referential definition. No equations reduce a prediction to a fitted quantity by construction, no load-bearing result rests solely on self-citation, and no ansatz is imported without independent justification. The derivation therefore remains self-contained against external experimental benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Training dynamics can be analyzed via progressive sharpening and the edge-of-stability threshold of 2/η
- domain assumption: The neural tangent kernel provides a valid lens for function-space stability
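The second axiom can be illustrated with the empirical neural tangent kernel, whose entries are inner products of parameter gradients of the model output, K(x, x') = ⟨∇_θ f(x), ∇_θ f(x')⟩. The Python sketch below computes it for a hypothetical two-parameter model f(x) = a·tanh(bx) and measures how much the kernel moves between two parameter settings. The model, the inputs, and the relative Frobenius drift metric are assumptions for illustration; the page does not state which NTK-based measure the paper uses.

```python
import math


def param_gradient(params, x):
    """Gradient of f(x) = a * tanh(b * x) with respect to (a, b)."""
    a, b = params
    t = math.tanh(b * x)
    return (t, a * x * (1.0 - t * t))


def empirical_ntk(params, inputs):
    """Gram matrix K[i][j] = <grad f(x_i), grad f(x_j)>."""
    grads = [param_gradient(params, x) for x in inputs]
    return [[sum(u * v for u, v in zip(gi, gj)) for gj in grads] for gi in grads]


def relative_change(k_old, k_new):
    """Frobenius norm of the kernel difference, relative to the old kernel."""
    diff = sum((n - o) ** 2 for ro, rn in zip(k_old, k_new) for o, n in zip(ro, rn))
    base = sum(o ** 2 for row in k_old for o in row)
    return math.sqrt(diff / base)


inputs = [-1.0, 0.5, 2.0]
k_before = empirical_ntk((1.0, 0.8), inputs)
k_after = empirical_ntk((1.05, 0.78), inputs)    # hypothetical parameters one step later
drift = relative_change(k_before, k_after)
print(f"relative NTK drift: {drift:.4f}")
```

Tracking a drift measure of this kind over training is one way to ask whether parameter-space stabilization also shows up in function space.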
Appendix excerpts
A Empirical Results
A.1 EoS behaviour at lower sharpness threshold
Figure 9 shows an MLP trained with stepsize η = 0.02 and weight decay γ = 0.02. The sharpness stabilizes around 80, far below the weig...

The sharpness trajectory is consistent across seeds, suggesting that the observed phenomenon of sharpness stabilizing far below 2/η − γ is not an artifact of a particular initialization.
Figure 16 (x-axis: Step; y-axis: Sharpness; legend: mean sharpness ± 2 std, 2/η): MLP with MSE loss trained with full batch gradient descent, η = 0.02...

... shows that throughout Phase III, $\|v_{t+1}\|^2 > \|v_t\|^2$ at each step, driven by the $\eta^2$ correction term $\Delta_t \frac{\eta^2}{n} \lambda_1 \langle E_t, q_1 \rangle^2$, which is *positive* whenever $\Delta_t > 0$ (i.e., whenever $\frac{\lambda_1}{n} c_t^2 > \frac{2}{\eta}$). Moreover, Theorem 1(B) provides an *overall* increase across Phases III and IV. Under Assumption 4 ($\|E_{t_2}\|^2 \le \|E_{t_1}\|^2$) and the condition $\Delta_{t_1} \ge \Omega(\frac{\delta^2}{\eta})$, one obtains $\alpha_{t_2} > \alpha_{t_1}$. U...