Does Weight Decay Enhance Training Stability?

Amir Kolic; Marius Saether; Pierfrancesco Beneventano; Tomaso Poggio

Reviewed by Pith at T0; open to challenge.

T0 means a machine referee read the full paper against a public rubric. The mark states how deep the mechanical check went, never who wrote it.

T0 review · grok-4.3

Weight decay slows progressive sharpening and triggers architecture-dependent phase transitions at the edge of stability.

2026-05-20 19:45 UTC pith:E4ZNFZQV

Load-bearing objection: Weight decay slows progressive sharpening and triggers an architecture-dependent phase transition in MLPs via parameter-sharpness alignment, but the causal isolation of that alignment still needs work.

arXiv 2605.16622 v1 · pith:E4ZNFZQV · submitted 2026-05-15 · cs.LG math.OC stat.ML

classification: cs.LG, math.OC, stat.ML

keywords: weight decay, edge of stability, progressive sharpening, training dynamics, phase transition, CNN, MLP, NTK

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates whether weight decay stabilizes training dynamics in deep neural networks beyond its classical regularization effect. It establishes that weight decay consistently slows progressive sharpening of the loss landscape during optimization. The work identifies a clear difference by architecture: convolutional networks see reduced oscillations when operating near the edge of stability, while multilayer perceptrons exhibit a phase transition that holds sharpness well below the usual 2/η threshold. A mathematical framework connects these behaviors to the alignment between the parameter vector and the sharpness gradient. These parameter-space effects also produce measurable stability gains when viewed through function-space search using the neural tangent kernel.

Core claim

Weight decay robustly slows progressive sharpening. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical 2/η boundary. The global alignment of the parameter vector and the sharpness gradient is identified as the mechanistic driver of the phase transition. These phenomena translate into stability in terms of search in function-space as measured by the NTK, showing that curvature thresholds obtained from convex or quadratic heuristics may not be reliable stability diagnostics under regularization.
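
To make the threshold language concrete, here is a minimal sketch on a one-dimensional quadratic. It assumes weight decay enters as an L2 penalty (γ/2)θ² added to the loss, which may differ from the paper's exact setup. Under that assumption each gradient descent step multiplies θ by 1 − η(λ + γ), so the iterates shrink only while the curvature λ of the unregularized loss stays below 2/η − γ. The function name `gd_on_quadratic` and the numeric values are illustrative choices, not taken from the paper.

```python
def gd_on_quadratic(lam, eta, gamma, steps=20000, theta0=1.0):
    """Run gradient descent on 0.5*lam*theta**2 + 0.5*gamma*theta**2.

    Assumption: weight decay is an L2 penalty folded into the gradient.
    Returns |theta| after `steps` updates.
    """
    theta = theta0
    for _ in range(steps):
        grad = lam * theta + gamma * theta  # loss gradient plus decay term
        theta -= eta * grad
    return abs(theta)


eta, gamma = 0.01, 0.1
threshold = 2.0 / eta - gamma  # curvature above which the iterates grow

below = gd_on_quadratic(threshold - 0.1, eta, gamma)  # contracts toward 0
above = gd_on_quadratic(threshold + 0.1, eta, gamma)  # grows step by step
```

On a pure quadratic the boundary is exact. The MLP result reported here is notable precisely because sharpness settles well below this shifted value, which a quadratic model alone would not predict.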

What carries the argument

The global alignment of the parameter vector and the sharpness gradient, which serves as the driver of the MLP phase transition that keeps sharpness below the 2/η boundary.
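
One way to see why this inner product is a natural candidate is a chain-rule sketch. It assumes gradient flow with an L2 penalty and a differentiable sharpness $S(\theta) = \lambda_{\max}(\nabla^2 L(\theta))$ (a simple top eigenvalue); the paper's own framework may be set up differently.

```latex
\dot{\theta} = -\nabla L(\theta) - \gamma\,\theta
\quad\Longrightarrow\quad
\frac{d}{dt} S(\theta)
  = \langle \nabla S(\theta), \dot{\theta} \rangle
  = -\langle \nabla S(\theta), \nabla L(\theta) \rangle
    \;-\; \gamma\,\langle \nabla S(\theta), \theta \rangle .
```

In this sketch γ enters the sharpness flow only through the last term, whose sign and size are set by the alignment between θ and ∇S. When the two are positively aligned, the decay term pushes sharpness down.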

Load-bearing premise

That the observed alignment between the parameter vector and the sharpness gradient is the causal driver of the MLP phase transition rather than a side effect of other dynamics.

What would settle it

An experiment that artificially reduces or breaks the alignment between the parameter vector and sharpness gradient in an MLP while keeping weight decay fixed, then checks whether the sharpness phase transition below 2/η still occurs.
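
As a hedged illustration of what such an intervention might look like, the sketch below removes the component of the weight-decay term that lies along an estimate of the sharpness gradient, so that decay no longer changes sharpness to first order. This is a hypothetical design, not something the paper or the review describes; `decay_step_without_alignment` and the example vectors are made up for illustration.

```python
def decay_step_without_alignment(theta, sharp_grad, gamma):
    """Return the weight-decay vector gamma*theta with its component along
    sharp_grad removed (hypothetical intervention, not from the paper)."""
    dot = sum(t * g for t, g in zip(theta, sharp_grad))
    norm_sq = sum(g * g for g in sharp_grad)
    if norm_sq == 0.0:
        return [gamma * t for t in theta]  # nothing to project out
    coef = dot / norm_sq
    return [gamma * (t - coef * g) for t, g in zip(theta, sharp_grad)]


theta = [1.0, 2.0, -1.0]
sharp_grad = [0.5, 1.0, 0.0]  # placeholder for an estimate of grad S
decay = decay_step_without_alignment(theta, sharp_grad, gamma=0.1)
overlap = sum(d * g for d, g in zip(decay, sharp_grad))  # zero up to rounding
```

If the sub-2/η stabilization survived training with this projected decay term, that would count against alignment as the driver; if it disappeared, that would support it.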

If this is right

Weight decay provides a controllable way to reduce progressive sharpening across different neural network trainings.
CNNs and MLPs require different weight decay settings to achieve stable behavior at the edge of stability.
In MLPs, sufficiently large weight decay keeps sharpness stably below the conventional stability limit.
Stability gains appear not only in parameter space but also in function-space dynamics tracked by the NTK.
Curvature-based rules for detecting instability need revision when weight decay or similar regularization is active.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Tuning weight decay separately for convolutional versus fully connected layers could improve overall training reliability.
The alignment mechanism may extend to other regularizers or adaptive optimizers and could be monitored as a practical stability signal.
Similar phase transitions might appear in newer architectures such as transformers when weight decay is varied.
The framework offers a route to test whether disrupting alignment experimentally removes the observed MLP transition.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

Weight decay slows progressive sharpening and triggers an architecture-dependent phase transition in MLPs via parameter-sharpness alignment, but the causal isolation of that alignment still needs work.

This paper finds that weight decay does more than regularize: it slows sharpening during training and produces different stability effects depending on architecture. In CNNs it damps EoS oscillations; in MLPs it makes sharpness stabilize well below the usual 2/η threshold, which the authors attribute to an alignment between the parameter vector and the sharpness gradient. They also link this to better-behaved NTK dynamics, which is a useful bridge to function-space stability. The architecture split and the alignment story are the genuinely new pieces relative to prior EoS work. The empirical patterns look consistent with what the authors describe, and the framework they build reproduces the observed behavior without obvious circularity in the reported results. Credit to them for checking both MLP and CNN cases and for moving beyond simple curvature heuristics.

The main soft spot is the causal status of the alignment mechanism. Weight decay changes norms, effective step sizes, and the loss surface at the same time, so alignment could easily be a downstream correlate rather than the driver. The stress-test note is right to flag the lack of an explicit intervention that would hold alignment fixed while varying decay, or vice versa. Without that, or without a clear argument ruling out direct spectral effects from the L2 term, the mechanistic claim rests more on modeling fit than on the elimination of alternatives. Judged from the abstract and the reported claims, the derivations appear internally consistent, but a referee would want to see the full error analysis and controls.

This paper is for readers already working on edge-of-stability dynamics or regularization effects in deep nets. It gives them concrete new observations and a modeling approach worth testing on other architectures. The thinking is clear and engaged with the literature, so it deserves a serious referee even if the causality section needs tightening. I would send it to review.

Referee Report

2 major / 2 minor

Summary. The paper investigates the effects of weight decay on training stability at the Edge of Stability (EoS). It claims that weight decay robustly slows progressive sharpening, reveals an architecture-dependent phase transition (dampening oscillations in CNNs but causing sharpness to stabilize below the 2/η threshold in MLPs), develops a mathematical framework that models these phenomena, and identifies the global alignment of the parameter vector with the sharpness gradient as the mechanistic driver of the MLP transition. The work further links these dynamics to improved stability in function space via the NTK and argues that curvature thresholds from convex heuristics are unreliable under regularization.

Significance. If the framework holds and the alignment mechanism is shown to be causal rather than correlative, the results would refine understanding of weight decay beyond static regularization, offering mechanistic explanations for its stabilizing role in non-convex optimization. The architecture-specific phase transitions and NTK implications provide concrete, testable predictions that could guide regularization choices in practice and highlight limitations of quadratic stability diagnostics.

major comments (2)

[§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.
[§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.
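
For readers unfamiliar with the quantity under discussion, here is a minimal sketch of a parameter-sharpness alignment computed by central finite differences. The paper's exact definition is not reproduced on this page, so the sketch assumes alignment means the cosine between the parameter vector and the sharpness gradient, and it uses a two-parameter toy loss 0.5·(ab − 1)² whose Hessian is available in closed form. All function names are illustrative.

```python
import math


def sharpness(a, b):
    """Top Hessian eigenvalue of the toy loss 0.5 * (a*b - 1)**2."""
    p, q, r = b * b, 2.0 * a * b - 1.0, a * a  # Hessian [[p, q], [q, r]]
    return 0.5 * (p + r) + math.sqrt((0.5 * (p - r)) ** 2 + q * q)


def sharpness_grad_fd(a, b, h=1e-5):
    """Central finite-difference estimate of the sharpness gradient."""
    da = (sharpness(a + h, b) - sharpness(a - h, b)) / (2.0 * h)
    db = (sharpness(a, b + h) - sharpness(a, b - h)) / (2.0 * h)
    return da, db


def alignment(a, b):
    """Cosine between the parameter vector (a, b) and the sharpness gradient."""
    ga, gb = sharpness_grad_fd(a, b)
    dot = a * ga + b * gb
    return dot / (math.hypot(a, b) * math.hypot(ga, gb))


s = sharpness(2.0, 2.0)               # Hessian [[4, 7], [7, 4]]
cos_theta_grad = alignment(2.0, 2.0)  # symmetric point: vectors are parallel
```

The step size `h` is the kind of choice the second major comment asks the authors to bound: too large and the estimate is biased by curvature of the sharpness itself, too small and it is dominated by rounding error.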

minor comments (2)

[Figure 4] Figure 4 (CNN oscillation damping): The y-axis scaling and oscillation amplitude comparison across weight decay values would benefit from explicit normalization to the no-decay baseline for clearer visual assessment of the dampening effect.
[§2.2] Notation in §2.2: The definition of 'progressive sharpening' is introduced without a precise mathematical expression linking it to the maximum eigenvalue trajectory; a short equation would improve clarity.
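
For illustration, one form the requested equation could take is given below. It uses the usual edge-of-stability conventions rather than the paper's own notation, which is not reproduced here, with $\theta_t$ denoting the gradient descent iterates at learning rate $\eta$.

```latex
S_t \;=\; \lambda_{\max}\!\left(\nabla^2 L(\theta_t)\right),
\qquad
\text{progressive sharpening:}\quad
S_t \ \text{tends to increase with } t \ \text{ while } \ S_t < \frac{2}{\eta}.
```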

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help refine our analysis of weight decay's role in stabilizing training at the Edge of Stability. We respond point-by-point to the major comments below, offering clarifications based on the manuscript's framework and indicating where revisions will strengthen the presentation.

Referee: [§4 and §5.2] §4 (Mathematical Framework) and §5.2 (MLP phase transition analysis): The identification of global alignment between the parameter vector and sharpness gradient as the causal driver is not isolated from other simultaneous effects of weight decay, such as direct modulation of parameter norms or alterations to the Hessian spectrum via the L2 term. No intervention (e.g., constrained optimization preserving alignment while varying decay) is described to break this correlation, leaving open whether alignment is the driver or a downstream correlate.

Authors: Our continuous-time framework in §4 derives the sharpness evolution equation under weight decay, where the alignment term between the parameter vector and sharpness gradient appears explicitly as the factor that induces the sub-2/η stabilization in MLPs. This derivation accounts for the L2 penalty's direct contribution to the loss and Hessian while showing that the phase transition arises specifically from the alignment-driven modification to the sharpness flow, rather than norm modulation in isolation. Empirical matches between the model predictions and observed dynamics across architectures support alignment as the mechanistic driver. We acknowledge that an explicit interventional study (e.g., constrained optimization holding alignment fixed while varying decay) would provide stronger causal separation. In revision we will add a dedicated paragraph in §5.2 discussing confounding effects of weight decay and clarifying the framework's isolation of the alignment mechanism, while noting interventional validation as future work. revision: partial
Referee: [§3.1] §3.1 and Eq. (alignment definition): The framework's modeling of the phase transition relies on the alignment quantity without reported error bounds or sensitivity analysis showing robustness to small perturbations in the sharpness gradient estimate; this is load-bearing for the claim that the framework 'accurately models' the observed stabilization below 2/η.

Authors: We agree that quantifying robustness of the alignment estimate is valuable given its central role. The alignment is obtained via finite-difference approximation of the sharpness gradient; while multi-seed consistency is shown empirically, formal bounds and sensitivity checks were omitted. In the revised manuscript we will include analytic error bounds on the finite-difference approximation and add a sensitivity study that perturbs the sharpness gradient estimate with controlled noise levels (e.g., additive Gaussian perturbations of varying magnitude). These additions will demonstrate that the high alignment values and the predicted sub-2/η stabilization remain stable, thereby reinforcing the framework's modeling accuracy. revision: yes
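
The promised sensitivity study could be sketched roughly as follows. This is an assumption-laden illustration, not the authors' protocol: alignment is taken to be a cosine, the noise is i.i.d. additive Gaussian per coordinate, and the vectors are placeholders rather than measurements.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def alignment_under_noise(theta, sharp_grad, sigma, trials=200, seed=0):
    """Mean and standard deviation of cos(theta, sharp_grad + noise), where
    noise is i.i.d. Gaussian with standard deviation sigma per coordinate."""
    rng = random.Random(seed)
    values = []
    for _ in range(trials):
        noisy = [g + rng.gauss(0.0, sigma) for g in sharp_grad]
        values.append(cosine(theta, noisy))
    mean = sum(values) / trials
    var = sum((v - mean) ** 2 for v in values) / trials
    return mean, math.sqrt(var)


theta = [1.0, 2.0, 3.0, 4.0]
sharp_grad = [1.1, 1.9, 3.2, 3.8]  # placeholder estimate, not real data
clean = cosine(theta, sharp_grad)
for sigma in (0.0, 0.1, 0.5):
    mean, std = alignment_under_noise(theta, sharp_grad, sigma)
    print(f"sigma={sigma}: alignment {mean:.4f} +/- {std:.4f} (clean {clean:.4f})")
```

A report in this style would let a referee see at which noise level the alignment estimate degrades, relative to the finite-difference error the authors say they will bound.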

Circularity Check

0 steps flagged

No significant circularity detected

The paper grounds its claims in direct empirical measurements of training dynamics at the Edge of Stability across architectures, then introduces a separate mathematical framework to reproduce the observed sharpening slowdown and phase transition. The alignment between parameter vector and sharpness gradient is derived as an explanatory variable inside that framework rather than being presupposed by the input data or by any self-referential definition. No equations reduce a prediction to a fitted quantity by construction, no load-bearing result rests solely on self-citation, and no ansatz is imported without independent justification. The derivation therefore remains self-contained against external experimental benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis relies on standard assumptions from optimization theory about loss landscapes and edge-of-stability behavior. No new free parameters or invented entities are explicitly introduced in the abstract; the mathematical framework appears to be a derivation rather than a fitted model.

axioms (2)

domain assumption Training dynamics can be analyzed via progressive sharpening and the edge-of-stability threshold of 2/η
Invoked throughout the abstract as the baseline for observing effects of weight decay.
domain assumption The neural tangent kernel provides a valid lens for function-space stability
Used to translate parameter-space findings into stability claims.
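
To illustrate the second assumption, here is a minimal sketch of the quantity tracked in function space: the top eigenvalue of an empirical NTK, that is, the Gram matrix of per-example parameter gradients. The toy model f(x) = a·tanh(b·x), the inputs, and the function names are assumptions for illustration only; the paper works with MLPs and CNNs.

```python
import math


def jacobian(params, xs):
    """Rows are d f(x_i) / d(a, b) for the toy model f(x) = a * tanh(b * x)."""
    a, b = params
    rows = []
    for x in xs:
        t = math.tanh(b * x)
        rows.append([t, a * x * (1.0 - t * t)])
    return rows


def empirical_ntk(params, xs):
    """Gram matrix K[i][j] = <J_i, J_j> of parameter gradients."""
    J = jacobian(params, xs)
    return [[sum(p * q for p, q in zip(ri, rj)) for rj in J] for ri in J]


def top_eigenvalue(K, iters=500):
    """Largest eigenvalue of a symmetric PSD matrix via power iteration."""
    v = [1.0] * len(K)
    lam = 0.0
    for _ in range(iters):
        w = [sum(kij * vj for kij, vj in zip(row, v)) for row in K]
        lam = math.sqrt(sum(x * x for x in w))
        if lam == 0.0:
            return 0.0
        v = [x / lam for x in w]
    return lam


xs = [-1.0, 0.5, 2.0]
ntk_top = top_eigenvalue(empirical_ntk((1.5, 0.7), xs))
```

Recomputing `ntk_top` at successive training steps gives a trajectory of the kind the review describes as function-space stability.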

pith-pipeline@v0.9.0 · 5755 in / 1459 out tokens · 38798 ms · 2026-05-20T19:45:49.179127+00:00 · methodology

Original abstract

In modern deep learning, weight decay is often credited with "stabilizing" training dynamics, diverging from its classical role as a static regularization penalty. We investigate a fundamental question: *does weight decay stabilize training dynamics, and if so, through which mechanism?* Indeed, training stability is understood through different but related notions in the literature. We consider how weight decay affects the parameter-space dynamics and loss sharpness by analyzing its effects at the *Edge of Stability* (EoS). We show that weight decay robustly slows *progressive sharpening*. Furthermore, we uncover a striking architecture-dependent phase transition. In CNNs, weight decay dampens the oscillations at the EoS, while in MLPs, increasing weight decay causes a phase transition in which the sharpness stabilizes at a threshold significantly below the theoretical $\frac{2}{\eta}$ boundary. We develop a mathematical framework that accurately models these phenomena and identify the global alignment of the parameter vector and the sharpness gradient as the mechanistic driver of the phase transition. Importantly, we show that these phenomena translate into stability in terms of search in function-space (NTK). Last, this shows that curvature thresholds obtained from convex/quadratic heuristics may not be reliable stability diagnostics under regularization.

Figures

Figures reproduced from arXiv: 2605.16622 by Amir Kolic, Marius Saether, Pierfrancesco Beneventano, Tomaso Poggio.

**Figure 2.** Weight decay dampens the oscillations at the EoS. On the left, the dampening for a toy loss model along with a visible γ = 0.1 shift in the stabilization threshold. On the right, a CNN trained on a cifar10-5k subset, showing the behavior of a dampened harmonic oscillator for γ = 0.01. Both had a learning rate of η = 0.01, and the dashed line is at 2/η. The chaotic behavior of γ = 0 after step ≈ 1000 for t…

**Figure 3.** A diagram illustrating three intertwined notions of training stability and weight decay’s …

**Figure 4.** Empirical evaluation of weight decay’s effect on Edge of Stability (EoS) dynamics for …

**Figure 5.** The model predicts the dynamics of $x_t$ and $y_t$ on the left and in the middle column, respectively. The $x_t$, $y_t$ phase space is shown on the right. Increasing the weight decay introduces a stronger dampening factor, and we see that the sharpness oscillations decay faster. The $y_t$ resting threshold is also shifted by −γ. This formulation exactly predicts two of our empirical observations. Firstly, weight decay da…

**Figure 6.** An illustration of a simplified mental model of EoS introduces optimization along a …

**Figure 7.** The sharpness dynamics (top) and the evolution of the …

**Figure 8.** The top eigenvalue of the empirical NTK for an MLP, …

**Figure 9.** Training dynamics for an MLP with ReLU activations trained with full batch gradient …

**Figure 10.** Sharpness trajectories for an MLP (left, …

**Figure 11.** Stabilizing sharpness as a function of γ (reproduced from …

**Figure 12.** Evolution of eigenvalues of the loss Hessian. CNN with ReLU trained on a 5k subset of …

**Figure 13.** Sharpness (top) and $c_y(t)$, $c_y^{\mathrm{crit}}$ (bottom) during training of an MLP with ReLU on a 5k subset of cifar10, with η = 0.02 and full batch gradient descent. For small γ (left), $c_y^{\mathrm{crit}}$ is larger than $c_y$ until the sharpness reaches 2/η − γ. For large γ (right), the 1/γ scaling keeps $c_y^{\mathrm{crit}}$ small, allowing $c_y(t)$ to cross it before the sharpness reaches 2/η − γ

**Figure 14.** Sharpness (top) and $c_y(t)$, $c_y^{\mathrm{crit}}$ (bottom) during training of a CNN with ReLU on a 5k subset of cifar10, with η = 0.02 and full batch gradient descent. For both values of γ, $c_y$ stays below $c_y^{\mathrm{crit}}$ until the sharpness reaches 2/η

**Figure 15.** Convergence of the normalized NTK top eigenvalue …

**Figure 16.** MLP with MSE loss trained with full batch gradient descent, …

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Edge of Stability Selectively Shapes Learning Across the Data Distribution
cs.LG 2026-06 unverdicted novelty 6.0

Edge of stability acts as a selective mechanism that amplifies learning on data groups with aligned persistent gradients while suppressing others.


[1] A. N. Tikhonov. Solution of incorrectly formulated problems and the regularization method. Soviet Math. Dokl., 5:1035–1038, 1963.

[2] Arthur E. Hoerl and Robert W. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.

[3] Stephen Hanson and Lorien Pratt. Comparing biases for minimal network construction with back-propagation. In D. Touretzky, editor, Advances in Neural Information Processing Systems, volume 1. Morgan-Kaufmann, 1988.

[4] Anders Krogh and John Hertz. A simple weight decay can improve generalization. In J. Moody, S. Hanson, and R.P. Lippmann, editors, Advances in Neural Information Processing Systems, volume 4. Morgan-Kaufmann, 1991.

[5] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 19(70):1–57, 2018.

[6] Pierfrancesco Beneventano, Andrea Pinto, and Tomaso Poggio. How neural networks learn the support is an implicit regularization effect of SGD. arXiv preprint arXiv:2406.11110, 2024.

[7] Tom Jacobs, Chao Zhou, and Rebekka Burkholz. Mirror, mirror of the flow: How does regularization shape implicit bias? arXiv preprint arXiv:2504.12883, 2025.

[8] Tomer Galanti, Zachary S Siegel, Aparna Gupte, and Tomaso Poggio. SGD and weight decay secretly minimize the rank of your neural network. arXiv preprint arXiv:2206.05794, 2022.

[9] Ke Chen, Chugang Yi, and Haizhao Yang. Towards better generalization: Weight decay induces low-rank bias for neural networks. arXiv preprint arXiv:2410.02176, 2024.

[10] Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, and Francesco Tudisco. Provable emergence of deep neural collapse and low-rank bias in L2-regularized nonlinear networks. arXiv preprint arXiv:2402.03991, 2024.

[11] David Yunis, Kumar Kshitij Patel, Samuel Wheeler, Pedro Savarese, Gal Vardi, Karen Livescu, Michael Maire, and Matthew R Walter. Approaching deep learning through the spectral dynamics of weights. arXiv preprint arXiv:2408.11804, 2024.

[12] Ilya Loshchilov and Frank Hutter. Fixing weight decay regularization in Adam. CoRR, abs/1711.05101, 2017.

[13] Ge Yang, Edward Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tuning large neural networks via zero-shot hyperparameter transfer. Advances in Neural Information Processing Systems, 34:17084–17097, 2021.

[14] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. CoRR, abs/1806.07572, 2018.

[15] Chen Xing, Devansh Arpit, Christos Tsirigotis, and Yoshua Bengio. A walk with SGD. arXiv preprint arXiv:1802.08770, 2018.

[16] Stanisław Jastrzębski, Zachary Kenton, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. On the relation between the sharpest directions of DNN loss and the SGD step length. arXiv preprint arXiv:1807.05031, 2018.

[17] Stanislaw Jastrzebski, Maciej Szymczak, Stanislav Fort, Devansh Arpit, Jacek Tabor, Kyunghyun Cho, and Krzysztof J. Geras. The break-even point on optimization trajectories of deep neural networks. CoRR, abs/2002.09572, 2020.

[18] Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. CoRR, abs/2103.00065, 2021.

[19] Jeremy M Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.

[20] Arseniy Andreyev and Pierfrancesco Beneventano. Edge of stochastic stability: Revisiting the edge of stability for SGD. arXiv preprint arXiv:2412.20553, 2024.

[21] Arseniy Andreyev, Advikar Ananthkumar, Marc Walden, Tomaso Poggio, and Pierfrancesco Beneventano. Momentum further constrains sharpness at the edge of stochastic stability. arXiv preprint arXiv:2604.14108, 2026.

[22] Rustem Islamov, Michael Crawshaw, Jeremy Cohen, and Robert Gower. Non-Euclidean gradient descent operates at the edge of stability. arXiv preprint arXiv:2603.05002, 2026.

[23] Alex Damian, Eshaan Nichani, and Jason D Lee. Self-stabilization: The implicit bias of gradient descent at the edge of stability. arXiv preprint arXiv:2209.15594, 2022.

[24] Francesco d’Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? Advances in Neural Information Processing Systems, 37:23191–23223, 2024.

[25] Twan van Laarhoven. L2 regularization versus batch and weight normalization. CoRR, abs/1706.05350, 2017.

[26] Zeke Xie, Zhiqiang Xu, Jingzhao Zhang, Issei Sato, and Masashi Sugiyama. On the overlooked pitfalls of weight decay and how to mitigate them: A gradient-norm perspective. Advances in Neural Information Processing Systems, 36:1208–1228, 2023.

[27] Juseung Yun, Byungjoo Kim, and Junmo Kim. Weight decay scheduling and knowledge distillation for active learning. In European Conference on Computer Vision, pages 431–447. Springer, 2020.

[28] Aditya Sharad Golatkar, Alessandro Achille, and Stefano Soatto. Time matters in regularizing deep networks: Weight decay and data augmentation affect early learning dynamics, matter little near convergence. Advances in Neural Information Processing Systems, 32, 2019.

[29] Johan Bjorck, Kilian Q Weinberger, and Carla Gomes. Understanding decoupled and early weight decay. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 6777–6785, 2021.

[30] Atli Kosson, Bettina Messmer, and Martin Jaggi. Rotational equilibrium: How weight decay balances learning across neural networks. arXiv preprint arXiv:2305.17212, 2023.

[31] Jeremy M Cohen, Alex Damian, Ameet Talwalkar, J Zico Kolter, and Jason D Lee. Understanding optimization in deep learning with central flows. arXiv preprint arXiv:2410.24206, 2024.

[32] Kaifeng Lyu, Zhiyuan Li, and Sanjeev Arora. Understanding the generalization benefit of normalization layers: Sharpness reduction. Advances in Neural Information Processing Systems, 35:34689–34708, 2022.

[33] Lorenzo Noci, Alexandru Meterez, Thomas Hofmann, and Antonio Orvieto. Super consistency of neural network landscapes and learning rate transfer. Advances in Neural Information Processing Systems, 37:102696–102743, 2024.

[34] Kaiqi Jiang, Jeremy Cohen, and Yuanzhi Li. Understanding the evolution of the neural tangent kernel at the edge of stability. arXiv preprint arXiv:2507.12837, 2025.

[35] Clarissa Lauditi, Cengiz Pehlevan, and Blake Bordelon. Spectral dynamics in deep networks: Feature learning, outlier escape, and learning rate transfer, 2026.

[36] Atli Kosson, Jeremy Welborn, Yang Liu, Martin Jaggi, and Xi Chen. Weight decay may matter more than μP for learning rate transfer in practice. arXiv preprint arXiv:2510.19093, 2025.

[37] James R Bunch, Christopher P Nielsen, and Danny C Sorensen. Rank-one modification of the symmetric eigenproblem. Numerische Mathematik, 31(1):31–48, 1978.

A Empirical Results

A.1 EoS behaviour at lower sharpness threshold

Figure 9 shows an MLP trained with stepsize η = 0.02 and weight decay γ = 0.02. The sharpness stabilizes around 80, far below the weig...

[38] The sharpness trajectory is consistent across seeds, suggesting that the observed phenomenon of sharpness stabilizing far below 2/η − γ is not an artifact of a particular initialization.

[Figure 16 plot: sharpness versus step; legend: mean sharpness ± 2 std, 2/η]

Figure 16: MLP with MSE loss trained with full batch gradient descent, η = 0.02...

[39] shows that throughout Phase III, $\|v_{t+1}\|_2 > \|v_t\|_2$ at each step, driven by the $\eta^2$ correction term $\Delta_t \frac{\eta^2}{n} \lambda_1 \langle E_t, q_1 \rangle^2$, which is positive whenever $\Delta_t > 0$ (i.e., whenever $\frac{\lambda_1}{n} c_t^2 > \frac{2}{\eta}$). Moreover, Theorem 1(B) provides an overall increase across Phases III and IV. Under Assumption 4 ($\|E_{t_2}\|_2 \le \|E_{t_1}\|_2$) and the condition $\Delta_{t_1} \ge \Omega(\delta^2/\eta)$, one obtains $\alpha_{t_2} > \alpha_{t_1}$. U...