Correction of Decoupled Weight Decay
Pith reviewed 2026-05-17 00:27 UTC · model grok-4.3
The pith
Decoupled weight decay should scale with the square of the learning rate to stabilize weight norms.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Setting decoupled weight decay ∝ γ², where γ is the learning rate, yields a stable weight norm. The derivation needs only the assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Under the same assumption, the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by a momentum-dependent effective learning rate whose optimal value transfers. The γ² scaling also leads to stable weight and gradient norms, which allows better control of the training dynamics and improves model performance.
What carries the argument
The assumption that updates become independent of the weights at steady state, used to derive the γ² scaling for decoupled weight decay that stabilizes norms.
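To make the role of this premise concrete, here is a minimal worked sketch, not the paper's own derivation. It assumes a generic decoupled update $w_{t+1} = (1-\eta)\,w_t - \gamma\,u_t$. The per-step decay coefficient $\eta$, the update direction $u_t$, and its typical squared norm $U^2$ are notation introduced here for illustration. Independence is used in the sense that the expected inner product $\langle w_t, u_t \rangle$ vanishes.

```latex
\begin{align}
\|w_{t+1}\|^2 &= (1-\eta)^2 \|w_t\|^2
  - 2(1-\eta)\,\gamma\,\langle w_t, u_t \rangle
  + \gamma^2 \|u_t\|^2 \\
\mathbb{E}\|w_{t+1}\|^2 &= (1-\eta)^2\,\mathbb{E}\|w_t\|^2 + \gamma^2 U^2
  \qquad \text{(cross term dropped: } \mathbb{E}\langle w_t, u_t \rangle = 0\text{)} \\
\mathbb{E}\|w\|^2_{\text{steady}} &= \frac{\gamma^2 U^2}{2\eta - \eta^2}
  \approx \frac{\gamma^2 U^2}{2\eta}
  \qquad (\eta \ll 1)
\end{align}
```

In this sketch, choosing $\eta = c\,\gamma^2$ gives a steady-state squared norm of roughly $U^2 / (2c)$, which no longer depends on $\gamma$, whereas $\eta \propto \gamma$ leaves a residual factor of $\gamma$. Both statements assume $U$ itself does not change with $\gamma$, which is an additional simplification made here.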
If this is right
- Weight and gradient norms remain stable throughout training.
- Training dynamics can be controlled more directly by the choice of learning rate and weight decay.
- The optimal momentum-dependent effective learning rate transfers across different runs and models.
- Model performance improves when the γ² scaling is used instead of the conventional linear scaling.
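The first consequence can be illustrated with a toy simulation. The sketch below is not the paper's experiment: it assumes the update rule w <- (1 - eta) * w - gamma * u, where u is a fresh random unit vector drawn independently of w, and the names `steady_state_sq_norm`, `predicted_sq_norm`, `eta`, and the constants 2.0 and 0.2 are all hypothetical choices made for illustration.

```python
import math
import random


def steady_state_sq_norm(gamma, eta, dim=64, steps=4000, seed=0):
    """Toy model: w <- (1 - eta) * w - gamma * u, with u a fresh random unit
    vector drawn independently of w. Returns ||w||^2 averaged over the second
    half of the run, after the norm has had time to settle."""
    rng = random.Random(seed)
    w = [0.0] * dim
    total = 0.0
    count = 0
    for t in range(steps):
        # Draw a random direction and normalize it to unit length.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm_u = math.sqrt(sum(x * x for x in u))
        # Decoupled decay (factor 1 - eta) plus a step of size gamma.
        w = [(1.0 - eta) * wi - gamma * ui / norm_u for wi, ui in zip(w, u)]
        if t >= steps // 2:
            total += sum(x * x for x in w)
            count += 1
    return total / count


def predicted_sq_norm(gamma, eta):
    """Steady state of E||w||^2 when the cross term averages to zero and ||u|| = 1."""
    return gamma ** 2 / (2.0 * eta - eta ** 2)


if __name__ == "__main__":
    for gamma in (0.1, 0.05):
        # Decay proportional to gamma squared (hypothetical constant 2.0).
        quadratic = steady_state_sq_norm(gamma, eta=2.0 * gamma ** 2)
        # Decay proportional to gamma (hypothetical constant 0.2).
        linear = steady_state_sq_norm(gamma, eta=0.2 * gamma)
        print(f"gamma={gamma}: quadratic={quadratic:.3f} linear={linear:.3f}")
```

In this toy setting the quadratic rule should give roughly the same squared norm for both learning rates, while the linear rule should give a norm that shrinks with the learning rate. Real optimizers produce updates that are not literally independent random directions, so this only shows what the premise implies, not whether the premise holds.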
Where Pith is reading between the lines
- The independence assumption could be tested directly by measuring correlation between updates and weights late in training.
- The same scaling rule may apply to other adaptive optimizers beyond AdamW and Scion.
- Practitioners training large models could adopt the γ² rule to reduce the need for manual norm monitoring.
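The first point above, measuring how aligned updates are with weights, could be sketched as follows. This assumes the weights and the applied update of one layer have been logged as flat lists of floats; `cosine_similarity` and `mean_abs_cosine` are hypothetical helper names, not part of any optimizer library.

```python
import math


def cosine_similarity(update, weights):
    """Cosine of the angle between an update vector and a weight vector."""
    dot = sum(u * w for u, w in zip(update, weights))
    norm_u = math.sqrt(sum(u * u for u in update))
    norm_w = math.sqrt(sum(w * w for w in weights))
    if norm_u == 0.0 or norm_w == 0.0:
        return 0.0  # Convention chosen here: a zero vector has no direction.
    return dot / (norm_u * norm_w)


def mean_abs_cosine(pairs):
    """Average |cosine| over (update, weights) snapshots logged late in training."""
    values = [abs(cosine_similarity(u, w)) for u, w in pairs]
    return sum(values) / len(values)
```

For independent random directions in d dimensions the cosine is typically on the order of 1/sqrt(d), so values well above that late in training would be evidence against the premise. A small cosine is only a necessary sign of independence, not proof of it.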
Load-bearing premise
Updates become independent of the weights at steady state.
What would settle it
An experiment showing that weight norms still drift when decoupled weight decay is set proportional to γ², or that removing the perpendicular component of the update changes training dynamics substantially.
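The second test depends on separating an update into parts parallel and perpendicular to the weights. The sketch below shows that decomposition and how each part contributes to the change in squared weight norm. It is illustrative only: the function names are hypothetical, and how the paper itself eliminates the perpendicular contribution is not stated in this summary.

```python
def split_update(update, weights):
    """Split an update into components parallel and perpendicular to the weights."""
    ww = sum(w * w for w in weights)
    if ww == 0.0:
        # With zero weights there is no direction to project onto.
        return [0.0] * len(update), list(update)
    coeff = sum(u * w for u, w in zip(update, weights)) / ww
    parallel = [coeff * w for w in weights]
    perpendicular = [u - p for u, p in zip(update, parallel)]
    return parallel, perpendicular


def sq_norm_change(update, weights):
    """Return (radial, perpendicular) contributions to ||w + update||^2 - ||w||^2.

    Because the perpendicular part is orthogonal to both w and the parallel
    part, the two contributions add up to the total change."""
    parallel, perpendicular = split_update(update, weights)
    before = sum(w * w for w in weights)
    radial = sum((w + p) ** 2 for w, p in zip(weights, parallel)) - before
    perp = sum(x * x for x in perpendicular)
    return radial, perp
```

The perpendicular contribution is always non-negative, so on its own it can only grow the norm. Comparing training with and without that term is one way to probe the orthogonality argument.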
Original abstract
Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $\gamma$ without questioning. Some researchers have recently challenged such assumption and argued that decoupled weight decay should be set $\propto \gamma^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto \gamma^2$ results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers and we show that decoupled weight decay $\propto \gamma^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript argues that the conventional choice of decoupled weight decay proportional to the learning rate γ should instead be set proportional to γ² to produce stable weight norms. This follows from modeling the weight-norm evolution under the assumption that, at steady state, the update vector becomes statistically independent of the current weight vector irrespective of the optimizer. The authors further derive that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is governed by a momentum-dependent effective learning rate whose optimal value transfers across settings, and they report that the γ² scaling yields stable weight and gradient norms together with improved training dynamics and model performance.
Significance. If the independence assumption is valid and the empirical results hold under rigorous controls, the work supplies a theoretically motivated correction to a widely used hyper-parameter that could improve stability and controllability in large-scale optimizers such as AdamW and Scion.
major comments (2)
- [Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.
- [Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.
minor comments (2)
- [Scion optimizer analysis] Define the precise expression for the momentum-dependent effective learning rate used in the Scion TUC analysis.
- [Introduction] Clarify whether the orthogonality argument mentioned in the abstract is retained or superseded by the independence assumption.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional empirical support and statistical reporting as outlined.
- Referee: [Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.
  Authors: The independence assumption is presented as a modeling simplification justified by the stochastic and high-dimensional nature of training, where updates at steady state tend to decorrelate from weights. While a general formal proof is not provided (as it would require restrictive assumptions on the loss landscape not holding for arbitrary optimizers), we will add quantitative empirical verification. A new subsection will report correlation coefficients and covariance norms between update and weight vectors across training for AdamW and Scion, confirming the cross terms are small at steady state. This strengthens the derivation without overclaiming universality.
  Revision: yes
- Referee: [Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.
  Authors: We agree that the empirical section would be strengthened by greater statistical rigor. The revised manuscript will include error bars (standard deviation over multiple independent runs with different seeds), p-values from t-tests or equivalent for performance comparisons, and explicit statements of run inclusion/exclusion criteria (e.g., divergence thresholds). These additions will better substantiate the observed benefits of γ² scaling on norms and performance.
  Revision: yes
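The statistics named in this response could be computed along the following lines. This is a sketch under stated assumptions, not the authors' analysis code: it assumes each configuration has a list of final metrics (one per seed, at least two seeds, non-zero variance) and uses Welch's unequal-variance t statistic as one possible "t-test or equivalent". The helper names are hypothetical.

```python
import math
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation across seeds, for error bars."""
    return statistics.mean(values), statistics.stdev(values)


def welch_t(a, b):
    """Welch's t statistic and approximate degrees of freedom for two samples.

    Assumes each sample has at least two values and that the combined
    variance is non-zero."""
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df
```

Turning the statistic into a p-value needs the Student t distribution, which the standard library does not provide. SciPy's `scipy.stats.ttest_ind` with `equal_var=False` performs the same test and returns the p-value directly.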
Circularity Check
Derivation of ∝ γ² for weight-norm stability rests on untested independence of updates from weights at steady state
specific steps
- Self-definitional [Abstract]: "we derive that decoupled weight decay ∝ γ² results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer"
  The stable-norm outcome is produced directly by substituting the independence assumption into the norm-evolution equation and canceling cross terms; the proportionality therefore follows tautologically from the premise rather than from external data or a separate derivation.
full rationale
The paper's central derivation obtains the ∝ γ² proportionality for stable weight norm by algebraic rearrangement under the modeling assumption that updates become independent of weights at steady state. This assumption is invoked to cancel cross terms in the norm evolution equation but is presented as 'simple' without formal proof or quantitative verification that covariance vanishes. The resulting claim is therefore equivalent to the premise by construction rather than an independent first-principles result. Other claims (TUC under Scion) receive empirical checks, but the load-bearing weight-norm stability result does not.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: updates become independent of the weights at steady state.