Correction of Decoupled Weight Decay

Jason Chuan-Chih Chou

arxiv: 2512.08217 · v3 · submitted 2025-12-09 · 💻 cs.LG

Pith reviewed 2026-05-17 00:27 UTC · model grok-4.3

classification 💻 cs.LG

keywords decoupled weight decay · weight norm stability · AdamW · Scion optimizer · learning rate scaling · effective learning rate · training dynamics

The pith

Decoupled weight decay should scale with the square of the learning rate to stabilize weight norms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the common choice of setting decoupled weight decay proportional to the learning rate γ. It derives instead that decay proportional to γ² produces stable weight norms, provided updates become independent of the current weights at steady state. The same steady-state independence assumption also lets the author characterize the total update contribution of a minibatch under the Scion optimizer through a momentum-dependent effective learning rate. According to the paper, these scalings give tighter control over both weight and gradient norms and improve final model performance across optimizers.

Core claim

Decoupled weight decay ∝ γ² results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers, and decoupled weight decay ∝ γ² leads to stable weight and gradient norms that allow better control of the training dynamics and improved model performance.

What carries the argument

The assumption that updates become independent of the weights at steady state, used to derive the γ² scaling for decoupled weight decay that stabilizes norms.
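One way to see why the independence assumption points to γ² is the short sketch below. The notation is assumed here rather than taken from the paper: $w_t$ are the weights, $u_t$ is the optimizer's update direction, $\gamma$ is the learning rate, and $\eta$ is the per-step decay factor (for AdamW-style decay, $\eta = \gamma\lambda$). The independence assumption is used only in the weaker form that the cross term $\mathbb{E}\langle w_t, u_t\rangle$ vanishes.

```latex
\begin{aligned}
w_{t+1} &= (1-\eta)\,w_t - \gamma\,u_t \\
\mathbb{E}\|w_{t+1}\|^2 &= (1-\eta)^2\,\mathbb{E}\|w_t\|^2
  - 2\gamma(1-\eta)\,\mathbb{E}\langle w_t, u_t\rangle
  + \gamma^2\,\mathbb{E}\|u_t\|^2 \\
\text{steady state, } \mathbb{E}\langle w_t, u_t\rangle = 0:\quad
\mathbb{E}\|w\|^2 &= \frac{\gamma^2\,\mathbb{E}\|u\|^2}{2\eta-\eta^2}
  \approx \frac{\gamma^2}{2\eta}\,\mathbb{E}\|u\|^2 \qquad (\eta \ll 1)
\end{aligned}
```

Under this sketch, and if $\mathbb{E}\|u\|^2$ does not itself depend on $\gamma$, the equilibrium norm is independent of $\gamma$ when $\eta \propto \gamma^2$. With the conventional $\eta = \gamma\lambda$ at fixed $\lambda$, it instead scales as $\gamma\,\mathbb{E}\|u\|^2 / (2\lambda)$ and so shrinks as the learning rate decays.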

If this is right

Weight and gradient norms remain stable throughout training.
Training dynamics can be controlled more directly by the choice of learning rate and weight decay.
The optimal momentum-dependent effective learning rate transfers across different runs and models.
Model performance improves when the γ² scaling is used instead of the conventional linear scaling.
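The first and last consequences can be illustrated with a toy recurrence for the expected squared weight norm. This is a minimal sketch, not the paper's simulation: the function names, the constants `lam` and `coeff`, and the unit value of `update_sq_norm` are all illustrative assumptions, and the cross term between weights and updates is set to zero by construction.

```python
def steady_state_sq_norm(gamma: float, eta: float,
                         update_sq_norm: float = 1.0,
                         steps: int = 200_000) -> float:
    """Iterate s <- (1 - eta)^2 * s + gamma^2 * E||u||^2.

    This recurrence assumes the cross term E<w, u> is zero, i.e. the
    steady-state independence assumption holds exactly.
    """
    s = 1.0  # arbitrary initial squared norm
    for _ in range(steps):
        s = (1.0 - eta) ** 2 * s + gamma ** 2 * update_sq_norm
    return s


def compare_scalings(gammas, lam: float = 0.04, coeff: float = 4.0):
    """Return {gamma: (linear, squared)} equilibrium squared norms.

    linear  : eta = gamma * lam        (conventional scaling)
    squared : eta = coeff * gamma ** 2 (gamma-squared scaling)
    lam and coeff are hypothetical constants chosen for illustration.
    """
    results = {}
    for gamma in gammas:
        linear = steady_state_sq_norm(gamma, eta=gamma * lam)
        squared = steady_state_sq_norm(gamma, eta=coeff * gamma ** 2)
        results[gamma] = (linear, squared)
    return results


if __name__ == "__main__":
    for gamma, (linear, squared) in compare_scalings([0.01, 0.005, 0.0025]).items():
        print(f"gamma={gamma}: linear={linear:.5f} squared={squared:.5f}")
```

In this toy setting the squared-norm equilibrium under the γ² rule stays essentially flat as γ is reduced, while under linear scaling it falls roughly in proportion to γ. Whether real training runs behave the same way depends on how well the zero-cross-term assumption holds.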

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The independence assumption could be tested directly by measuring correlation between updates and weights late in training.
The same scaling rule may apply to other adaptive optimizers beyond AdamW and Scion.
Practitioners training large models could adopt the γ² rule to reduce the need for manual norm monitoring.
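The first extension above could be approached with a small diagnostic like the following. It is a sketch under stated assumptions: weights and updates are taken to be logged as flat lists of floats per step, and the helper names are hypothetical. A mean cosine near zero is consistent with a vanishing cross term, though it does not by itself establish full statistical independence.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two flat vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_update_weight_cosine(weights_log: Sequence[Sequence[float]],
                              updates_log: Sequence[Sequence[float]],
                              last_k: int) -> float:
    """Average cosine between w_t and u_t over the final last_k logged steps."""
    pairs = list(zip(weights_log, updates_log))[-last_k:]
    if not pairs:
        raise ValueError("no logged steps to average over")
    return sum(cosine_similarity(w, u) for w, u in pairs) / len(pairs)
```

In practice one would call `mean_update_weight_cosine` per layer on snapshots taken late in training and compare the result against the same statistic early in training.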

Load-bearing premise

Updates become independent of the weights at steady state.

What would settle it

An experiment showing that weight norms still drift when decoupled weight decay is set proportional to γ², or that removing the perpendicular component of the update changes training dynamics substantially.
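The second test can be made concrete with the sketch below, which is one possible reading of "removing the perpendicular component's contribution to the weight norm" and may differ from the procedure used in the paper. It applies the full update, then rescales the result to the norm the weights would have had if only the component parallel to the weights had been applied. The function name and the flat-list representation are assumptions for illustration; `u` denotes the full displacement added to `w`.

```python
import math
from typing import List, Sequence


def step_without_perp_norm_growth(w: Sequence[float],
                                  u: Sequence[float]) -> List[float]:
    """Apply w + u, then remove the norm growth caused by the part of u
    that is perpendicular to w (the direction change is kept)."""
    w_sq = sum(x * x for x in w)
    stepped = [x + y for x, y in zip(w, u)]
    if w_sq == 0.0:
        return stepped  # no parallel direction is defined for zero weights
    # Coefficient of the projection of u onto w: u_par = scale_par * w
    scale_par = sum(x * y for x, y in zip(w, u)) / w_sq
    # Norm of w + u_par, i.e. the norm without any perpendicular contribution
    target_norm = abs(1.0 + scale_par) * math.sqrt(w_sq)
    stepped_norm = math.sqrt(sum(x * x for x in stepped))
    if stepped_norm == 0.0:
        return stepped
    return [x * target_norm / stepped_norm for x in stepped]
```

Comparing training curves with and without this modification is one way to check whether the perpendicular component matters for the norm dynamics.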

Figures

Figures reproduced from arXiv: 2512.08217 by Jason Chuan-Chih Chou.

**Figure 2.** ImageNet-1k top-1 val. accuracy of simple ViT-S/16 trained for 90 epochs with momen…

**Figure 3.** Simple ViT-S/16 trained on ImageNet-1k for 90 epochs with ScionC (Algorithm 2 with…

**Figure 4.** Training 124M Modded-NanoGPT on FineWeb-Edu-100B, Scion vs. ScionC.

**Figure 5.** Training ViT-S/16 on ImageNet-1k, AdamW (upper) vs. AdamC (lower).

**Figure 6.** Training ViT-S/16 on ImageNet-1k, Scion (upper) vs. ScionC (cosine, lower).

**Figure 7.** Numerical simulations of the system described by Eq. 5 where…

**Figure 8.** Training a ViT-S/16 on ImageNet-1k for 90 epochs, AdamC with…

**Figure 9.** Training a ViT-S/16 with momentum scheduling that erroneously matches…

**Figure 10.** Training a ViT-S/16 with momentum scheduling that erroneously matches…

**Figure 11.** Stress testing ScionC by training a ViT-S/16 with momentum scheduling. Properly scaled…

**Figure 12.** Comparing Scion, ScionC, and ScionC that scales…

**Figure 13.** Training ViT-S/16 on ImageNet-1k, Scion (…

Original abstract

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $\gamma$ without questioning. Some researchers have recently challenged such assumption and argued that decoupled weight decay should be set $\propto \gamma^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto \gamma^2$ results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers and we show that decoupled weight decay $\propto \gamma^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper derives that decoupled weight decay should scale with γ² for norm stability under a steady-state independence assumption, offering a practical alternative to orthogonality arguments but resting on an unproven premise.

The core point is that decoupled weight decay proportional to γ² keeps weight norms stable if updates become independent of the weights at steady state. This replaces an earlier orthogonality claim and applies across optimizers, including a note on the Scion optimizer where total update contribution tracks a momentum-adjusted effective learning rate. The derivation is straightforward algebraic rearrangement once the independence assumption cancels the cross terms. That part is clean and directly addresses a common hyperparameter choice in AdamW-style training.

The paper also reports that this scaling improves control over norms and yields better model performance in their checks. Those empirical observations are useful for practitioners who tune large models and want fewer free parameters.

The main soft spot is the independence assumption itself. It is called simple and is used to drop the covariance term, yet the text does not supply a formal proof or quantitative measurements showing that the covariance actually vanishes in the regimes of interest, such as with momentum or adaptive statistics. The empirical verification is mentioned but lacks error bars, exclusion rules, or statistical tests, so it is hard to judge how robust the stability result is. Minor gaps like these are common in optimizer papers, but here the assumption carries the central claim.

This work is aimed at researchers who analyze or tune optimizers for deep networks rather than at a broad audience. A reader already familiar with AdamW scaling debates will find the alternative derivation worth seeing. It is solid enough on its own terms to deserve serious referee time, mainly to test the assumption more carefully and tighten the empirical section.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that the conventional choice of decoupled weight decay proportional to the learning rate γ should instead be set proportional to γ² to produce stable weight norms. This follows from modeling the weight-norm evolution under the assumption that, at steady state, the update vector becomes statistically independent of the current weight vector irrespective of the optimizer. The authors further derive that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is governed by a momentum-dependent effective learning rate whose optimal value transfers across settings, and they report that the γ² scaling yields stable weight and gradient norms together with improved training dynamics and model performance.

Significance. If the independence assumption is valid and the empirical results hold under rigorous controls, the work supplies a theoretically motivated correction to a widely used hyper-parameter that could improve stability and controllability in large-scale optimizers such as AdamW and Scion.

major comments (2)

[Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.
[Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.

minor comments (2)

[Scion optimizer analysis] Define the precise expression for the momentum-dependent effective learning rate used in the Scion TUC analysis.
[Introduction] Clarify whether the orthogonality argument mentioned in the abstract is retained or superseded by the independence assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional empirical support and statistical reporting as outlined.

Referee: [Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.

Authors: The independence assumption is presented as a modeling simplification justified by the stochastic and high-dimensional nature of training, where updates at steady state tend to decorrelate from weights. While a general formal proof is not provided (as it would require restrictive assumptions on the loss landscape not holding for arbitrary optimizers), we will add quantitative empirical verification. A new subsection will report correlation coefficients and covariance norms between update and weight vectors across training for AdamW and Scion, confirming the cross terms are small at steady state. This strengthens the derivation without overclaiming universality.

revision: yes

Referee: [Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.

Authors: We agree that the empirical section would be strengthened by greater statistical rigor. The revised manuscript will include error bars (standard deviation over multiple independent runs with different seeds), p-values from t-tests or equivalent for performance comparisons, and explicit statements of run inclusion/exclusion criteria (e.g., divergence thresholds). These additions will better substantiate the observed benefits of γ² scaling on norms and performance.

revision: yes
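The kind of reporting promised in this response could be computed along the following lines. This is a hedged sketch using only the Python standard library: the function names are hypothetical, the inputs are per-seed final metrics supplied by the user, and Welch's unequal-variance t-test is one reasonable choice among the "t-tests or equivalent" mentioned. Converting the statistic to a p-value requires a t-distribution CDF, which the standard library does not provide, so that step is left out.

```python
import math
import statistics
from typing import Sequence, Tuple


def mean_and_std(runs: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs >= 2 runs)."""
    return statistics.mean(runs), statistics.stdev(runs)


def welch_t(a: Sequence[float], b: Sequence[float]) -> Tuple[float, float]:
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom
    for two groups of per-seed results with possibly unequal variances."""
    var_a, var_b = statistics.variance(a), statistics.variance(b)
    se_a, se_b = var_a / len(a), var_b / len(b)
    if se_a + se_b == 0.0:
        raise ValueError("both groups have zero variance")
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(a) - 1) + se_b ** 2 / (len(b) - 1)
    )
    return t, dof
```

Here `a` and `b` would hold, for example, the final validation accuracies of runs with γ and γ² scaling respectively, one value per seed, after applying whatever exclusion rule is stated in the paper.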

Circularity Check

1 step flagged

Derivation of ∝ γ² for weight-norm stability rests on untested independence of updates from weights at steady state

specific steps

self-definitional [Abstract]
"we derive that decoupled weight decay ∝ γ² results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer"

The stable-norm outcome is produced directly by substituting the independence assumption into the norm-evolution equation and canceling cross terms; the proportionality therefore follows tautologically from the premise rather than from external data or a separate derivation.

full rationale

The paper's central derivation obtains the ∝ γ² proportionality for stable weight norm by algebraic rearrangement under the modeling assumption that updates become independent of weights at steady state. This assumption is invoked to cancel cross terms in the norm evolution equation but is presented as 'simple' without formal proof or quantitative verification that covariance vanishes. The resulting claim is therefore equivalent to the premise by construction rather than an independent first-principles result. Other claims (TUC under Scion) receive empirical checks, but the load-bearing weight-norm stability result does not.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The load-bearing element is a single domain assumption about steady-state update independence; no free parameters or new entities are introduced in the abstract.

axioms (1)

Domain assumption: updates become independent of the weights at steady state.
Invoked to derive the γ² scaling for decoupled weight decay and the effective learning rate for Scion.
