
arxiv: 2512.08217 · v3 · submitted 2025-12-09 · 💻 cs.LG

Correction of Decoupled Weight Decay

Pith reviewed 2026-05-17 00:27 UTC · model grok-4.3

classification 💻 cs.LG
keywords decoupled weight decay · weight norm stability · AdamW · Scion optimizer · learning rate scaling · effective learning rate · training dynamics

The pith

Decoupled weight decay should scale with the square of the learning rate to stabilize weight norms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper questions the common practice of setting decoupled weight decay proportional to the learning rate γ. It derives instead that decay proportional to γ² keeps weight norms stable, provided updates become independent of the current weights at steady state. The same independence assumption lets the authors characterize the Total Update Contribution (TUC) of a minibatch under the Scion optimizer through a momentum-dependent effective learning rate. The paper reports that these scalings give tighter control over both weight and gradient norms and improve final model performance across optimizers.

Core claim

Decoupled weight decay ∝ γ² results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers, and decoupled weight decay ∝ γ² leads to stable weight and gradient norms that allow better control of the training dynamics and improved model performance.

What carries the argument

The assumption that updates become independent of the weights at steady state, used to derive the γ² scaling for decoupled weight decay that stabilizes norms.
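
The step this assumption enables can be sketched as a short derivation. The notation is an assumption made here for illustration and may differ from the paper's: $w_t$ is the weight vector, $u_t$ the optimizer's update direction, $\gamma$ the learning rate, and $\eta$ the per-step decay coefficient. The sketch additionally assumes that $u_t$ has zero mean, so that independence makes the cross term vanish in expectation.

```latex
\begin{align*}
w_{t+1} &= (1-\eta)\,w_t - \gamma\,u_t \\
\mathbb{E}\|w_{t+1}\|^2 &= (1-\eta)^2\,\mathbb{E}\|w_t\|^2
  - 2\gamma(1-\eta)\,\mathbb{E}\langle w_t, u_t\rangle
  + \gamma^2\,\mathbb{E}\|u_t\|^2 \\
\mathbb{E}\langle w_t, u_t\rangle = 0
  \;&\Rightarrow\;
  \mathbb{E}\|w_{t+1}\|^2 = (1-\eta)^2\,\mathbb{E}\|w_t\|^2 + \gamma^2\,\mathbb{E}\|u_t\|^2 \\
\text{steady state:}\quad
\mathbb{E}\|w\|^2 &= \frac{\gamma^2\,\mathbb{E}\|u\|^2}{2\eta - \eta^2}
  \approx \frac{\gamma^2\,\mathbb{E}\|u\|^2}{2\eta}
  \qquad (\eta \ll 1)
\end{align*}
```

Under these assumptions, choosing $\eta \propto \gamma^2$ makes the steady-state norm independent of $\gamma$, whereas the conventional $\eta = \gamma\lambda$ gives $\mathbb{E}\|w\|^2 \approx \gamma\,\mathbb{E}\|u\|^2 / (2\lambda)$, which moves with the learning rate.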

If this is right

  • Weight and gradient norms remain stable throughout training.
  • Training dynamics can be controlled more directly by the choice of learning rate and weight decay.
  • The optimal momentum-dependent effective learning rate transfers across different runs and models.
  • Model performance improves when the γ² scaling is used instead of the conventional linear scaling.
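
The first and last points can be illustrated with a toy simulation. This is a minimal sketch, not the paper's experiment: updates are drawn as random unit vectors that are independent of the weights by construction, and the coefficients `2.0` and `200.0` are hypothetical values chosen so that both rules coincide at a learning rate of 0.01.

```python
import math
import random


def steady_state_norm(gamma, eta, dim=128, steps=2000, seed=0):
    """Average ||w|| over the second half of a run of
    w <- (1 - eta) * w - gamma * u, with u a fresh random unit vector."""
    rng = random.Random(seed)
    w = [0.0] * dim
    norms = []
    for step in range(steps):
        # Toy assumption: the update is drawn independently of w.
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        scale = gamma / math.sqrt(sum(x * x for x in u))
        w = [(1.0 - eta) * wi - scale * ui for wi, ui in zip(w, u)]
        if step >= steps // 2:
            norms.append(math.sqrt(sum(x * x for x in w)))
    return sum(norms) / len(norms)


def norm_ratio(decay_rule, gamma_small=0.01, gamma_large=0.02):
    """Ratio of steady-state norms at the larger vs. smaller learning rate."""
    small = steady_state_norm(gamma_small, decay_rule(gamma_small))
    large = steady_state_norm(gamma_large, decay_rule(gamma_large))
    return large / small


# Hypothetical coefficients, chosen so both rules agree at gamma = 0.01.
ratio_linear = norm_ratio(lambda g: 2.0 * g)           # eta proportional to gamma
ratio_quadratic = norm_ratio(lambda g: 200.0 * g * g)  # eta proportional to gamma^2

print(f"linear rule:    norm ratio = {ratio_linear:.3f}")
print(f"quadratic rule: norm ratio = {ratio_quadratic:.3f}")
```

In this toy model, doubling the learning rate should leave the norm roughly unchanged under the quadratic rule and grow it by roughly a factor of √2 under the linear rule. Real optimizers do not produce updates that are independent by construction, so this only shows what the assumption implies, not whether it holds.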

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The independence assumption could be tested directly by measuring correlation between updates and weights late in training.
  • The same scaling rule may apply to other adaptive optimizers beyond AdamW and Scion.
  • Practitioners training large models could adopt the γ² rule to reduce the need for manual norm monitoring.
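
The first suggestion above could be prototyped with a simple alignment metric. The sketch below is an assumption about how such a check might look, not a procedure from the paper: it computes the cosine between flattened weight and update vectors, and the function names and toy snapshots are hypothetical. A cosine near zero is necessary for the cross term to vanish but does not by itself establish statistical independence.

```python
import math


def cosine_similarity(weights, update):
    """Cosine of the angle between a flattened weight vector and an update."""
    dot = sum(w * u for w, u in zip(weights, update))
    norm_w = math.sqrt(sum(w * w for w in weights))
    norm_u = math.sqrt(sum(u * u for u in update))
    if norm_w == 0.0 or norm_u == 0.0:
        return 0.0
    return dot / (norm_w * norm_u)


def mean_alignment(snapshots):
    """Average cosine over (weights, update) pairs logged late in training."""
    values = [cosine_similarity(w, u) for w, u in snapshots]
    return sum(values) / len(values)


# Hypothetical toy snapshots, not measurements from the paper.
toy = [([1.0, 0.0], [0.0, 1.0]), ([1.0, 1.0], [1.0, -1.0])]
print(mean_alignment(toy))
```

In practice the pairs would be logged per layer at regular intervals, since alignment may differ between layers and over the course of training.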

Load-bearing premise

Updates become independent of the weights at steady state.

What would settle it

An experiment showing that weight norms still drift when decoupled weight decay is set proportional to γ², or that removing the perpendicular component of the update changes training dynamics substantially.
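
To make "the perpendicular component of the update" concrete, the following sketch splits an update into parts parallel and perpendicular to the weights and checks how much the perpendicular part adds to the squared weight norm. It is plain vector geometry under assumed notation, not the paper's renormalization procedure, and the values of `w`, `u`, and `gamma` are hypothetical.

```python
import math


def split_update(weights, update):
    """Split an update into components parallel and perpendicular to the weights."""
    w_sq = sum(w * w for w in weights)
    if w_sq == 0.0:
        return [0.0] * len(update), list(update)
    coeff = sum(w * u for w, u in zip(weights, update)) / w_sq
    parallel = [coeff * w for w in weights]
    perpendicular = [u - p for u, p in zip(update, parallel)]
    return parallel, perpendicular


def sq_norm(v):
    """Squared Euclidean norm."""
    return sum(x * x for x in v)


# Hypothetical toy values for illustration only.
w = [3.0, 4.0]
u = [1.0, 2.0]
gamma = 0.1
par, perp = split_update(w, u)

full_step = [wi - gamma * ui for wi, ui in zip(w, u)]
par_step = [wi - gamma * pi for wi, pi in zip(w, par)]
# By Pythagoras, the perpendicular part adds gamma^2 * ||perp||^2 to ||w||^2.
gap = sq_norm(full_step) - sq_norm(par_step)
print(math.isclose(gap, gamma**2 * sq_norm(perp), abs_tol=1e-9))
```

The perpendicular contribution is second order in the learning rate, which is why arguments about it and arguments about γ² scaling are easy to conflate.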

Figures

Figures reproduced from arXiv: 2512.08217 by Jason Chuan-Chih Chou.

Figure 1: Training a ViT-S/16 with “Renormalized” AdamW results in negligible differences in top […]
Figure 2: ImageNet-1k top-1 val. accuracy of simple ViT-S/16 trained for 90 epochs with momen […]
Figure 3: Simple ViT-S/16 trained on ImageNet-1k for 90 epochs with ScionC (Algorithm 2 with […]
Figure 4: Training 124M Modded-NanoGPT on FineWeb-Edu-100B, Scion vs. ScionC.
Figure 5: Training ViT-S/16 on ImageNet-1k, AdamW (upper) vs. AdamC (lower).
Figure 6: Training ViT-S/16 on ImageNet-1k, Scion (upper) vs. ScionC (cosine, lower).
Figure 7: Numerical simulations of the system described by Eq. 5 where […]
Figure 8: Training a ViT-S/16 on ImageNet-1k for 90 epochs, AdamC with […]
Figure 9: Training a ViT-S/16 with momentum scheduling that erroneously matches […]
Figure 10: Training a ViT-S/16 with momentum scheduling that erroneously matches […]
Figure 11: Stress testing ScionC by training a ViT-S/16 with momentum scheduling. Properly scaled […]
Figure 12: Comparing Scion, ScionC, and ScionC that scales […]
Figure 13: Training ViT-S/16 on ImageNet-1k, Scion ( […]
Original abstract

Decoupled weight decay, solely responsible for the performance advantage of AdamW over Adam, has long been set to proportional to learning rate $\gamma$ without questioning. Some researchers have recently challenged such assumption and argued that decoupled weight decay should be set $\propto \gamma^2$ instead based on orthogonality arguments at steady state. To the contrary, we find that eliminating the contribution of the perpendicular component of the update to the weight norm leads to little change to the training dynamics. Instead, we derive that decoupled weight decay $\propto \gamma^2$ results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer. Based on the same assumption, we derive and empirically verify that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is better characterized by the momentum-dependent effective learning rate whose optimal value transfers and we show that decoupled weight decay $\propto \gamma^2$ leads to stable weight and gradient norms and allows us to better control the training dynamics and improve the model performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript argues that the conventional choice of decoupled weight decay proportional to the learning rate γ should instead be set proportional to γ² to produce stable weight norms. This follows from modeling the weight-norm evolution under the assumption that, at steady state, the update vector becomes statistically independent of the current weight vector irrespective of the optimizer. The authors further derive that the Total Update Contribution (TUC) of a minibatch under the Scion optimizer is governed by a momentum-dependent effective learning rate whose optimal value transfers across settings, and they report that the γ² scaling yields stable weight and gradient norms together with improved training dynamics and model performance.

Significance. If the independence assumption is valid and the empirical results hold under rigorous controls, the work supplies a theoretically motivated correction to a widely used hyper-parameter that could improve stability and controllability in large-scale optimizers such as AdamW and Scion.

major comments (2)
  1. [Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.
  2. [Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.
minor comments (2)
  1. [Scion optimizer analysis] Define the precise expression for the momentum-dependent effective learning rate used in the Scion TUC analysis.
  2. [Introduction] Clarify whether the orthogonality argument mentioned in the abstract is retained or superseded by the independence assumption.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and will revise the paper to incorporate additional empirical support and statistical reporting as outlined.

Point-by-point responses
  1. Referee: [Derivation of weight-norm stability (abstract and main derivation)] The derivation that decoupled weight decay ∝ γ² produces stable weight norm rests entirely on the unverified premise that updates become independent of weights at steady state. This assumption is invoked to eliminate cross terms in the norm-evolution equation, yet the manuscript provides neither a formal proof nor a quantitative measurement (e.g., empirical covariance or correlation statistics) that the covariance vanishes for the optimizers considered.

    Authors: The independence assumption is presented as a modeling simplification justified by the stochastic and high-dimensional nature of training, where updates at steady state tend to decorrelate from weights. While a general formal proof is not provided (as it would require restrictive assumptions on the loss landscape not holding for arbitrary optimizers), we will add quantitative empirical verification. A new subsection will report correlation coefficients and covariance norms between update and weight vectors across training for AdamW and Scion, confirming the cross terms are small at steady state. This strengthens the derivation without overclaiming universality. revision: yes

  2. Referee: [Empirical verification section] The empirical verification of TUC characterization and norm stability is described without error bars, statistical significance tests, or explicit exclusion criteria for runs, which weakens the support for the claim that the γ² rule improves performance over the conventional γ scaling.

    Authors: We agree that the empirical section would be strengthened by greater statistical rigor. The revised manuscript will include error bars (standard deviation over multiple independent runs with different seeds), p-values from t-tests or equivalent for performance comparisons, and explicit statements of run inclusion/exclusion criteria (e.g., divergence thresholds). These additions will better substantiate the observed benefits of γ² scaling on norms and performance. revision: yes
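
The statistical reporting promised in this response could look roughly like the sketch below. It is an assumption about one reasonable way to do it, not the authors' analysis: it reports mean and sample standard deviation across seeds and computes Welch's t statistic with Welch-Satterthwaite degrees of freedom using only the standard library. The two lists are placeholder numbers, not results from the paper.

```python
import math
import statistics


def summarize(values):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(values), statistics.stdev(values)


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (
        var_a ** 2 / (len(a) - 1) + var_b ** 2 / (len(b) - 1)
    )
    return t, df


# Placeholder per-seed metrics for two configurations (NOT real results).
placeholder_a = [1.0, 2.0, 3.0]
placeholder_b = [2.0, 3.0, 4.0]

mean_a, std_a = summarize(placeholder_a)
mean_b, std_b = summarize(placeholder_b)
t_stat, dof = welch_t(placeholder_a, placeholder_b)
print(f"A: {mean_a:.2f} +/- {std_a:.2f}, B: {mean_b:.2f} +/- {std_b:.2f}")
print(f"Welch t = {t_stat:.3f}, df = {dof:.1f}")
```

Converting the statistic to a p-value requires the Student t distribution, which the standard library does not provide; a statistics package would be needed for that step.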

Circularity Check

1 step flagged

Derivation of ∝ γ² for weight-norm stability rests on untested independence of updates from weights at steady state

specific steps
  1. self-definitional [Abstract]
    "we derive that decoupled weight decay ∝ γ² results in stable weight norm based on the simple assumption that updates become independent of the weights at steady state, regardless of the nature of the optimizer"

    The stable-norm outcome is produced directly by substituting the independence assumption into the norm-evolution equation and canceling cross terms; the proportionality therefore follows tautologically from the premise rather than from external data or a separate derivation.

full rationale

The paper's central derivation obtains the ∝ γ² proportionality for stable weight norm by algebraic rearrangement under the modeling assumption that updates become independent of weights at steady state. This assumption is invoked to cancel cross terms in the norm evolution equation but is presented as 'simple' without formal proof or quantitative verification that covariance vanishes. The resulting claim is therefore equivalent to the premise by construction rather than an independent first-principles result. Other claims (TUC under Scion) receive empirical checks, but the load-bearing weight-norm stability result does not.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The load-bearing element is a single domain assumption about steady-state update independence; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption: updates become independent of the weights at steady state
    Invoked to derive the γ² scaling for decoupled weight decay and the effective learning rate for Scion.


