Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics
Pith reviewed 2026-05-21 07:26 UTC · model grok-4.3
The pith
Weight decay acts as a scalar control parameter that separates memorization, developmental grokking, and collapse regimes in small transformers trained on modular arithmetic.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Weight decay acts as a scalar empirical control parameter for the memorization, developmental grokking, and collapse regimes in transformers trained on modular arithmetic. Two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales, the weight-decay axis separates the regimes, with a near-transition logistic fit localizing the memorization-to-developmental boundary at λ_c=0.0158 and a power-law fit giving an empirical exponent ν=0.757.
What carries the argument
Weight decay as a scalar empirical control parameter for training regimes, together with mean pairwise attention-head cosine similarity and entropy standard deviation as cheap online diagnostics computed from attention activations alone.
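The two diagnostics can be illustrated with a minimal sketch. The exact definitions are not given on this page, so the reading below is an assumption: each head's attention map is flattened to a vector for the cosine similarity, and the entropy statistic is the standard deviation across heads of the mean row entropy. The function names and the toy attention maps are hypothetical.

```python
import math
from itertools import combinations

def flatten(head):
    """Flatten one head's (query x key) attention map into a single vector."""
    return [p for row in head for p in row]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def mean_pairwise_head_cosine(attn):
    """Average cosine similarity over all unordered pairs of heads."""
    vecs = [flatten(h) for h in attn]
    pairs = list(combinations(vecs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def head_entropy(head):
    """Mean Shannon entropy (in nats) over the attention rows of one head."""
    ents = [-sum(p * math.log(p) for p in row if p > 0.0) for row in head]
    return sum(ents) / len(ents)

def entropy_std(attn):
    """Population standard deviation of the per-head mean entropy."""
    ents = [head_entropy(h) for h in attn]
    mu = sum(ents) / len(ents)
    return math.sqrt(sum((e - mu) ** 2 for e in ents) / len(ents))

# Toy attention maps (assumed shape: heads x queries x keys, rows sum to 1).
uniform = [[0.5, 0.5], [0.5, 0.5]]   # every query attends evenly
peaked = [[1.0, 0.0], [0.0, 1.0]]    # every query attends to one key
identical = [uniform, uniform]       # two heads doing the same thing
mixed = [uniform, peaked]            # two heads that have differentiated

print(mean_pairwise_head_cosine(identical), entropy_std(identical))
print(mean_pairwise_head_cosine(mixed), entropy_std(mixed))
```

Under this reading, identical heads give a similarity of 1 and an entropy spread of 0, and both statistics move away from those values as heads differentiate. Both need only a forward pass, which is the sense in which the diagnostics are cheap.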
If this is right
- The memorization-to-developmental boundary localizes at λ_c=0.0158 with 95% CI [0.0109, 0.0200] and empirical power-law exponent ν=0.757.
- Attention-head re-initialization at λ=0.05 alters Phase-2 amplitude while matched weight-norm clipping does not.
- The weight-decay control pattern is preserved in a horizon-matched multi-task replication across four modular operations.
- Cross-architecture probes with 4L MLP, 4L LSTM, and 4L Mamba each replicate the weight-decay-controlled transition, though with architecture-specific λ_c values.
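To make "localizing the boundary with a logistic fit" concrete, here is a small sketch under stated assumptions. It models the probability that a run lands in the developmental regime as a logistic function of log10(λ) and reads off λ_c as the point where that probability is 0.5. The parameterization, the Newton-Raphson fitting routine, and the toy outcomes are all hypothetical; they are not the paper's data or its fitting code.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(xs, ys, iters=25):
    """Newton-Raphson fit of P(y=1 | x) = sigmoid(a + b*x)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        ga = gb = haa = hab = hbb = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            ga += y - p            # gradient of the log-likelihood
            gb += (y - p) * x
            haa += w               # entries of the negative Hessian
            hab += w * x
            hbb += w * x * x
        det = haa * hbb - hab * hab
        a += (hbb * ga - hab * gb) / det
        b += (haa * gb - hab * ga) / det
    return a, b

# Hypothetical outcomes per weight decay: 1 = developmental, 0 = memorization.
outcomes = {
    0.005: [0, 0, 0, 0],
    0.01: [0, 0, 0, 1],
    0.02: [0, 1, 1, 1],
    0.04: [1, 1, 1, 1],
}
lams = [lam for lam, ys in outcomes.items() for _ in ys]
labels = [y for ys in outcomes.values() for y in ys]

# Center log10(lambda) so the Newton steps are well conditioned.
logs = [math.log10(lam) for lam in lams]
center = sum(logs) / len(logs)
a, b = fit_logistic([x - center for x in logs], labels)

# The boundary is where a + b*x = 0, mapped back to lambda.
lambda_c = 10 ** (center - a / b)
print(lambda_c)
```

The toy outcomes are symmetric in log space, so the fitted midpoint falls at the geometric mean of 0.01 and 0.02 (about 0.0141). That number is a property of the toy data only and has no bearing on the reported λ_c=0.0158.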
Where Pith is reading between the lines
- The attention-based diagnostics offer a low-cost way to monitor regime shifts that could be tested on sequence tasks outside modular arithmetic.
- Tuning weight decay may serve as a practical lever for inducing generalization in other small attention models, though larger-scale tests remain open.
- The observed exponent invites future finite-size scaling work to check consistency with known universality classes such as 3D Ising.
Load-bearing premise
The assumption that the weight-decay axis alone separates the memorization, developmental grokking, and collapse regimes across the tested conditions and model scales in modular arithmetic tasks.
What would settle it
A replication on modular arithmetic where varying weight decay alone fails to produce the distinct regimes or the reported logistic transition boundary at λ_c=0.0158.
Original abstract
Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $\lambda_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $\nu=0.757$ (CI [0.725, 0.799]). Reference exponents $\nu=1/2$ and 3D Ising $\nu \approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $\nu$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $\lambda=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $\lambda_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.
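The paired effect size in the abstract can be illustrated with a short sketch. It assumes the paired form of Cohen's d (mean of the per-pair differences divided by their sample standard deviation); the abstract does not say which variant was used, and the toy numbers below are hypothetical. Under that assumption the paired t statistic is d times the square root of n, which allows a rough consistency check on the reported figures.

```python
import math
from statistics import mean, stdev

def paired_cohens_d(before, after):
    """Paired Cohen's d: mean difference over the sample SD of the differences."""
    diffs = [a - b for a, b in zip(after, before)]
    return mean(diffs) / stdev(diffs)

def paired_t(before, after):
    """Paired t statistic, which equals d * sqrt(n) for the paired form of d."""
    return paired_cohens_d(before, after) * math.sqrt(len(before))

# Hypothetical Phase-2 amplitudes for four matched runs (not the paper's data).
before = [1.0, 2.0, 3.0, 4.0]
after = [0.0, 2.0, 1.0, 3.0]
d = paired_cohens_d(before, after)
t = paired_t(before, after)
print(d, t)

# Consistency check on the reported values, assuming the paired form of d.
t_reported = -1.190 * math.sqrt(10)
print(t_reported)
```

With d=-1.190 and n=10 this gives t of roughly -3.76 on 9 degrees of freedom, which appears consistent with a two-sided p-value near the reported 4.5e-3. That agreement suggests, but does not confirm, that the paired form was the one used.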
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript examines how weight decay influences training regimes in transformers performing modular arithmetic, identifying distinct phases of memorization, developmental grokking, and collapse. It proposes that weight decay serves as a scalar control parameter and introduces two computationally inexpensive online diagnostics derived from attention activations: mean pairwise head cosine similarity and entropy standard deviation. The study reports results from eleven experimental conditions across three model scales, including a logistic fit for the critical weight decay λ_c = 0.0158 and a power-law exponent ν = 0.757, supported by multi-task replications, re-initialization experiments, and cross-architecture probes with MLP, LSTM, and Mamba models.
Significance. If the central claims hold, the work supplies practical, low-cost diagnostics for tracking grokking dynamics directly from attention activations, complementing more expensive loss-landscape methods. The statistical fits with confidence intervals, replications across conditions and scales, and explicit scoping of claims to modular arithmetic in small attention models constitute clear strengths. The empirical treatment of the exponent ν (with reference values outside the CI) and the paired re-initialization results (Cohen's d and p-value) add rigor to the evidence that weight decay can be used as a control knob within the tested domain.
major comments (1)
- The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.
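The interaction described in this comment can be made concrete with a small sketch. It assumes the decoupled update used by common AdamW implementations, where each step multiplies the weights by (1 - lr * weight_decay) before the adaptive gradient step, and it isolates that shrinkage by setting the gradient term to zero. The learning rates and step count are hypothetical, since the paper's optimizer settings are not given on this page.

```python
def decay_only_norm(w0, lr, weight_decay, steps):
    """Weight magnitude after `steps` updates when the gradient term is zero.

    Assumes the decoupled form w <- w * (1 - lr * weight_decay) per step.
    """
    w = w0
    for _ in range(steps):
        w *= 1.0 - lr * weight_decay
    return w

lam = 0.0158  # the reported lambda_c, used here only as an example value

# Same weight decay, different (hypothetical) learning rates.
a = decay_only_norm(1.0, lr=1e-3, weight_decay=lam, steps=10_000)
b = decay_only_norm(1.0, lr=3e-3, weight_decay=lam, steps=10_000)

# Different weight decay, but the same product lr * weight_decay as case a.
c = decay_only_norm(1.0, lr=3e-3, weight_decay=lam / 3, steps=10_000)

print(a, b, c)
```

In this simplified setting the shrinkage depends on the product of learning rate and weight decay, not on λ alone: cases a and c match while b shrinks faster. This is one reason a boundary expressed purely in λ may be tied to the learning rate it was measured at.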
minor comments (2)
- The four-bin grid underlying the power-law fit and the construction of the 95% CI for ν are referenced in the abstract but the exact binning procedure, bin edges, and sensitivity checks are not detailed; adding this in the methods would improve reproducibility of the reported exponent and its comparison to ν=1/2 and the 3D Ising value.
- In the cross-architecture probes, architecture-specific λ_c values are stated but without a direct side-by-side table or discussion of the magnitude of differences relative to the transformer case; a concise comparison would clarify the scope of the replication.
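The reproducibility concern about the exponent can be illustrated with a sketch of one common procedure: a least-squares slope in log-log space with a percentile bootstrap interval. Everything specific here is an assumption, including the observable being fitted, the use of the distance from λ_c as the abscissa, the bin values, the noise level, and the bootstrap scheme. The page does not state how the paper constructed its fit or its interval.

```python
import math
import random

def fit_exponent(ts, ys):
    """Least-squares slope of log(y) against log(t): the nu in y = c * t**nu."""
    lx = [math.log(t) for t in ts]
    ly = [math.log(y) for y in ys]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    sxx = sum((x - mx) ** 2 for x in lx)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lx, ly))
    return sxy / sxx

def bootstrap_ci(ts, ys, n_boot=2000, seed=0):
    """Percentile 95% interval for the slope, resampling points with replacement."""
    rng = random.Random(seed)
    n = len(ts)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len({ts[i] for i in idx}) < 2:
            continue  # slope is undefined when every resampled point shares one bin
        slopes.append(fit_exponent([ts[i] for i in idx], [ys[i] for i in idx]))
    slopes.sort()
    m = len(slopes)
    return slopes[int(0.025 * (m - 1))], slopes[int(0.975 * (m - 1))]

# Four hypothetical bins of distance from the boundary, t = lambda - lambda_c.
bins = [0.01, 0.02, 0.04, 0.08]

# Noise-free check: the fit recovers the exponent used to generate the data.
clean = [2.0 * t ** 0.75 for t in bins]
nu_clean = fit_exponent(bins, clean)

# Noisy synthetic data, five points per bin, to get a non-trivial interval.
noise = random.Random(1)
ts = [t for t in bins for _ in range(5)]
ys = [2.0 * t ** 0.75 * math.exp(noise.gauss(0.0, 0.05)) for t in ts]
nu_hat = fit_exponent(ts, ys)
lo, hi = bootstrap_ci(ts, ys)
print(nu_clean, nu_hat, (lo, hi))
```

With only four distinct bins the slope is pinned by very few abscissa values, so the interval can be sensitive to where the bin edges fall. Publishing the edges and a sensitivity check would let readers judge the comparison with ν=1/2 and the 3D Ising value.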
Simulated Author's Rebuttal
We thank the referee for the constructive comment on the scope of our central claim. We agree that fixing the learning rate and AdamW β parameters limits the generality of describing weight decay as a fully independent scalar control parameter, and we will revise the manuscript to make this explicit.
Referee: The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.
Authors: We agree with the referee's assessment. Our experimental design holds the base learning rate and AdamW β1, β2 fixed across all conditions (as stated in the methods), so the effective regularization strength is indeed an interplay rather than a pure function of λ alone. Consequently, the reported λ_c and regime boundaries are specific to these optimizer settings. We will revise the title, abstract, and the opening of Section 1 to qualify the claim as applying 'under fixed learning rate and AdamW hyperparameters.' A brief note will be added to the discussion acknowledging that co-varying the learning rate with weight decay remains unablated and is outside the current scope. This revision preserves the empirical utility of the diagnostics and fits within the tested domain while accurately reflecting the experimental controls.
Revision: yes
Circularity Check
No significant circularity; empirical fits and diagnostics are data-driven without reduction to inputs
The paper reports experimental results on weight-decay regimes in small transformers trained on modular arithmetic. The logistic localization of λ_c=0.0158 and power-law exponent ν=0.757 are obtained from direct fits to observed phase boundaries across 210+ runs and multiple conditions; these are explicitly labeled empirical with confidence intervals and no universality claim. The two online diagnostics (mean pairwise attention-head cosine similarity; entropy standard deviation) are defined from attention activations and validated against loss-landscape measures without any equation that equates them to the fitted parameters by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text. The work is self-contained as an empirical scoping study; the central claim that weight decay separates regimes is tested rather than presupposed.
Axiom & Free-Parameter Ledger
free parameters (2)
- λ_c = 0.0158
- ν = 0.757
axioms (1)
- Domain assumption: modular arithmetic tasks exhibit the described sharp transitions between memorization, generalization, and collapse under transformer training.