
arxiv: 2605.20441 · v1 · pith:65EOGNL3new · submitted 2026-05-19 · 💻 cs.LG · cs.AI · cs.NE

Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Pith reviewed 2026-05-21 07:26 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.NE
keywords weight decay, grokking, transformers, modular arithmetic, attention diagnostics, training regimes, online monitoring, phase transitions

The pith

Weight decay acts as a scalar control parameter that separates memorization, developmental grokking, and collapse regimes in small transformers trained on modular arithmetic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers trained on modular arithmetic show sharp transitions between memorizing training examples, generalizing to unseen cases through grokking, and collapsing into poor performance. The paper shows that a single scalar, the weight decay coefficient, can be adjusted to select which of these regimes occurs during training. It introduces two low-cost online diagnostics computed only from attention activations: the mean pairwise cosine similarity across attention heads and the standard deviation of entropy values. These monitors track the transitions in real time and complement more expensive loss-landscape analyses. The separation holds across eleven experimental conditions and three model scales ranging from 0.82M to 85M parameters, with a logistic fit localizing the main transition boundary.

Core claim

Weight decay acts as a scalar empirical control parameter for the memorization, developmental grokking, and collapse regimes in transformers trained on modular arithmetic. Two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales, the weight-decay axis separates the regimes, with a near-transition logistic fit localizing the memorization-to-developmental boundary at λ_c=0.0158 and a power-law fit giving an empirical exponent ν=0.757.

What carries the argument

Weight decay as a scalar empirical control parameter for training regimes, together with mean pairwise attention-head cosine similarity and entropy standard deviation as cheap online diagnostics computed from attention activations alone.
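
As a rough illustration of how the two attention-based diagnostics could be computed, the following is a minimal sketch, not the paper's implementation. The function names are hypothetical, and three choices are assumptions made here because the text above only names the diagnostics: each head's attention map is flattened into one vector before the cosine similarity is taken, entropy is averaged over query rows within a head, and the standard deviation is the population standard deviation across heads.

```python
import math

def head_cosine_similarity(attn):
    """Mean pairwise cosine similarity between heads.

    attn: list of H heads; each head is a list of rows (query positions),
    and each row is a probability distribution over key positions.
    Flattening each head into one vector is an assumption of this sketch.
    """
    flat = [[p for row in head for p in row] for head in attn]
    norms = [math.sqrt(sum(x * x for x in v)) for v in flat]
    sims = []
    for i in range(len(flat)):
        for j in range(i + 1, len(flat)):
            dot = sum(a * b for a, b in zip(flat[i], flat[j]))
            sims.append(dot / (norms[i] * norms[j]))
    return sum(sims) / len(sims)

def entropy_std(attn):
    """Population std across heads of the mean per-row Shannon entropy (nats)."""
    def row_entropy(row):
        return -sum(p * math.log(p) for p in row if p > 0.0)
    per_head = [sum(row_entropy(r) for r in head) / len(head) for head in attn]
    mean = sum(per_head) / len(per_head)
    return math.sqrt(sum((h - mean) ** 2 for h in per_head) / len(per_head))

# Toy input (invented): two identical uniform heads and one sharply peaked head.
uniform = [[0.5, 0.5], [0.5, 0.5]]
peaked = [[1.0, 0.0], [0.0, 1.0]]
attn = [uniform, uniform, peaked]
print(head_cosine_similarity(attn), entropy_std(attn))
```

Both functions read only attention probabilities from a forward pass, which is the sense in which such monitors are cheap compared with loss-landscape analyses. High similarity with low entropy spread would indicate heads behaving alike, and the reverse would indicate differentiated heads.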

If this is right

  • The memorization-to-developmental boundary localizes at λ_c=0.0158 with 95% CI [0.0109, 0.0200] and empirical power-law exponent ν=0.757.
  • Attention-head re-initialization at λ=0.05 alters Phase-2 amplitude while matched weight-norm clipping does not.
  • The weight-decay control pattern is preserved in a horizon-matched multi-task replication across four modular operations.
  • Cross-architecture probes with 4L MLP, LSTM, and Mamba each replicate the weight-decay-controlled transition, though with architecture-specific λ_c values.
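
The logistic localization of the boundary in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's fitting code: the regressor is taken to be log10(λ), λ_c is taken to be the 50% point of the fitted curve, and the grok counts are invented for the example. `fit_logistic` is a hypothetical helper that runs Newton's method on a two-parameter logistic regression.

```python
import math

def fit_logistic(xs, ys, iters=50):
    """Newton's method for P(y=1) = sigmoid(a + b*x); returns (a, b)."""
    a, b = 0.0, 0.0
    for _ in range(iters):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(a + b * x)))
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood w.r.t. a
            g1 += (y - p) * x      # gradient w.r.t. b
            h00 += w               # entries of the negative Hessian
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b

# Invented data: 10 runs per weight-decay value, with grok counts per value.
log_lams = [-3.0, -2.5, -2.0, -1.5, -1.0, 0.0]   # hypothetical log10(lambda) grid
groks = [0, 1, 3, 8, 10, 10]                      # hypothetical grok counts
xs, ys = [], []
for x, k in zip(log_lams, groks):
    xs += [x] * 10
    ys += [1] * k + [0] * (10 - k)

a, b = fit_logistic(xs, ys)
log_lam_c = -a / b                                # 50% point of the fitted curve
print(f"lambda_c ~ {10 ** log_lam_c:.4f}")
```

A confidence interval such as the one quoted above would need an additional step, for example resampling runs and refitting; the sketch stops at the point estimate.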

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The attention-based diagnostics offer a low-cost way to monitor regime shifts that could be tested on sequence tasks outside modular arithmetic.
  • Tuning weight decay may serve as a practical lever for inducing generalization in other small attention models, though larger-scale tests remain open.
  • The observed exponent invites denser finite-size scaling work; under the paper's four-bin grid, the reference values ν=1/2 and 3D Ising ν≈0.63 lie outside the empirical CI, so no universality class is assigned yet.

Load-bearing premise

The assumption that the weight-decay axis alone separates the memorization, developmental grokking, and collapse regimes across the tested conditions and model scales in modular arithmetic tasks.

What would settle it

A replication on modular arithmetic where varying weight decay alone fails to produce the distinct regimes or the reported logistic transition boundary at λ_c=0.0158.

Figures

Figures reproduced from arXiv: 2605.20441 by Lucky Verma.

Figure 1: Two-axis empirical regime diagram across …
Figure 2: Canonical two-phase dynamics at λ=1.0 over n=50 replicated 4L8H runs. Solid curves show cohort medians; bands show interquartile ranges. Phase 1 (yellow shading): synchronization + grokking. Phase 2 (blue shading): differentiation + test-accuracy plateau. The canonical seed-42 trace (…
Figure 3: Long-horizon canonical seed-42 trajectory with a 4-seed matched underlay ({7, 11, 31, 123}) at 4L8H mod-add λ=1.0 20 000-epoch configuration. Bold colored traces (canonical seed 42): raw per-epoch values at low alpha plus moving-average-smoothed overlay. Gray traces: smoothed cross-seed trajectories underlaid to disclose cohort variability. P1–P5 bands are canonical seed-42 landmark windows, not per-seed f…
Figure 4: Peak σ_H vs per-head dimension d/H. Saturating-exponential AIC-preferred. Dotted line: random-label null-control scale reference, not an equivalence claim.
Figure 5: (A) Logistic P(grok) vs weight decay across N=210 runs post Phase A. Points are WD bins with Wilson 95% CIs, the orange curve is the logistic fit, and the dashed vertical line marks λ_c=0.0158 (95% CI [0.0109, 0.0200]). (B) Power-law divergence of t_grok above λ_c; ν=0.757, CI [0.725, 0.799]. Tested reference exponents such as ν=1/2 and 3D Ising ν≈0.63 are outside CI under the four-bin grid; we do not identif…
Figure 6: Permutation-symmetry test via affine-normalized …
Figure 7: ESD heavy-tail exponent α (Weightwatcher, layer-median) across the 11 canonical-trajectory checkpoints (seed 42 only). Cross-seed Weightwatcher at the matched 11-checkpoint grid is deferred; a coarser 3-seed supplementary trace provides a qualitative onset check. α drops from 2.07 at random init to 1.39 at Phase 1 grokking onset and remains in the heavy-tail regime (α < 2) through Phase 5. The third-phase …
Figure 8: Causal-intervention paired forest plot for peak …
Figure 9: Multi-task grok-rate heatmap (4 modular operations …
Figure 10: Cross-architecture transition comparison from logistic grok-rate fits. Bars show …
Figure 11: (a) Ω_total(t) Frobenius-norm trajectories for 5 cross-seed cohorts (canonical seed 42 + cross-seed cohort seeds 7, 11, 31, 123) at the same architecture, training λ=1.0, 20 000 epochs. Dashed lines show single-exponential AdamW-relaxation fits restricted to the M→G transition window (t ≤ 5 000, vertical gray). Late-stage cycle visible at t>10 000 explains the full-fit bimodal κ. (b) Early-window residuals…
Original abstract

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $\lambda_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $\nu=0.757$ (CI [0.725, 0.799]). Reference exponents $\nu=1/2$ and 3D Ising $\nu \approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $\nu$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $\lambda=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $\lambda_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.
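
The paired re-initialization statistics in the abstract can be cross-checked with a short calculation. The sketch below assumes that the reported Cohen's d is the paired variant d_z (mean of the paired differences divided by their sample standard deviation), which the abstract does not state. Under that assumption the paired t statistic is d_z·√n. `paired_effect` is a hypothetical helper and the before/after values are invented.

```python
import math
import statistics

def paired_effect(before, after):
    """Return (d_z, t) for paired samples: d_z = mean(diff) / sd(diff), t = d_z * sqrt(n)."""
    diffs = [a - b for a, b in zip(after, before)]
    d_z = statistics.mean(diffs) / statistics.stdev(diffs)
    t = d_z * math.sqrt(len(diffs))
    return d_z, t

# Invented paired measurements (e.g. a Phase-2 amplitude before and after an intervention).
before = [1.0, 1.2, 0.9, 1.1]
after = [0.8, 0.9, 0.8, 0.7]
d_z, t = paired_effect(before, after)
print(f"d_z = {d_z:.3f}, t = {t:.3f}")

# Consistency check on the reported numbers, assuming d is d_z with n = 10:
t_reported = -1.190 * math.sqrt(10)
print(f"implied paired t = {t_reported:.3f} with 9 degrees of freedom")
```

If the d_z reading is correct, the implied t of about −3.76 on 9 degrees of freedom corresponds to a two-sided p-value of a few times 10⁻³, which is in line with the reported p_t. This is an inference from the quoted numbers, not a statement made by the authors.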

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript examines how weight decay influences training regimes in transformers performing modular arithmetic, identifying distinct phases of memorization, developmental grokking, and collapse. It proposes that weight decay serves as a scalar control parameter and introduces two computationally inexpensive online diagnostics derived from attention activations: mean pairwise head cosine similarity and entropy standard deviation. The study reports results from eleven experimental conditions across three model scales, including a logistic fit for the critical weight decay λ_c = 0.0158 and a power-law exponent ν = 0.757, supported by multi-task replications, re-initialization experiments, and cross-architecture probes with MLPs, LSTMs, and Mambas.

Significance. If the central claims hold, the work supplies practical, low-cost diagnostics for tracking grokking dynamics directly from attention activations, complementing more expensive loss-landscape methods. The statistical fits with confidence intervals, replications across conditions and scales, and explicit scoping of claims to modular arithmetic in small attention models constitute clear strengths. The empirical treatment of the exponent ν (with reference values outside the CI) and the paired re-initialization results (Cohen's d and p-value) add rigor to the evidence that weight decay can be used as a control knob within the tested domain.

major comments (1)
  1. The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.
minor comments (2)
  1. The four-bin grid underlying the power-law fit and the construction of the 95% CI for ν are referenced in the abstract but the exact binning procedure, bin edges, and sensitivity checks are not detailed; adding this in the methods would improve reproducibility of the reported exponent and its comparison to ν=1/2 and the 3D Ising value.
  2. In the cross-architecture probes, architecture-specific λ_c values are stated but without a direct side-by-side table or discussion of the magnitude of differences relative to the transformer case; a concise comparison would clarify the scope of the replication.
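
The first minor comment asks how the exponent and its CI are constructed. One plausible procedure, offered only as a sketch, is an ordinary least-squares fit in log-log space followed by a percentile bootstrap over runs. Everything specific here is an assumption: the sign convention t_grok ∝ (λ − λ_c)^(−ν), the four λ values, the five runs per bin, and the noise model are all hypothetical, and the helper names `fit_exponent` and `bootstrap_ci` are invented.

```python
import math
import random

def fit_exponent(lams, t_grok, lam_c):
    """OLS slope of log t_grok on log(lam - lam_c); returns nu = -slope."""
    xs = [math.log(l - lam_c) for l in lams]
    ys = [math.log(t) for t in t_grok]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return -sxy / sxx

def bootstrap_ci(lams, t_grok, lam_c, n_boot=2000, seed=0):
    """Percentile 95% CI for nu from resampling runs with replacement."""
    rng = random.Random(seed)
    n = len(lams)
    nus = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        if len({lams[i] for i in idx}) < 2:
            continue  # slope is undefined if every resampled lambda is the same
        nus.append(fit_exponent([lams[i] for i in idx],
                                [t_grok[i] for i in idx], lam_c))
    nus.sort()
    return nus[int(0.025 * len(nus))], nus[min(len(nus) - 1, int(0.975 * len(nus)))]

lam_c = 0.0158                        # boundary value quoted in the report
lams = [0.05, 0.1, 0.3, 1.0] * 5      # hypothetical four-bin grid, 5 runs per bin
rng = random.Random(1)
# Synthetic grokking times with true exponent 0.75 and multiplicative noise.
t_grok = [100.0 * (l - lam_c) ** -0.75 * math.exp(rng.gauss(0.0, 0.1)) for l in lams]

nu_hat = fit_exponent(lams, t_grok, lam_c)
lo, hi = bootstrap_ci(lams, t_grok, lam_c)
print(f"nu = {nu_hat:.3f}, 95% bootstrap CI [{lo:.3f}, {hi:.3f}]")
```

With only four distinct λ values, the interval reflects run-to-run noise much more than uncertainty about the functional form, which is one reason the bin edges and sensitivity checks matter for comparing ν against reference exponents.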

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the scope of our central claim. We agree that fixing the learning rate and AdamW β parameters limits the generality of describing weight decay as a fully independent scalar control parameter, and we will revise the manuscript to make this explicit.

Point-by-point responses
  1. Referee: The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.

    Authors: We agree with the referee's assessment. Our experimental design holds the base learning rate and AdamW β1, β2 fixed across all conditions (as stated in the methods), so the effective regularization strength is indeed an interplay rather than a pure function of λ alone. Consequently, the reported λ_c and regime boundaries are specific to these optimizer settings. We will revise the title, abstract, and the opening of Section 1 to qualify the claim as applying 'under fixed learning rate and AdamW hyperparameters.' A brief note will be added to the discussion acknowledging that co-varying the learning rate with weight decay remains unablated and is outside the current scope. This revision preserves the empirical utility of the diagnostics and fits within the tested domain while accurately reflecting the experimental controls.

    revision: yes
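
The interplay between λ and the learning rate discussed in this exchange can be made concrete with a scalar AdamW update. The sketch below follows the common decoupled-decay formulation, in which the weight is multiplied by (1 − lr·λ) at each step, separately from the adaptive gradient term. The learning rates and λ values are illustrative only and are not the paper's settings, and `adamw_step` and `decay_only` are hypothetical helpers.

```python
import math

def adamw_step(w, grad, m, v, t, lr, lam, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update for a scalar weight with decoupled weight decay."""
    w = w * (1.0 - lr * lam)                     # decay term: shrink by lr*lam
    m = beta1 * m + (1.0 - beta1) * grad         # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad  # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)               # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

def decay_only(lr, lam, steps=100):
    """Run AdamW with zero gradient to isolate the weight-decay shrinkage."""
    w, m, v = 1.0, 0.0, 0.0
    for t in range(1, steps + 1):
        w, m, v = adamw_step(w, 0.0, m, v, t, lr, lam)
    return w

w_a = decay_only(lr=1e-3, lam=1.0)   # illustrative settings, same lr*lam product
w_b = decay_only(lr=1e-2, lam=0.1)
print(w_a, w_b)
```

The two runs shrink the weight by the same amount because only the product lr·λ enters the decay term. The zero-gradient case is a simplification, since real training also involves the adaptive step, but it shows why a boundary expressed in λ alone is tied to the learning rate that was held fixed.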

Circularity Check

0 steps flagged

No significant circularity; empirical fits and diagnostics are data-driven without reduction to inputs

Full rationale

The paper reports experimental results on weight-decay regimes in small transformers trained on modular arithmetic. The logistic localization of λ_c=0.0158 and power-law exponent ν=0.757 are obtained from direct fits to observed phase boundaries across 210+ runs and multiple conditions; these are explicitly labeled empirical with confidence intervals and no universality claim. The two online diagnostics (mean pairwise attention-head cosine similarity; entropy standard deviation) are defined from attention activations and validated against loss-landscape measures without any equation that equates them to the fitted parameters by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text. The work is self-contained as an empirical scoping study; the central claim that weight decay separates regimes is tested rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim relies on two fitted parameters from data analysis and the domain assumption about the behavior of modular arithmetic tasks in transformers.

free parameters (2)
  • λ_c = 0.0158
    Fitted via near-transition logistic fit to localize the memorization-to-developmental boundary from N=210 data points.
  • ν = 0.757
    Fitted exponent from power-law to the transition data, with CI [0.725, 0.799].
axioms (1)
  • domain assumption: Modular arithmetic tasks exhibit the described sharp transitions between memorization, generalization, and collapse under transformer training.
    This is the foundational observation the control parameter is applied to.

pith-pipeline@v0.9.0 · 5895 in / 1398 out tokens · 59917 ms · 2026-05-21T07:26:48.643755+00:00 · methodology

