Weight Decay Regimes in Grokking Transformers: Cheap Online Diagnostics

Lucky Verma

arxiv: 2605.20441 · v1 · pith:65EOGNL3new · submitted 2026-05-19 · 💻 cs.LG · cs.AI · cs.NE

Pith reviewed 2026-05-21 07:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.NE

keywords weight decay · grokking · transformers · modular arithmetic · attention diagnostics · training regimes · online monitoring · phase transitions

The pith

Weight decay acts as a scalar control parameter that separates memorization, developmental grokking, and collapse regimes in small transformers trained on modular arithmetic.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Transformers trained on modular arithmetic show sharp transitions between memorizing training examples, generalizing to unseen cases through grokking, and collapsing into poor performance. The paper shows that a single scalar, the weight decay coefficient, can be adjusted to select which of these regimes occurs during training. It introduces two low-cost online diagnostics computed only from attention activations: the mean pairwise cosine similarity across attention heads and the standard deviation of entropy values. These monitors track the transitions in real time and complement more expensive loss-landscape analyses. The separation holds across eleven experimental conditions and three model scales ranging from 0.82M to 85M parameters, with a logistic fit localizing the main transition boundary.

Core claim

Weight decay acts as a scalar empirical control parameter for the memorization, developmental grokking, and collapse regimes in transformers trained on modular arithmetic. Two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales, the weight-decay axis separates the regimes, with a near-transition logistic fit localizing the memorization-to-developmental boundary at λ_c=0.0158 and a power-law fit giving an empirical exponent ν=0.757.

What carries the argument

Weight decay as a scalar empirical control parameter for training regimes, together with mean pairwise attention-head cosine similarity and entropy standard deviation as cheap online diagnostics computed from attention activations alone.
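
The two diagnostics named above can be sketched in a few lines. This is a minimal illustration, assuming one layer's attention is available as nested lists of shape `[heads][queries][keys]` with each row a probability distribution. Flattening each head's map into a single vector, averaging row entropies per head, and the function names are all assumptions made for this sketch; the paper's exact definitions may differ.

```python
import math
from itertools import combinations
from statistics import mean, pstdev


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def mean_pairwise_head_cosine(attn):
    """Mean cosine similarity over all unordered pairs of heads.

    attn: [heads][queries][keys], at least two heads.
    Assumption: each head's attention map is flattened into one vector.
    """
    flat = [[p for row in head for p in row] for head in attn]
    return mean(cosine(u, v) for u, v in combinations(flat, 2))


def row_entropy(row):
    """Shannon entropy (in nats) of one attention row, skipping zero entries."""
    return -sum(p * math.log(p) for p in row if p > 0)


def entropy_std(attn):
    """Population standard deviation, across heads, of mean row entropy.

    Assumption: entropy is averaged over query rows within each head first.
    """
    per_head = [mean(row_entropy(row) for row in head) for head in attn]
    return pstdev(per_head)


# Toy input: one uniform head and one one-hot head (illustrative, not paper data).
toy_attn = [
    [[0.5, 0.5], [0.5, 0.5]],
    [[1.0, 0.0], [0.0, 1.0]],
]
```

Both functions read only attention probabilities, so in this sketch they would cost one pass over tensors that the forward pass already produces.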

If this is right

The memorization-to-developmental boundary localizes at λ_c=0.0158 with 95% CI [0.0109, 0.0200] and empirical power-law exponent ν=0.757.
Attention-head re-initialization at λ=0.05 alters Phase-2 amplitude while matched weight-norm clipping does not.
The weight-decay control pattern is preserved in a horizon-matched multi-task replication across four modular operations.
Cross-architecture probes with 4L MLP, LSTM, and Mamba each replicate the weight-decay-controlled transition, though with architecture-specific λ_c values.
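
To make the role of λ_c in the first item concrete, here is a hedged sketch of how a boundary is read off a logistic fit. It assumes the logistic is parameterized in log10 of the weight decay, which is a guess suggested by the log-scaled axis in Figure 5, and the steepness `K` is a made-up placeholder rather than a value from the paper. Only `LAM_C = 0.0158` comes from the text.

```python
import math


def p_grok(lam, lam_c, k):
    """Logistic grok probability in log10(weight decay); equals 0.5 at lam == lam_c."""
    z = k * (math.log10(lam) - math.log10(lam_c))
    return 1.0 / (1.0 + math.exp(-z))


def boundary_from_coeffs(intercept, slope):
    """Midpoint lam_c for a fitted model logit(P) = intercept + slope * log10(lam)."""
    return 10.0 ** (-intercept / slope)


# LAM_C is the boundary reported in the text; K is a hypothetical steepness.
LAM_C, K = 0.0158, 4.0
intercept = -K * math.log10(LAM_C)  # coefficients a fit of this form would return
```

Under this parameterization the boundary is simply the weight decay at which the fitted probability crosses one half, so any standard logistic-regression routine that returns an intercept and slope can be converted to λ_c with `boundary_from_coeffs`.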

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The attention-based diagnostics offer a low-cost way to monitor regime shifts that could be tested on sequence tasks outside modular arithmetic.
Tuning weight decay may serve as a practical lever for inducing generalization in other small attention models, though larger-scale tests remain open.
The observed exponent invites future finite-size scaling work to check consistency with known universality classes such as 3D Ising.

Load-bearing premise

The assumption that the weight-decay axis alone separates the memorization, developmental grokking, and collapse regimes across the tested conditions and model scales in modular arithmetic tasks.

What would settle it

A replication on modular arithmetic where varying weight decay alone fails to produce the distinct regimes or the reported logistic transition boundary at λ_c=0.0158.

Figures

Figures reproduced from arXiv: 2605.20441 by Lucky Verma.

**Figure 2.** Canonical two-phase dynamics at λ=1.0 over n=50 replicated 4L8H runs. Solid curves show cohort medians; bands show interquartile ranges. Phase 1 (yellow shading): synchronization + grokking. Phase 2 (blue shading): differentiation + test-accuracy plateau. The canonical seed-42 trace (…

**Figure 3.** Long-horizon canonical seed-42 trajectory with a 4-seed matched underlay ({7, 11, 31, 123}) at 4L8H mod-add λ=1.0 20 000-epoch configuration. Bold colored traces (canonical seed 42): raw per-epoch values at low alpha plus moving-average-smoothed overlay. Gray traces: smoothed cross-seed trajectories underlaid to disclose cohort variability. P1–P5 bands are canonical seed-42 landmark windows, not per-seed f…

**Figure 4.** Peak σ_H vs per-head dimension d/H. The saturating-exponential fit is AIC-preferred. Dotted line: random-label null-control scale reference, not an equivalence claim.

**Figure 5.** (A) Logistic P(grok) vs weight decay across N=210 runs post Phase A. Points are WD bins with Wilson 95% CIs, the orange curve is the logistic fit, and the dashed vertical line marks λ_c=0.0158 (95% CI [0.0109, 0.0200]). (B) Power-law divergence of t_grok above λ_c; ν=0.757, CI [0.725, 0.799]. Tested reference exponents such as ν=1/2 and 3D Ising ν≈0.63 are outside CI under the four-bin grid; we do not identif…

**Figure 6.** Permutation-symmetry test via affine-normalized …

**Figure 7.** ESD heavy-tail exponent α (Weightwatcher, layer-median) across the 11 canonical-trajectory checkpoints (seed 42 only). Cross-seed Weightwatcher at the matched 11-checkpoint grid is deferred; a coarser 3-seed supplementary trace provides a qualitative onset check. α drops from 2.07 at random init to 1.39 at Phase 1 grokking onset and remains in the heavy-tail regime (α < 2) through Phase 5. The third-phase …

**Figure 8.** Causal-intervention paired forest plot for peak …

**Figure 9.** Multi-task grok-rate heatmap (4 modular operations …

**Figure 10.** Cross-architecture transition comparison from logistic grok-rate fits. Bars show …

**Figure 11.** (a) Ω_total(t) Frobenius-norm trajectories for 5 cross-seed cohorts (canonical seed 42 + cross-seed cohort seeds 7, 11, 31, 123) at the same architecture, training λ=1.0, 20 000 epochs. Dashed lines show single-exponential AdamW-relaxation fits restricted to the M→G transition window (t ≤ 5 000, vertical gray). Late-stage cycle visible at t>10 000 explains the full-fit bimodal κ. (b) Early-window residuals…
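
The Figure 5 caption reports Wilson 95% confidence intervals for the binned grok rates. For readers unfamiliar with that interval, the standard Wilson score formula is sketched below. The function name and the example counts are illustrative only and are not taken from the paper's data.

```python
import math


def wilson_interval(successes, n, z=1.959963984540054):
    """Wilson score interval for a binomial proportion.

    successes: number of runs that grokked in a bin (hypothetical usage)
    n: total runs in the bin
    z: normal quantile; the default corresponds to a two-sided 95% interval
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Example with made-up counts: 5 grokking runs out of 10 in one weight-decay bin.
low, high = wilson_interval(5, 10)  # roughly (0.237, 0.763)
```

Unlike the simple normal-approximation interval, the Wilson interval stays inside [0, 1] even when a bin has zero or all runs grokking, which matters for bins far from the transition.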

Original abstract

Transformers trained on modular arithmetic exhibit sharp transitions between memorization, generalization, and collapse. We show that weight decay acts as a scalar empirical control parameter for these regimes, and introduce two cheap online diagnostics, mean pairwise attention-head cosine similarity and entropy standard deviation, that track training dynamics from attention activations alone and complement loss-landscape diagnostics at lower compute cost. Across eleven experimental conditions and three model scales (0.82M to 85M parameters), the weight-decay axis separates memorization, developmental grokking, and collapse. A near-transition logistic fit localizes the memorization-to-developmental boundary at $\lambda_c=0.0158$ (95% CI [0.0109, 0.0200], N=210); a power-law fit gives an empirical exponent $\nu=0.757$ (CI [0.725, 0.799]). Reference exponents $\nu=1/2$ and 3D Ising $\nu \approx 0.63$ lie outside this empirical CI under our four-bin grid, so we report $\nu$ as empirical and defer universality-class identification to denser finite-size-scaling work. A horizon-matched multi-task replication (n=280, four modular operations) preserves the weight-decay control pattern; a paired attention-head re-initialization experiment at $\lambda=0.05$ changes Phase-2 amplitude (Cohen's $d=-1.190$, n=10, $p_t=4.5 \times 10^{-3}$), while matched weight-norm clipping does not. Three cross-architecture probes (4L MLP, 4L LSTM, and 4L Mamba; each n=70) replicate the weight-decay-controlled transition with architecture-specific $\lambda_c$ values. Main diagnostic claims are scoped to modular arithmetic in small transformer attention models; the non-attention experiments are scope probes, and architecture-wide, language-model, and universality-class claims are out of scope.
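
As a hedged illustration of how an exponent like ν can be estimated, the sketch below fits a power law by ordinary least squares in log-log space. It assumes the functional form t_grok = A · (λ − λ_c)^(−ν), which is one reading of "power-law divergence of t_grok above λ_c"; the abstract does not spell out the form or the estimator, and the function name, the prefactor, and the synthetic grid are assumptions.

```python
import math


def fit_power_law_exponent(lams, t_grok, lam_c):
    """Least-squares fit of log t = log A - nu * log(lam - lam_c).

    Returns (nu, A). Every lam must exceed lam_c, and at least two
    distinct lam values are needed.
    """
    if any(lam <= lam_c for lam in lams):
        raise ValueError("all weight-decay values must lie above lam_c")
    xs = [math.log(lam - lam_c) for lam in lams]
    ys = [math.log(t) for t in t_grok]
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return -slope, math.exp(y_bar - slope * x_bar)


# Synthetic, noise-free check data generated from the assumed form
# (four points chosen for illustration; not the paper's four-bin grid).
demo_lams = [0.05, 0.1, 0.3, 1.0]
demo_t = [200.0 * (lam - 0.0158) ** (-0.757) for lam in demo_lams]
nu_hat, a_hat = fit_power_law_exponent(demo_lams, demo_t, lam_c=0.0158)
```

With only a handful of bins the fitted slope is sensitive to where the bin edges fall and to the value of λ_c plugged in, which is one reason the abstract defers universality-class identification to denser scans.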

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Weight decay separates the regimes in these small transformer grokking runs on modular arithmetic, with two new attention diagnostics and a fitted critical value, but the scalar-control story rests on fixed learning rate and optimizer settings.

The main thing to know is that weight decay acts as a workable dial for moving between memorization, developmental grokking, and collapse in these modular-arithmetic transformer experiments, and the authors supply two cheap online diagnostics based on attention-head cosine similarity and entropy standard deviation that track the transition from activations alone. They back this with a logistic fit on 210 runs that pins the boundary at λ_c=0.0158 with a 95% CI, plus a power-law exponent of 0.757, and they run replications across eleven conditions, three model scales, a four-task horizon-matched setup, and cross-architecture probes on MLP, LSTM, and Mamba. The re-initialization test at λ=0.05 shows a clear effect while weight-norm clipping does not, and they properly report CIs and scope the claims to small attention models on this task. That empirical breadth and the explicit scoping are the parts that hold up cleanly. The diagnostics add a lower-cost complement to loss-landscape methods, which is useful if it replicates.

The soft spot is exactly the one in the stress-test note: every sweep and replication keeps the learning rate and AdamW betas fixed. In AdamW the effective regularization is an interplay between λ, adaptive steps, and weight-norm trajectory, so the observed boundaries and the fitted λ_c could move if the base learning rate were co-varied. The paper does not ablate that interaction, which means the “scalar empirical control” claim is conditional on the fixed optimizer schedule they used. That is a real but contained limitation rather than a fatal flaw, and it does not undermine the scoped results they actually report.

The work is aimed at people studying grokking or looking for practical, low-compute ways to monitor generalization transitions in attention models. A reader who wants concrete numbers and simple attention-based trackers for similar small-scale setups would get direct value from the fits and the diagnostic definitions.

It has enough replications, statistical reporting, and honest scoping to deserve a serious referee rather than a desk reject. I would send it out for review and ask the authors to add a short learning-rate sensitivity check or to state more explicitly that the control holds only under the fixed AdamW schedule they tested.

Referee Report

1 major / 2 minor

Summary. The manuscript examines how weight decay influences training regimes in transformers performing modular arithmetic, identifying distinct phases of memorization, developmental grokking, and collapse. It proposes that weight decay serves as a scalar control parameter and introduces two computationally inexpensive online diagnostics derived from attention activations: mean pairwise head cosine similarity and entropy standard deviation. The study reports results from eleven experimental conditions across three model scales, including a logistic fit for the critical weight decay λ_c = 0.0158 and a power-law exponent ν = 0.757, supported by multi-task replications, re-initialization experiments, and cross-architecture probes with MLPs, LSTMs, and Mambas.

Significance. If the central claims hold, the work supplies practical, low-cost diagnostics for tracking grokking dynamics directly from attention activations, complementing more expensive loss-landscape methods. The statistical fits with confidence intervals, replications across conditions and scales, and explicit scoping of claims to modular arithmetic in small attention models constitute clear strengths. The empirical treatment of the exponent ν (with reference values outside the CI) and the paired re-initialization results (Cohen's d and p-value) add rigor to the evidence that weight decay can be used as a control knob within the tested domain.
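
For reference, a paired effect size of the kind cited here can be computed as below. This is a sketch under the assumption that the reported Cohen's d is the paired-differences variant (mean difference divided by the sample standard deviation of the differences); the summary does not state which variant the paper uses, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def paired_cohens_d(treated, control):
    """Paired-differences effect size: mean(diff) / sample std(diff).

    Assumption: this is the d_z variant; other variants normalize differently.
    """
    if len(treated) != len(control):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(treated, control)]
    return mean(diffs) / stdev(diffs)


def paired_t_statistic(treated, control):
    """Paired t statistic, which equals d_z * sqrt(n) for n pairs."""
    return paired_cohens_d(treated, control) * math.sqrt(len(treated))


# Made-up paired measurements, used only to exercise the functions.
example_d = paired_cohens_d([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
```

Under this definition, d = −1.190 with n = 10 pairs would correspond to a paired t statistic of about −3.76, since t = d · √n.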

major comments (1)

The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.
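
The interaction raised in this comment can be seen in a single decoupled-weight-decay update. The sketch below implements one AdamW-style step for a scalar parameter, following the commonly used form in which the weight is first multiplied by (1 − lr · λ) and then moved by the Adam direction. The default betas and epsilon are common library defaults, not the paper's settings, and the function name is illustrative.

```python
import math


def adamw_step(theta, grad, m, v, t, lr, wd, beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW-style step for a scalar parameter; t is the 1-based step count.

    Returns the updated (theta, m, v).
    """
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    theta = theta * (1 - lr * wd)               # decoupled decay: factor depends on lr * wd
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


# With zero gradient the step reduces to pure shrinkage by (1 - lr * wd).
shrunk, _, _ = adamw_step(1.0, 0.0, 0.0, 0.0, t=1, lr=1e-3, wd=0.0158)
```

Because the per-step shrink factor is 1 − lr · λ, halving the learning rate while doubling λ leaves the decay term unchanged but alters the gradient term, so a boundary fitted in λ at one learning rate need not transfer to another.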

minor comments (2)

The four-bin grid underlying the power-law fit and the construction of the 95% CI for ν are referenced in the abstract, but the exact binning procedure, bin edges, and sensitivity checks are not detailed; adding these to the methods would improve reproducibility of the reported exponent and its comparison to ν=1/2 and the 3D Ising value.
In the cross-architecture probes, architecture-specific λ_c values are stated but without a direct side-by-side table or discussion of the magnitude of differences relative to the transformer case; a concise comparison would clarify the scope of the replication.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive comment on the scope of our central claim. We agree that fixing the learning rate and AdamW β parameters limits the generality of describing weight decay as a fully independent scalar control parameter, and we will revise the manuscript to make this explicit.

Referee: The central claim that weight decay acts as a scalar empirical control parameter cleanly separating memorization, developmental grokking, and collapse regimes is load-bearing for the title and abstract. All eleven conditions and replications (multi-task, re-initialization, cross-architecture) keep the base learning rate and AdamW β parameters fixed. Because effective regularization in AdamW arises from the interplay of λ with adaptive step sizes and weight-norm trajectories, the fitted λ_c = 0.0158 and the regime boundaries could shift if the learning rate were co-varied; this interaction is not ablated and therefore limits the strength of the 'scalar' characterization even within the scoped domain.

Authors: We agree with the referee's assessment. Our experimental design holds the base learning rate and AdamW β1, β2 fixed across all conditions (as stated in the methods), so the effective regularization strength is indeed an interplay rather than a pure function of λ alone. Consequently, the reported λ_c and regime boundaries are specific to these optimizer settings. We will revise the title, abstract, and the opening of Section 1 to qualify the claim as applying 'under fixed learning rate and AdamW hyperparameters.' A brief note will be added to the discussion acknowledging that co-varying the learning rate with weight decay remains unablated and is outside the current scope. This revision preserves the empirical utility of the diagnostics and fits within the tested domain while accurately reflecting the experimental controls.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical fits and diagnostics are data-driven without reduction to inputs

The paper reports experimental results on weight-decay regimes in small transformers trained on modular arithmetic. The logistic localization of λ_c=0.0158 and power-law exponent ν=0.757 are obtained from direct fits to observed phase boundaries across 210+ runs and multiple conditions; these are explicitly labeled empirical with confidence intervals and no universality claim. The two online diagnostics (mean pairwise attention-head cosine similarity; entropy standard deviation) are defined from attention activations and validated against loss-landscape measures without any equation that equates them to the fitted parameters by construction. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps in the provided text. The work is self-contained as an empirical scoping study; the central claim that weight decay separates regimes is tested rather than presupposed.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The claim relies on two fitted parameters from data analysis and the domain assumption about the behavior of modular arithmetic tasks in transformers.

free parameters (2)

λ_c = 0.0158
Fitted via near-transition logistic fit to localize the memorization-to-developmental boundary from N=210 data points.
ν = 0.757
Fitted exponent from power-law to the transition data, with CI [0.725, 0.799].

axioms (1)

domain assumption: Modular arithmetic tasks exhibit the described sharp transitions between memorization, generalization, and collapse under transformer training.
This is the foundational observation the control parameter is applied to.

pith-pipeline@v0.9.0 · 5895 in / 1398 out tokens · 59917 ms · 2026-05-21T07:26:48.643755+00:00 · methodology

The analysis compares measured participation ratio against the value predicted from the eigenvalue coefficient of variation after correcting the stored sample standard deviation to a population standard deviation, then applies the affine normalization used by the released JSON artifacts. Across 183 valid layer-epoch rows, the mean absolute raw-PR error is $2.10 \times 10^{-7}$, the maximum raw-PR error is $1.73 \times 10^{-6}$, and the maximum affine-normalized PR error is $2.56 \times 10^{-7}$.

… eliminates contamination from the late cycle and yields κ values {18.4, 19.2, 17.9, 8.1, 8.6} across the same 5 seeds, with bound λ_c ∈ {0.0125, 0.0120, 0.0128, 0.0284, 0.0268}. Three of five cohorts now fall in the empirical 95% CI; the across-cohort mean shifts to $\lambda_c^{\text{bound}} = 0.0185 \pm 0.0074$ (range [0.012, 0.028]), with the mean itself inside the empirical CI. The re...

[1] [2]

Sarwan Ali

URLhttps://arxiv.org/abs/ 2603.15492. Sarwan Ali. Critical windows of complexity control: When transformers decide to reason or memorize.arXiv preprint arXiv:2605.04396,

work page arXiv

[2] [3]

Critical Windows of Complexity Control: When Transformers Decide to Reason or Memorize

URLhttps://arxiv.org/abs/2605.04396. Yasaman Bahri, Jonathan Kadmon, Jeffrey Pennington, Sam S Schoenholz, Jascha Sohl-Dickstein, and Surya Ganguli. Statistical mechanics of deep learning.Annual Review of Condensed Matter Physics, 11:501–528,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [4]

Yuda Bi, Chenyu Zhang, Qiheng Wang, and Vince D. Calhoun. Grokking as a falsifiable finite-size transition. arXiv preprint arXiv:2603.24746,

work page arXiv

[4] [5]

doi: 10.1103/PhysRevResearch.6.033098. URL https://arxiv.org/abs/2312.03012. Siyu Chen, Heejune Sheen, Tianhao Wang, and Zhuoran Yang. Unveiling induction heads: Provable training dynamics and feature learning in transformers. In NeurIPS,

[5] [6]

doi: 10.1073/pnas.1520428113. Francesco D’Angelo, Maksym Andriushchenko, Aditya Varre, and Nicolas Flammarion. Why do we need weight decay in modern deep learning? arXiv preprint arXiv:2310.04415,

[6] [7]

URL https://arxiv.org/abs/2310.04415. Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, Roger Grosse, Sam McCandlish, Jared Kaplan, Dario Amodei, Martin Wattenberg, and Christopher Olah. Toy models of superposition. Transformer Circuits Thread, Anthropic,

[7] [8]

URL https://transformer-circuits.pub/2022/toy_model/index.html. Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In ICLR,

[8] [10]

URL https://arxiv.org/abs/2206.05794. Shreel Golwala. ILDR: Geometric early detection of grokking. arXiv preprint arXiv:2604.20923,

[9] [11]

URL https://arxiv.org/abs/2604.20923. Laura Gomezjurado Gonzalez. The long delay to arithmetic generalization: When learned representations outrun behavior. arXiv preprint arXiv:2604.13082,

[10] [12]

URL https://arxiv.org/abs/2604.13082. Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752,

[11] [14]

URL https://arxiv.org/abs/2602.14445. Max Hennick and Guillaume Corlouer. From density matrices to phase transitions in deep learning: Spectral early warnings and interpretability. arXiv preprint arXiv:2603.29805,

[12] [15]

URL https://arxiv.org/abs/2603.29805. Bill Z Jia, Yitong Qi, J David Wong-Campos, Sean G Megason, and Adam E Cohen. A bioelectrical phase transition patterns the first vertebrate heartbeats. Nature, 622(7981):149–155,

[13] [16]

Clare Lyle, Gharda Sokar, Razvan Pascanu, and Andras Gyorgy. What can grokking teach us about learning under nonstationarity? arXiv preprint arXiv:2507.20057,

[14] [17]

URL https://arxiv.org/abs/2311.18817. Shalima Binta Manir and Anamika Paul Rupa. A systematic empirical study of grokking: Depth, architecture, activation, and regularization. arXiv preprint arXiv:2603.25009,

[15] [18]

URL https://arxiv.org/abs/2603.25009. Charles H. Martin, Tongsu Peng, and Michael W. Mahoney. Predicting trends in the quality of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12(1):4122,

[16] [19]

doi: 10.1038/s41467-021-24025-8. URL https://arxiv.org/abs/2002.06716. Eric J Michaud, Ziming Liu, Uzay Girit, and Max Tegmark. The quantization model of neural scaling. In NeurIPS,

[17] [21]

URL https://arxiv.org/abs/2511.01938. Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, and Ard A. Louis. An exactly solvable model for emergence and scaling laws in the multitask sparse parity problem. In NeurIPS,

[18] [22]

URL https://arxiv.org/abs/2404.17563. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In ICLR,

[19] [23]

URL https://arxiv.org/abs/2311.03260. Ido Nitsan, Stavit Drori, Yair E. Lewis, Shlomi Cohen, and Shelly Tzlil. Mechanical communication in cardiac cell synchronized beating. Nature Physics, 12(5):472–477,

[20] [24]

doi: 10.1038/nphys3619. Catherine Olsson, Nelson Elhage, Neel Nanda, et al. In-context learning and induction heads. Transformer Circuits Thread, Anthropic,

[21] [25]

Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177,

[22] [26]

Hari K. Prakash and Charles H. Martin. Late-stage generalization collapse in grokking: Detecting anti-grokking with Weightwatcher. arXiv preprint arXiv:2602.02859, 2026a. URL https://arxiv.org/abs/2602.02859. Hari K. Prakash and Charles H. Martin. Detecting overfitting in neural networks during long-horizon grokking using random matrix theory. arXiv preprint arXiv:260...

[23] [27]

URL https://arxiv.org/abs/2603.03993. Andrew M Saxe, James L McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR,

[24] [28]

doi: 10.1073/pnas.1820226116. URL https://arxiv.org/abs/1810.10531. Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo. Are emergent abilities of large language models a mirage? In NeurIPS,

[25] [29]

URL https://arxiv.org/abs/2304.15004. Jamie Simon, Daniel Kunin, Alexander Atanasov, Enric Boix-Adserà, Blake Bordelon, Jeremy Cohen, Nikhil Ghosh, Florentin Guth, Arthur Jacot, Mason Kamb, Dhruva Karkada, Eric J. Michaud, Berkan Ottlik, and Joseph Turnbull. There will be a scientific theory of deep learning. arXiv preprint arXiv:2604.21691,

[26] [31]

URL https://arxiv.org/abs/2602.06702. Yiding Song and Hanming Ye. Model capacity determines grokking through competing memorisation and generalisation speeds. arXiv preprint arXiv:2605.09724,

[27] [32]

URL https://arxiv.org/abs/2605.09724. Yifan Tang, Qiquan Wang, Inés García-Redondo, and Anthea Monod. Topological signatures of grokking. arXiv preprint arXiv:2605.06352,

[28] [33]

URL https://arxiv.org/abs/2605.06352. Yuandong Tian. Provable scaling laws of feature emergence from learning dynamics of grokking. arXiv preprint arXiv:2509.21519,

[29] [34]

doi: 10.48550/arXiv.2509.21519. URL https://arxiv.org/abs/2509.21519. Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, and Phan Thanh Duc. The norm-separation delay law of grokking: A first-principles theory of delayed generalization. arXiv preprint arXiv:2603.13331, 2026a. Truong Xuan Khanh, Truong Quynh Hoa, Luu Duc Trung, and Phan Thanh Duc. Spectral ...

[30] [35]

George Wang, Jesse Hoogland, Stan van Wingerden, Zach Furman, and Daniel Murfet. Differentiation and specialization of attention heads via the refined local learning coefficient. arXiv preprint arXiv:2410.02984,

[31] [36]

Ping Wang. Grokking as dimensional phase transition in neural networks. arXiv preprint arXiv:2604.04655, 2026a. URL https://arxiv.org/abs/2604.04655. Ping Wang. Dimensional criticality at grokking across MLPs and transformers. arXiv preprint arXiv:2604.16431, 2026b. Xi Wang and Laurence Aitchison. How to set adamw’s weight decay as you scale model and datase...

[32] [38]

Distributional Spectral Diagnostics for Localizing Grokking Transitions

URL https://arxiv.org/abs/2605.08237. Jason Wei, Yi Tay, Rishi Bommasani, et al. Emergent abilities of large language models. TMLR,

[33] [39]

Mingyue Xu, Gal Vardi, and Itay Safran. To grok grokking: Provable grokking in ridge regression. arXiv preprint arXiv:2601.19791,

[34] [40]

Spectral Edge Dynamics: An Analytical-Empirical Study of Phase Transitions in Neural Network Training

URL https://arxiv.org/abs/2601.19791. Yongzhong Xu. Early-warning signals of grokking via loss-landscape geometry. arXiv preprint arXiv:2602.16967, 2026a. Yongzhong Xu. Spectral edge dynamics reveal functional modes of learning. arXiv preprint arXiv:2604.06256, 2026b. URL https://arxiv.org/abs/2604.06256. Yongzhong Xu. Low-dimensional and transversely curv...

[35] [41]

The Geometric Inductive Bias of Grokking: Bypassing Phase Transitions via Architectural Topology

URL https://arxiv.org/abs/2603.05228. Junjie Zhang, Zhen Shen, Gang Xiong, and Xisong Dong. Grokking from abstraction to intelligence. arXiv preprint arXiv:2603.29262,

[36] [42]

doi: 10.48550/arXiv.2603.29262. URL https://arxiv.org/abs/2603.29262. Xiaotian Zhang, Yue Shang, Entao Yang, and Ge Zhang. Is grokking a computational glass relaxation? arXiv preprint arXiv:2505.11411,

[37] [43]

The analysis compares measured participation ratio against the value predicted from the eigenvalue coefficient of variation after correcting the stored sample standard deviation to a population standard deviation, then applies the affine normalization used by the released JSON artifacts. Across 183 valid layer-epoch rows, the mean absolute raw-PR error is 2.10×10⁻⁷, the maximum raw-PR error is 1.73×10⁻⁶, and the maximum affine-normalized PR error is 2.56×10⁻⁷.
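The comparison described here rests on an exact identity: for n eigenvalues with mean μ and population standard deviation σ, the participation ratio (Σλ)²/Σλ² equals n/(1 + CV²), where CV = σ/μ. The sketch below illustrates that identity under the assumption that the stored standard deviation uses the n − 1 denominator and must be rescaled by (n − 1)/n before use. The function names and the toy spectrum are hypothetical and not taken from the released artifacts, and the affine normalization is omitted because its form is not given in this excerpt.

```python
import statistics

def participation_ratio(eigs):
    """Measured PR: (sum of eigenvalues)^2 / (sum of squared eigenvalues)."""
    s1 = sum(eigs)
    s2 = sum(e * e for e in eigs)
    return s1 * s1 / s2

def pr_from_sample_cv(n, mean, sample_std):
    """Predicted PR from summary statistics that store a sample (n - 1) std."""
    pop_var = sample_std ** 2 * (n - 1) / n   # sample -> population variance
    cv_sq = pop_var / mean ** 2               # squared coefficient of variation
    return n / (1.0 + cv_sq)

# Toy spectrum (illustrative values only).
eigs = [4.0, 2.0, 1.0, 0.5, 0.25]
measured = participation_ratio(eigs)
predicted = pr_from_sample_cv(len(eigs), statistics.mean(eigs), statistics.stdev(eigs))
error = abs(measured - predicted)
print(measured, predicted, error)
```

Because the identity is exact, any residual is floating-point noise. Skipping the sample-to-population correction would instead bias the prediction by an amount that grows as n shrinks.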

