First-Passage Prediction of Grokking Delay: A Calibrated Law under AdamW with Causal Validation

Luu Duc Trung; Phan Thanh Duc; Truong Quynh Hoa; Truong Xuan Khanh

arxiv: 2605.18845 · v1 · pith:APJ3VEKJnew · submitted 2026-05-13 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-20 21:09 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords grokking · AdamW · parameter norm · first-passage time · generalization delay · weight decay · modular arithmetic · neural network training

The pith

A logarithmic law based on squared parameter norm growth predicts grokking delay after memorization under AdamW.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to give the first closed-form prediction for how much later generalization appears after a model has memorized its training data when trained with AdamW. It models the extra delay as the waiting time until the squared length of the parameters crosses an architecture-specific threshold, yielding an explicit formula that depends on the weight-decay strength and a stable correction factor measured per architecture. A sympathetic reader would care because the formula turns an after-the-fact observation into a quantity that can be calibrated once and then used to forecast timing on new runs, and because direct interventions on the norm are shown to remove the delay entirely.

Core claim

Treating the delay as a first-passage time of the squared parameter norm V_t, the paper derives T_grok - T_mem = (1 / (2 kappa_LL eta lambda)) log(V_mem / V_star). kappa_LL is measured once per architecture and is stable enough that calibrating on a single hyperparameter cell forecasts delays on 26 held-out runs with 17.7 percent MAPE across a 41-fold range of delays. The same law extends to MLPs at 18 percent error. The paper also supplies a quantile-margin theorem and causal interventions that freeze the norm or remove weight decay and thereby eliminate grokking.
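The law can be evaluated directly once the two calibrated constants are known. The following is a minimal sketch, not the authors' code: the function name `predicted_grokking_delay` is hypothetical, `kappa_ll=0.252` and `v_star=2501` are the calibration values quoted in the Figure 2 caption, and `v_mem`, `eta`, and `lam` are made-up illustrative inputs.

```python
import math


def predicted_grokking_delay(v_mem, v_star, kappa_ll, eta, lam):
    """Return the predicted delay T_grok - T_mem in optimizer steps.

    Implements (1 / (2 * kappa_ll * eta * lam)) * log(v_mem / v_star).
    """
    if v_mem <= 0 or v_star <= 0:
        raise ValueError("squared norms must be positive")
    if kappa_ll <= 0 or eta <= 0 or lam <= 0:
        raise ValueError("kappa_ll, eta and lam must be positive")
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)


# Illustrative inputs only: v_mem, eta and lam are assumptions, not paper values.
delay = predicted_grokking_delay(
    v_mem=5000.0, v_star=2501.0, kappa_ll=0.252, eta=1e-3, lam=1.0
)

# Halving the weight decay doubles the predicted delay.
delay_half_wd = predicted_grokking_delay(
    v_mem=5000.0, v_star=2501.0, kappa_ll=0.252, eta=1e-3, lam=0.5
)
```

A non-positive return value (when `v_mem <= v_star`) corresponds to the case where the norm is already at or below the threshold at memorization, so the law predicts no delay phase.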

What carries the argument

First-passage time of the squared parameter norm V_t crossing an architecture-dependent threshold V_star, which converts the ratio of norms at memorization and threshold into a logarithmic delay scaled by the effective contraction rate under AdamW.
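The conversion from a norm ratio to a logarithmic delay can be written out as a short worked derivation. This is a sketch that assumes V_t contracts at a constant exponential rate 2 kappa_LL eta lambda throughout the delay phase; the continuous-time form is an assumption made here for readability and is not taken from the paper's proof.

```latex
\begin{align*}
\frac{dV_t}{dt} &= -2\,\kappa_{LL}\,\eta\,\lambda\,V_t,
  \qquad t \ge T_{\mathrm{mem}},\quad V_{T_{\mathrm{mem}}} = V_{\mathrm{mem}} \\
V_t &= V_{\mathrm{mem}}\,
  \exp\!\bigl(-2\,\kappa_{LL}\,\eta\,\lambda\,(t - T_{\mathrm{mem}})\bigr) \\
V_{T_{\mathrm{grok}}} &= V_\star
  \;\Longrightarrow\;
  T_{\mathrm{grok}} - T_{\mathrm{mem}}
  = \frac{1}{2\,\kappa_{LL}\,\eta\,\lambda}\,
    \log\frac{V_{\mathrm{mem}}}{V_\star}
\end{align*}
```

The delay is positive only when V_mem exceeds V_star, which is the norm-separation requirement in this simplified picture.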

If this is right

Calibrating kappa_LL and V_star once lets the law forecast delays on dozens of held-out runs and architectures with MAPE near 18 percent.
Freezing the norm or removing weight decay at the memorization point stops grokking from occurring.
Positive delay requires both a norm separation V_mem greater than V_post and the angular reachability condition given by the quantile-margin theorem.
The ratio V_star over V_mem remains comparatively stable inside one architecture even when absolute values change.
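The norm-freezing intervention in the list above can be illustrated on a toy parameter vector. This sketch assumes that "freezing the norm" means rescaling the parameters back to their squared norm at memorization after every optimizer step; this page does not say how the paper implements the intervention, so `freeze_norm` and the gradient-free toy update are hypothetical.

```python
import numpy as np


def freeze_norm(theta, target_v):
    """Rescale theta so its squared norm equals target_v, keeping its direction."""
    v = float(theta @ theta)
    if v <= 0.0:
        raise ValueError("cannot rescale a zero vector")
    return theta * np.sqrt(target_v / v)


rng = np.random.default_rng(0)
theta = rng.normal(size=32)
v_mem = float(theta @ theta)  # squared norm at the (hypothetical) memorization step
eta, lam = 1e-3, 1.0          # illustrative values, not from the paper

theta_free = theta.copy()
theta_frozen = theta.copy()
for _ in range(500):
    # Toy update: decoupled weight decay only, with a zero loss gradient.
    theta_free = theta_free * (1.0 - eta * lam)
    theta_frozen = freeze_norm(theta_frozen * (1.0 - eta * lam), v_mem)

v_free = float(theta_free @ theta_free)        # contracts below v_mem
v_frozen = float(theta_frozen @ theta_frozen)  # stays at v_mem
```

In the unfrozen run the squared norm keeps contracting, so it can eventually cross a threshold; in the frozen run it never moves, which is the condition the law says should block the first-passage event.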

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If norm growth really sets the clock, then deliberately slowing or accelerating that growth during training could be used to move the moment of generalization forward or backward on demand.
The within-architecture stability of the critical ratio suggests an underlying geometric invariant of the network that might be computed directly from the model structure rather than measured.
The same first-passage framing could be tested on other delayed-generalization settings, such as training on natural-language data or with different optimizers, to see whether norm dynamics remain the governing mechanism.

Load-bearing premise

The correction factor kappa_LL stays roughly constant within a given architecture so that a single measured value works for new runs and hyperparameter choices.

What would settle it

Measure the actual time from memorization to grokking on a new run, compute the predicted delay from the observed log ratio of the two norms divided by twice the calibrated kappa_LL times eta lambda, and check whether the numbers agree within the reported error band.
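The agreement check described above reduces to a percentage error per run and a mean over runs. Below is a minimal sketch with hypothetical helper names and made-up delays; the standard MAPE definition is used, which is assumed to match the paper's.

```python
def absolute_percentage_error(observed, predicted):
    """Percentage error of one predicted delay relative to the observed delay."""
    if observed <= 0:
        raise ValueError("observed delay must be positive")
    return 100.0 * abs(predicted - observed) / observed


def mape(observed_delays, predicted_delays):
    """Mean absolute percentage error over paired observed/predicted delays."""
    if not observed_delays or len(observed_delays) != len(predicted_delays):
        raise ValueError("need two non-empty sequences of equal length")
    errors = [
        absolute_percentage_error(o, p)
        for o, p in zip(observed_delays, predicted_delays)
    ]
    return sum(errors) / len(errors)


# Hypothetical delays in optimizer steps, for illustration only.
observed = [1000.0, 2500.0, 8000.0]
predicted = [1100.0, 2250.0, 8800.0]
run_mape = mape(observed, predicted)

# A single new run "agrees" if its error sits inside the reported band (about 18%).
agrees = absolute_percentage_error(observed[0], predicted[0]) <= 18.0
```

Runs with zero observed delay (such as the sparse-parity case in Figure 4) have no defined percentage error, which is why the helper rejects them.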

Figures

Figures reproduced from arXiv: 2605.18845 by Luu Duc Trung, Phan Thanh Duc, Truong Quynh Hoa, Truong Xuan Khanh.

**Figure 1.** Phenomenon, mechanism, prediction. (a) Standard grokking on modular addition: train accuracy reaches 99% at T_mem; val accuracy delays to T_grok. (b) Parameter norm V_t = ∥θ_t∥² decays exponentially during the delay phase (R² = 0.97 per-trajectory). (c) Empirical decay rate vs. clean-SGD theoretical 2ηλ across N = 39 1-layer runs (R² > 0.9): parallel but offset, defining the AdamW correction κ_LL ≈ 0.24 (withi…

**Figure 2.** Structured variability of κ. (a) κ vs λ at fixed (η, p). (b) κ vs p at fixed (η, λ). (c) κ across task families. Pooled median κ = 0.268 (red dashed) is contained within the IQR of every cell with N ≥ 3. Predictive validation. Calibrating κ_train = 0.252 and V_⋆^train = 2501 on the headline cell, we predict T_grok on held-out runs. Method A (ρ-conditional, requires post-grok V_post) gives an MAPE of 32.8–37.4…

**Figure 3.** Norm-direction decoupling on Block F. (a) V_t: F1/F2 contract; F3 frozen; F4 grows. (b) α_t: F1/F2 saturate to roughly 75–80° passing through α_⋆ ∈ [39°, 53°] (grey); F3/F4 plateau near 12°. (c) Phase diagram (α, V): ◦ = T_mem, ⋆ = grokking transition. (d) τ_V ≪ τ_α for all grokking trajectories. both (i) norm separation V_mem > V_post (Corollary 1), and (ii) angular reachability: sup_t α_t ≥ α_⋆, where, in th…

**Figure 4.** Necessity dichotomy. (a) Sparse parity (n = 20, k = 3, MLP): train and validation accuracy reach 99% together; the post-grokking norm exceeds the memorisation norm (V_post/V_mem > 1), and there is no delay phase (T_grok − T_mem = 0). (b) Modular addition under SGD versus AdamW: AdamW groks within ∼10^4 steps; vanilla SGD does not reach the memorisation precondition within 20,000 steps, so the predictive law has…

**Figure 5.** Cross-architecture κ (Section 6.4). Historical N = 10 measurements shown for reference: three cells cluster near κ ≈ 0.26; 2-layer outlier (κ = 0.42, CV 58%, small-sample). The current values supersede the small-N estimate: paper-2L (manual residuals, no LayerNorm) κ = 0.370 ± 0.056 at N = 29 (CV 15%, p < 10^−9 vs. 1-layer); alt-2L (LayerNorm + biases) κ = 0.175 ± 0.018 at N = 30 (CV 10%). See Appendix K fo…

**Figure 6.** Causal ablation (Section 6.5). (a) F1, F2 decay; F3 constant; F4 grows. (b) F1, F2 grok at T ≈ 2–4k; F3, F4 chance through 30k (3/3). Augments…

**Figure 7.** Per-trajectory exponential fits confirm the theorem form.

Original abstract

We give the first quantitative prediction of grokking delay under AdamW. Treating the delay as a first-passage time, we derive a closed-form law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star), where V_t = ||theta_t||^2 is the squared parameter norm, V_star is an architecture-dependent threshold, and kappa_LL absorbs the AdamW correction to the clean-SGD contraction rate 2 eta lambda. Calibrating (kappa_LL, V_star) on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which V_star / V_mem stays comparatively stable within architecture (CV about 14% on the 1L transformer). First-passage of V_t is necessary but not sufficient. A quantile-margin theorem establishes that positive delay requires both norm separation V_mem > V_post and angular reachability of a threshold alpha_star = arcsin(C / V_T_mem^(1/2)), where C is computable from the empirical NTK feature map and the validation-margin quantile. Calibrating C on modulus p=89 predicts alpha_star = 47.2 degrees at p=97 (observed 47.8 degrees, error 1.3%) as a prior cross-cell prediction. Causal interventions that freeze the norm or remove weight decay at memorisation eliminate grokking (0/6 vs. 3/3 baseline), trapping the angular displacement near 12 degrees. kappa_LL is empirically measured per architecture rather than derived from (beta_1, beta_2, epsilon); within-architecture CV stays at most 15% across four architectures, but values differ by about 2x between architectural variants beyond depth alone. Empirical scope is algorithmic tasks (modular arithmetic, sparse parity) under AdamW; whether the law transfers to natural-language scale models is open.
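The angular threshold formula alpha_star = arcsin(C / V_T_mem^(1/2)) from the abstract can be exercised numerically. This is a sketch under stated assumptions: the paper computes C from the empirical NTK feature map and a validation-margin quantile, whereas `calibrate_c` below simply inverts the formula from one observed angle, and all numeric inputs are hypothetical.

```python
import math


def calibrate_c(alpha_star_deg, v_mem):
    """Invert alpha_star = arcsin(C / sqrt(v_mem)) to recover C (an assumption:
    the paper derives C from NTK features, not from an observed angle)."""
    return math.sin(math.radians(alpha_star_deg)) * math.sqrt(v_mem)


def predict_alpha_star_deg(c, v_mem):
    """Predicted threshold angle in degrees for a given squared norm at T_mem."""
    ratio = c / math.sqrt(v_mem)
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("C / sqrt(V) must lie in [0, 1] for arcsin to be defined")
    return math.degrees(math.asin(ratio))


# Hypothetical calibration cell: angle and squared norm are illustrative.
c = calibrate_c(alpha_star_deg=45.0, v_mem=4000.0)

alpha_same_cell = predict_alpha_star_deg(c, 4000.0)   # recovers the calibration angle
alpha_larger_norm = predict_alpha_star_deg(c, 4400.0)  # larger V_T_mem, smaller angle
```

With C held fixed, a larger squared norm at memorization lowers the threshold angle, so the cross-cell prediction depends only on how V_T_mem changes between cells.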

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a calibrated first-passage formula for grokking delay under AdamW that predicts held-out runs at 18% MAPE after fitting two constants on one cell, with clean causal support for the norm-separation idea.

The main thing to know is that this paper treats grokking delay as a first-passage time for the squared parameter norm and arrives at an explicit formula T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star). They calibrate kappa_LL and V_star on a single hyperparameter cell, then report 17.7% MAPE on 26 held-out runs spanning a 41x range in delay, plus similar numbers on MLPs and a bit worse on cross-task tests. A quantile-margin result also lets them predict the critical angle alpha_star across moduli with 1.3% error.

The causal interventions are straightforward: freezing the norm or cutting weight decay at memorization time eliminates grokking in their small trials, which matches the claim that norm separation plus angular reachability is required. That combination of derivation, calibration test, and intervention is the concrete advance here.

The soft spot is exactly what the stress-test flags. kappa_LL is measured from data per architecture rather than obtained by linearizing the AdamW update; it varies by a factor of about two across architectures even though it stays within 15% CV inside one. This makes the scaling with eta and lambda depend on the fitted constant, and the calibration step uses data from the same experimental family, so the held-out predictions carry real circularity. The 18-23% error is usable for rough timing estimates but not tight enough to treat the expression as a precise law yet. Everything stays on small algorithmic tasks, so transfer beyond that remains open.

This is for people who work on grokking, optimization dynamics, or quantitative models of when generalization appears in overparameterized nets. A reader who wants a concrete handle on delay timing and is willing to calibrate per architecture will find the formula and the interventions worth testing. The paper shows clear engagement with the mechanics and has enough empirical grounding to deserve referee time rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The manuscript derives a closed-form first-passage law for grokking delay under AdamW, T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star), where V_t is the squared parameter norm, V_star is an architecture-dependent threshold, and kappa_LL is an empirically fitted scalar absorbing AdamW corrections to the SGD contraction rate. It calibrates (kappa_LL, V_star) on one hyperparameter cell to predict delays on 26 held-out runs (MAPE 17.7% over 41x range), extends to MLPs and cross-task settings, validates a quantile-margin theorem for angular threshold alpha_star via cross-cell prediction (1.3% error), and reports causal interventions freezing the norm or removing weight decay that eliminate grokking.

Significance. If the central relation holds, the work supplies a quantitative, testable link between norm contraction dynamics and generalization delay on algorithmic tasks, backed by causal evidence and moderate-error cross-predictions. The explicit calibration procedure and within-architecture stability of kappa_LL (CV <=15%) provide a practical starting point, though the empirical status of kappa_LL limits claims of a parameter-free law.

major comments (2)

[Abstract / central derivation] Abstract and derivation of the law: the closed-form delay expression absorbs all AdamW-specific effects (momentum buffers, decoupled weight decay, epsilon) into the scalar kappa_LL, which is measured empirically per architecture (within-architecture CV <=15%, but differing by ~2x across variants) rather than obtained by linearizing the AdamW update on post-memorization gradient statistics. Because the predicted scaling with eta and lambda rests on the stability of this multiplier, the manuscript should either derive kappa_LL from beta_1, beta_2, epsilon or demonstrate that deviations remain negligible outside the calibration cells; otherwise the functional form functions as a calibrated template rather than a derived law.
[Quantile-margin theorem and cross-cell validation] Validation and cross-cell prediction: the quantile-margin theorem establishes that positive delay requires both norm separation V_mem > V_post and angular reachability of alpha_star = arcsin(C / V_T_mem^(1/2)), with C computed from the empirical NTK feature map and validation-margin quantile. While the cross-cell test (calibrate C on p=89, predict alpha_star=47.2° at p=97, observed 47.8°, error 1.3%) is a strong point, the manuscript must specify the exact procedure for extracting C and the NTK feature map to permit independent verification.

minor comments (2)

The abstract reports MAPE values and ranges (17.7% on 26 runs, 18.0% on 34 MLP runs, 23.3% on 46 cross-task runs) but does not list the precise hyperparameter cells used for calibration versus held-out sets; adding a table or explicit enumeration would improve reproducibility.
Notation for V_t = ||theta_t||^2 and the distinction between V_mem, V_post, and V_star is introduced without a dedicated nomenclature section; a short table of symbols would reduce ambiguity when reading the first-passage argument.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback. We appreciate the recognition of the quantitative predictions, cross-cell validation, and causal interventions. We respond to each major comment below, indicating where the manuscript will be revised for clarity and reproducibility while preserving the empirical grounding of the results.

Referee: [Abstract / central derivation] Abstract and derivation of the law: the closed-form delay expression absorbs all AdamW-specific effects (momentum buffers, decoupled weight decay, epsilon) into the scalar kappa_LL, which is measured empirically per architecture (within-architecture CV <=15%, but differing by ~2x across variants) rather than obtained by linearizing the AdamW update on post-memorization gradient statistics. Because the predicted scaling with eta and lambda rests on the stability of this multiplier, the manuscript should either derive kappa_LL from beta_1, beta_2, epsilon or demonstrate that deviations remain negligible outside the calibration cells; otherwise the functional form functions as a calibrated template rather than a derived law.

Authors: We agree that kappa_LL is obtained empirically rather than derived by linearizing the full AdamW update rule on post-memorization gradients. A complete analytic derivation would require strong assumptions on the statistics of the gradients after memorization, which vary with task and architecture and are not yet available in closed form. The manuscript already reports within-architecture stability (CV at most 15% across four architectures) and notes the roughly 2x variation across architectural families. In revision we will (i) explicitly label the expression a calibrated first-passage law whose scaling with eta and lambda is inherited from the underlying SGD contraction modulated by the stable empirical multiplier kappa_LL, and (ii) add a short discussion of why a parameter-free derivation from (beta_1, beta_2, epsilon) remains open. We do not claim the law is fully parameter-free. revision: partial
Referee: [Quantile-margin theorem and cross-cell validation] Validation and cross-cell prediction: the quantile-margin theorem establishes that positive delay requires both norm separation V_mem > V_post and angular reachability of alpha_star = arcsin(C / V_T_mem^(1/2)), with C computed from the empirical NTK feature map and validation-margin quantile. While the cross-cell test (calibrate C on p=89, predict alpha_star=47.2° at p=97, observed 47.8°, error 1.3%) is a strong point, the manuscript must specify the exact procedure for extracting C and the NTK feature map to permit independent verification.

Authors: We agree that the precise extraction procedure for C must be stated explicitly. The current manuscript describes C as computed from the empirical NTK feature map and the validation-margin quantile but does not enumerate the algorithmic steps. In the revised version we will add a dedicated paragraph (and, if space permits, an appendix) that details: (1) the finite-difference approximation used to obtain the empirical NTK on the training set after memorization, (2) the construction of the feature map from the top-k eigenvectors, and (3) the exact quantile selection rule applied to the per-example validation margins. This addition will make the cross-cell prediction of alpha_star fully reproducible. revision: yes
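The three steps the authors promise to document can be sketched on a toy model. Everything below is an assumption made for illustration: the tiny `model`, the central-difference Jacobian, the choice k = 3, the 10% quantile, and the random placeholder margins are hypothetical, and the way the paper combines these quantities into C is not specified on this page, so C itself is not computed.

```python
import numpy as np


def model(theta, x):
    """Hypothetical toy scalar-output network: 3 inputs, 4 tanh hidden units."""
    w1 = theta[:12].reshape(3, 4)
    w2 = theta[12:16]
    return np.tanh(x @ w1) @ w2


def fd_jacobian(f, theta, x, h=1e-5):
    """Step 1: central finite-difference Jacobian of outputs w.r.t. parameters."""
    jac = np.empty((x.shape[0], theta.size))
    for j in range(theta.size):
        step = np.zeros(theta.size)
        step[j] = h
        jac[:, j] = (f(theta + step, x) - f(theta - step, x)) / (2.0 * h)
    return jac


def top_k_features(kernel, k):
    """Step 2: feature map from the top-k eigenpairs of a symmetric PSD kernel."""
    eigvals, eigvecs = np.linalg.eigh(kernel)  # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]
    scale = np.sqrt(np.clip(eigvals[idx], 0.0, None))
    return eigvecs[:, idx] * scale


rng = np.random.default_rng(0)
theta = rng.normal(size=16)
x_train = rng.normal(size=(8, 3))

jac = fd_jacobian(model, theta, x_train)
ntk = jac @ jac.T                      # empirical NTK Gram matrix on the inputs
features = top_k_features(ntk, k=3)

# Step 3: quantile of per-example margins (placeholder values, not real margins).
margins = rng.normal(size=8)
margin_quantile = float(np.quantile(margins, 0.1))
```

A finite-difference Jacobian costs two forward passes per parameter, so it is only practical for small models like the algorithmic-task networks discussed here; automatic differentiation would be the usual substitute at larger scale.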

Circularity Check

1 steps flagged

kappa_LL and V_star calibrated on one cell to predict held-out delays

specific steps

fitted input called prediction [Abstract]
"Calibrating (kappa_LL, V_star) on a single hyperparameter cell predicts grokking delays on 26 held-out runs with MAPE 17.7% over a 41x delay range; the law generalises to MLPs (MAPE 18.0%, N=34) and degrades to 23.3% on cross-task extension (N=46, 43.5x range), with a structured residual in which V_star / V_mem stays comparatively stable within architecture (CV about 14% on the 1L transformer)."

The closed-form law T_grok - T_mem = (1 / 2 kappa_LL eta lambda) log(V_mem / V_star) incorporates kappa_LL as an architecture-dependent constant that 'absorbs the AdamW correction' and is 'empirically measured per architecture rather than derived from (beta_1, beta_2, epsilon)'. By fitting both kappa_LL and V_star on one cell and then using those values to generate predictions on held-out runs from the same family, the quantitative predictions are forced by the calibration data rather than obtained from first-principles derivation alone.

full rationale

The paper derives the functional form of the delay law from a first-passage model of parameter-norm contraction, but the effective rate factor kappa_LL (which absorbs all AdamW-specific effects) and the threshold V_star are explicitly calibrated on a single hyperparameter cell before the law is applied to held-out runs. This matches the fitted-input-called-prediction pattern: the claimed quantitative predictions on 26 held-out runs (and cross-architecture generalizations) are not parameter-free but depend on values fitted to data from the same experimental family. The derivation remains self-contained against external benchmarks for the functional form itself, but the central quantitative claim reduces to a calibrated template rather than a fully derived law. No self-citation chains, definitional loops, or ansatz smuggling are present in the provided text.
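The calibrate-then-predict pattern flagged here can be made concrete. This sketch assumes kappa_LL is already known and recovers V_star by inverting the law on each calibration run, then taking the median; the helper names, the median rule, and all run values are hypothetical rather than the paper's procedure.

```python
import math
from statistics import median


def implied_v_star(delay, v_mem, kappa_ll, eta, lam):
    """Threshold V_star implied by one observed delay, from inverting the law."""
    return v_mem * math.exp(-2.0 * kappa_ll * eta * lam * delay)


def predict_delay(v_mem, v_star, kappa_ll, eta, lam):
    """Forward law: delay predicted from the calibrated constants."""
    return math.log(v_mem / v_star) / (2.0 * kappa_ll * eta * lam)


# Hypothetical calibration-cell runs: (observed delay in steps, V_mem).
calibration_runs = [(1400.0, 5000.0), (1350.0, 4900.0), (1450.0, 5100.0)]
kappa_ll, eta, lam = 0.25, 1e-3, 1.0  # illustrative values

v_star = median(
    implied_v_star(d, v, kappa_ll, eta, lam) for d, v in calibration_runs
)

# Held-out cell with half the weight decay: the prediction reuses the fitted v_star.
held_out_delay = predict_delay(
    v_mem=5000.0, v_star=v_star, kappa_ll=kappa_ll, eta=eta, lam=0.5
)
```

The held-out number is fully determined by the fitted constants plus the new cell's hyperparameters and V_mem, which is the sense in which the prediction is calibrated rather than parameter-free.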

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on two calibrated free parameters (kappa_LL, V_star) whose values are fitted rather than derived, plus domain assumptions about the contraction rate of parameter norms under AdamW and the necessity of norm first-passage for grokking.

free parameters (2)

kappa_LL
Empirically measured correction factor that absorbs AdamW's effect on the clean-SGD contraction rate; measured per architecture with within-architecture CV <=15%.
V_star
Architecture-dependent threshold on squared parameter norm that must be reached for generalization.
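Measuring kappa_LL amounts to comparing a fitted decay rate with the clean-SGD rate 2 eta lambda, as in Figure 1(c). The sketch below fits a line to log V_t over the delay window on synthetic, noise-free data; `estimate_kappa`, the window, and every numeric value are assumptions for illustration, not the paper's fitting code.

```python
import numpy as np


def estimate_kappa(steps, v_values, eta, lam):
    """Estimate kappa_LL as (fitted decay rate of V_t) / (2 * eta * lam).

    The decay rate is minus the slope of a least-squares line through log V_t.
    """
    slope, _intercept = np.polyfit(steps, np.log(v_values), deg=1)
    return -slope / (2.0 * eta * lam)


# Synthetic trajectory with a known correction factor (illustrative values).
eta, lam, true_kappa = 1e-3, 1.0, 0.25
steps = np.arange(0, 2000, 10, dtype=float)
v_values = 5000.0 * np.exp(-2.0 * true_kappa * eta * lam * steps)

kappa_hat = estimate_kappa(steps, v_values, eta, lam)
```

On real trajectories the estimate depends on the fit window, which the paper's caveats on window choice also point out.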

axioms (2)

domain assumption Parameter-norm dynamics under AdamW follow a contraction rate 2 eta lambda modified by a constant kappa_LL
Invoked to obtain the closed-form first-passage expression in the abstract.
domain assumption First-passage of V_t is necessary (though not sufficient) for grokking
Explicitly stated in the abstract as the modeling premise.
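The clean-SGD rate in the first assumption can be checked numerically. This sketch assumes the simplest setting, where the loss gradient is zero and each step only applies weight decay, theta <- (1 - eta * lambda) * theta, so that V shrinks by (1 - eta * lambda)^2 per step; the numeric values are illustrative.

```python
import math


def exact_decay_rate(eta, lam):
    """Exact per-step log-contraction of V under theta <- (1 - eta*lam) * theta."""
    return -2.0 * math.log(1.0 - eta * lam)


eta, lam, kappa_ll = 1e-3, 1.0, 0.25  # illustrative values
v_start = 5000.0
n_steps = 1000

v = v_start
for _ in range(n_steps):
    v *= (1.0 - eta * lam) ** 2  # squared norm after one pure weight-decay step

empirical_rate = -math.log(v / v_start) / n_steps
clean_rate = 2.0 * eta * lam           # first-order approximation used in the law
effective_rate = kappa_ll * clean_rate  # rate after the AdamW correction factor
```

For small eta * lambda the exact rate and 2 eta lambda differ only at second order, so any large offset observed under AdamW has to be carried by kappa_LL.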

pith-pipeline@v0.9.0 · 5966 in / 1645 out tokens · 39266 ms · 2026-05-20T21:09:37.134076+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel (J-cost uniqueness) echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Tgrok − Tmem ≈ 1/(2 κLL η λ) log(Vmem / V⋆) … κLL absorbs the AdamW correction to the clean-SGD contraction rate 2 η λ
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean costAlphaLog_fourth_deriv_at_zero / J_uniquely_calibrated_via_higher_derivative echoes

ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Vt contracts exponentially … log-linear fit … rate-preserving observable

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

15 extracted references · 15 canonical work pages · 2 internal anchors

[1]

arXiv: 2310.04415 [cs.LG].url:https://arxiv.org/abs/2310.04415

URLhttps://arxiv.org/abs/2310.04415. Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, and Cyril Zhang. Hidden progress in deep learning: SGD learns parities near the computational limit. InAdvances in Neural Information Processing Systems (NeurIPS), 2022. URLhttps://arxiv.org/abs/2207.08799. Etienne Boursier, Scott Pesme, and Radu-A...

work page arXiv 2022
[2]

Yoonsoo Nam, Nayara Fonseca, Seok Hyeong Lee, Chris Mingard, and Ard A

URLhttps://arxiv.org/abs/2511.01938. Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. InInternational Conference on Learning Representations (ICLR),

work page arXiv
[3]

Progress measures for grokking via mechanistic interpretability

URLhttps://arxiv.org/abs/2301.05217. Pascal Jr. Tikeng Notsawo, Hattie Zhou, Mohammad Pezeshki, Irina Rish, and Guillaume Dumas. Predicting grokking long before it happens: A look into the loss landscape of models which grok. InICML Workshop on Neural Scaling Laws: Emergence and Phase Transitions, 2023. URL https://arxiv.org/abs/ 2306.13253. Alethea Power...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization

URLhttps://arxiv.org/abs/2603.13331. Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency.arXiv preprint arXiv:2309.02390, 2023. URL https://arxiv.org/abs/ 2309.02390. Ruosi Wan, Zhanxing Zhu, Xiangyu Zhang, and Jian Sun. Spherical motion dynamics: Learning dynamics of normalized neural ...

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

Naive computations using E[m̂/√v̂] ≠ E[m̂]/√E[v̂] are essential; the ratio of expectations differs significantly from the expectation of the ratio

Correlations between m_t and v_t. In the interpolation regime, both moments are driven by the same gradient noise process. Naive computations using E[m̂/√v̂] ≠ E[m̂]/√E[v̂] are essential; the ratio of expectations differs significantly from the expectation of the ratio

work page
[6]

Time dependence of bias-correction.The bias-correction terms 1/(1−β t

work page
[7]

Whether the contraction phase lies in the bias-corrected or asymptotic regime depends on hyperparameters

matter at the start of training but become irrelevant for t≫1/(1−β 2). Whether the contraction phase lies in the bias-corrected or asymptotic regime depends on hyperparameters

work page
[8]

derive κLL from (β1, β2, ϵ)

Coupling between coordinates.Off-diagonal Hessian terms induce gradient correlations across coor- dinates. A purely diagonal analysis (treating coordinates independently) may miss systematic biases, particularly for deeper architectures. D.5 Promising approaches AdamW SDE.Malladi et al. (2022) derive a stochastic differential equation that approximates Ad...

work page 2022
[9]

All30 seeds gave R2 >0.91 on the standard window and κ∈[0.146,0.217] , consistent with monotonic contraction

showed no such bimodality. All30 seeds gave R2 >0.91 on the standard window and κ∈[0.146,0.217] , consistent with monotonic contraction. The overshoot regime is thus an architecture-specific phenomenon, plausibly tied to the absence of LayerNorm and biases in the paper’s architecture. Characterising this connection is left for future work. Kosson-form ref...

work page 2024
[10]

Install dependencies:pip install -r requirements.txt

work page
[11]

Output:results/campaign1/

Campaign 1 (headline κ, 66 runs): python code/master_experiment_v5.py (or v8 for refined cross-arch). Output:results/campaign1/

work page
[12]

4.Campaign 4 (Block F causal, 12 runs):python code/master_experiment_v7.py

Campaign 3 (cross-arch, 24 runs):python code/master_experiment_v8.py --campaign cross_arch . 4.Campaign 4 (Block F causal, 12 runs):python code/master_experiment_v7.py. 5.Block H (36 runs):python code/run_block_H_cross_cell.py

work page
[13]

Paper-2L N= 29 + alt-2L N= 30 : python code/run_modular_transformer2_validation_v2.py andrun_modular_transformer2_alternative.py

work page
[14]

Then run the CPU verification steps above

work page
[15]

Output:paper/figures/

Figures: python paper/scripts/make_figures_v3.py (and make_fig5_v3.py, make_fig6_v3.py). Output:paper/figures/. Memory footprint<2GB per run. All runs are deterministic from the recorded seeds. M.4 Caveats and design decisions • Fit window choice (T 0.95 grok vs T 0.99 grok).For the paper’s 2-layer transformer (N= 29 ), the standard [Tmem + 100, T0.99 gro...

work page 2026
