The Norm-Separation Delay Law of Grokking: A First-Principles Theory of Delayed Generalization
Pith reviewed 2026-05-15 16:14 UTC · model grok-4.3
The pith
Grokking delay equals the inverse of the optimizer contraction rate times the log of the memorizing-to-generalizing norm ratio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Grokking is a norm-driven representational phase transition in regularised training dynamics. The delay satisfies T_grok − T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff = ηλ for SGD and γ_eff ≥ ηλ for AdamW. The upper bound follows from a discrete Lyapunov contraction argument while the matching lower bound follows from the dynamical constraints of regularised first-order optimisation.
What carries the argument
The Norm-Separation Delay Law, which uses discrete Lyapunov contraction under regularization to quantify the time required for the smaller-norm generalizing interpolator to overtake the larger-norm memorizing one.
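As a rough illustration of this mechanism, the sketch below iterates pure weight-decay shrinkage, θ_{t+1} = (1 − ηλ)θ_t, and counts the steps needed for the squared norm to fall from the memorizing value to the post-grokking value. It assumes the loss gradient is negligible while the model interpolates, which is a simplification introduced here and not a statement from the paper. The function names and numerical values are hypothetical.

```python
import math


def steps_to_contract(norm_sq_mem, norm_sq_post, eta, lam):
    """Count weight-decay steps until the squared norm drops to norm_sq_post.

    Assumption (not from the paper): the loss gradient is zero, so each step
    only applies the shrinkage theta <- (1 - eta * lam) * theta.
    """
    shrink_sq = (1.0 - eta * lam) ** 2  # per-step factor on the squared norm
    norm_sq, steps = norm_sq_mem, 0
    while norm_sq > norm_sq_post:
        norm_sq *= shrink_sq
        steps += 1
    return steps


def law_estimate(norm_sq_mem, norm_sq_post, eta, lam):
    """Leading-order estimate log(ratio) / (2 * eta * lam) for the same count."""
    return math.log(norm_sq_mem / norm_sq_post) / (2.0 * eta * lam)


# Hypothetical values, chosen only for illustration.
simulated = steps_to_contract(100.0, 4.0, eta=1e-3, lam=1.0)
estimate = law_estimate(100.0, 4.0, eta=1e-3, lam=1.0)
print(simulated, round(estimate))
```

Under these assumptions the simulated count and the closed-form estimate agree closely, and halving the weight decay roughly doubles the count, which is the inverse scaling the law predicts.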
If this is right
- Grokking delay scales inversely with weight decay strength across tasks.
- Grokking delay scales inversely with learning rate.
- Grokking occurs reliably with AdamW but fails entirely with SGD at identical hyperparameters.
- A simple three-input predictor using contraction rate, norm ratio, and memorization time achieves 34.6 percent mean absolute error on held-out runs.
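The form of the three-input predictor is not spelled out here. One plausible reading, sketched below, plugs the three inputs into the delay law with a proportionality constant `c` that the Θ notation leaves unspecified. The constant `c`, the function names, and the sample numbers are hypothetical and are not taken from the paper.

```python
import math


def predict_grok_step(t_mem, gamma_eff, norm_ratio_sq, c=1.0):
    """Predict the grokking step from quantities available at memorization.

    t_mem:         step at which training accuracy saturates
    gamma_eff:     effective contraction rate (eta * lambda for SGD)
    norm_ratio_sq: ||theta_mem||^2 / ||theta_post||^2, must exceed 1
    c:             hypothetical constant hidden by the Theta notation
    """
    if gamma_eff <= 0 or norm_ratio_sq <= 1:
        raise ValueError("need gamma_eff > 0 and norm_ratio_sq > 1")
    delay = c * math.log(norm_ratio_sq) / gamma_eff
    return t_mem + delay


def mean_abs_pct_error(predicted, observed):
    """Mean absolute error relative to the observed values, in percent."""
    errs = [abs(p - o) / o for p, o in zip(predicted, observed)]
    return 100.0 * sum(errs) / len(errs)


# Illustrative numbers only.
t_hat = predict_grok_step(t_mem=500, gamma_eff=1e-3, norm_ratio_sq=25.0, c=0.5)
print(round(t_hat, 1))
```

In practice `c` would have to be fitted on runs that are kept separate from the runs used to report the error.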
Where Pith is reading between the lines
- Measuring the norm ratio at the moment of memorization could allow early prediction of when generalization will appear.
- Optimizers or regularizers could be designed to control or shorten the delay by altering contraction rates or norm gaps.
- The same contraction-plus-norm-separation mechanism may govern other delayed-generalization phenomena beyond the tasks tested here.
Load-bearing premise
That grokking is caused by norm separation between two competing interpolating representations under regularization, with the generalizing solution having the smaller norm.
What would settle it
A controlled training run in which the observed grokking delay fails to scale inversely with weight decay or learning rate, or fails to show the predicted logarithmic dependence on the measured norm ratio at memorization time.
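One way to run the first part of this test is to regress log delay on log weight decay and check whether the slope is close to −1. The sketch below does this with `numpy.polyfit`. The delays are synthetic, generated from the law itself with multiplicative noise, so the code demonstrates the procedure only and reproduces no result from the paper. All values are hypothetical.

```python
import numpy as np


def loglog_slope(x, delay):
    """Least-squares slope and R^2 of log(delay) against log(x)."""
    lx, ly = np.log(x), np.log(delay)
    slope, intercept = np.polyfit(lx, ly, 1)
    resid = ly - (slope * lx + intercept)
    r2 = 1.0 - resid.var() / ly.var()
    return slope, r2


# Synthetic data (not from the paper): delay = log(ratio) / (eta * lambda) with noise.
rng = np.random.default_rng(0)
eta, norm_ratio_sq = 1e-3, 25.0
weight_decay = np.array([0.1, 0.2, 0.5, 1.0, 2.0])
noise = np.exp(rng.normal(0.0, 0.05, size=weight_decay.size))
delay = np.log(norm_ratio_sq) / (eta * weight_decay) * noise

slope, r2 = loglog_slope(weight_decay, delay)
print(f"slope={slope:.2f}, R^2={r2:.3f}")
```

A slope far from −1 on real runs, or a poor fit, would count against the inverse-scaling prediction. The same function can be applied with learning rate on the x-axis.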
Original abstract
Grokking -- the sudden generalisation that appears long after a model has perfectly memorised its training data -- has been widely observed but lacks a quantitative theory explaining the length of the delay. We show that grokking is a norm-driven representational phase transition in regularised training dynamics, and establish the Norm-Separation Delay Law: $T_{\mathrm{grok}} - T_{\mathrm{mem}} = \Theta(\gamma_{\mathrm{eff}}^{-1} \log(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2))$, where $\gamma_{\mathrm{eff}}$ is the optimiser's effective contraction rate ($\gamma_{\mathrm{eff}} = \eta\lambda$ for SGD, $\gamma_{\mathrm{eff}} \ge \eta\lambda$ for AdamW). The upper bound follows from a discrete Lyapunov contraction argument; the matching lower bound from dynamical constraints of regularised first-order optimisation. Across 293 training runs spanning modular addition, modular multiplication, and sparse parity, we confirm three falsifiable predictions: inverse scaling with weight decay ($R^2 = 0.97$), inverse scaling with learning rate ($R^2 = 0.92$), and logarithmic dependence on the norm ratio (Pearson $r = 0.91$). A fourth finding reveals that grokking requires an optimiser capable of decoupling memorisation from contraction: SGD fails entirely at the same hyperparameters where AdamW reliably groks. These results reframe grokking not as a mysterious optimisation artefact but as a predictable consequence of norm separation between competing interpolating representations. We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error (bootstrap 95% CI [30.0%, 39.4%], $N=60$ seeds), enabling principled early stopping.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that grokking arises as a norm-driven representational phase transition under regularized training. It establishes the Norm-Separation Delay Law T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimizer's effective contraction rate (ηλ for SGD, ≥ηλ for AdamW). The upper bound is derived from a discrete Lyapunov contraction argument on the quadratic norm penalty; the matching lower bound follows from dynamical constraints of regularized first-order optimization. Across 293 runs on modular addition, multiplication, and sparse parity, the work reports inverse scaling of delay with weight decay (R²=0.97) and learning rate (R²=0.92), logarithmic dependence on the norm ratio (Pearson r=0.91), failure of SGD to grok at hyperparameters where AdamW succeeds, and a three-input predictor achieving 34.6% MAE at memorization time.
Significance. If the derivation is completed, the result supplies the first quantitative, falsifiable scaling law for grokking delay grounded in optimization dynamics rather than phenomenology. The high R² fits, the explicit contrast between SGD and AdamW, and the practical early-stopping algorithm constitute clear strengths that could be directly useful for training analysis. The work reframes delayed generalization as a predictable consequence of norm separation between competing interpolators.
major comments (1)
- [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.
minor comments (2)
- The practical three-input predictor is announced with a 34.6% MAE but its exact inputs, training procedure, and bootstrap details are not fully specified in the provided text; a short algorithmic box or pseudocode would improve reproducibility.
- The definition of γ_eff for AdamW is given as ≥ηλ; an explicit expression or bound in terms of β1, β2, and ε would remove ambiguity when comparing optimizers.
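On the bootstrap details raised in the first minor comment: a percentile bootstrap is one common way to obtain an interval like the one quoted in the abstract. The paper's exact procedure is not specified in the text above, so the sketch below is an assumption about what such a computation could look like. The per-seed errors are synthetic placeholders.

```python
import numpy as np


def bootstrap_mean_ci(values, n_resamples=10_000, level=0.95, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `values`."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample indices with replacement; one row per bootstrap replicate.
    idx = rng.integers(0, values.size, size=(n_resamples, values.size))
    means = values[idx].mean(axis=1)
    tail = 100.0 * (1.0 - level) / 2.0
    lower, upper = np.percentile(means, [tail, 100.0 - tail])
    return values.mean(), lower, upper


# Synthetic per-seed absolute percentage errors (placeholders, not the paper's data).
rng = np.random.default_rng(1)
errors = rng.gamma(shape=3.0, scale=10.0, size=60)
mean, lower, upper = bootstrap_mean_ci(errors)
print(f"MAE={mean:.1f}% (95% CI [{lower:.1f}%, {upper:.1f}%])")
```

Reporting the resampling unit (seeds, here) and the number of resamples alongside the interval would address the reproducibility concern.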
Simulated Author's Rebuttal
We thank the referee for the careful and constructive review. The central concern is that the upper and lower bounds in the Norm-Separation Delay Law are stated at a high level without explicit derivations. We agree this must be remedied and will supply the complete proofs in the revision.
Point-by-point responses
-
Referee: [Abstract / Norm-Separation Delay Law statement] The central claim asserts both an upper and a matching lower bound for the Θ expression. The upper bound is attributed to a discrete Lyapunov contraction argument, yet the manuscript supplies only a high-level summary without the explicit sequence of inequalities, the precise Lyapunov function, or error terms. The lower bound is ascribed to 'dynamical constraints of regularised first-order optimisation' without a derivation showing that any trajectory must require at least Ω(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) steps before the smaller-norm solution can dominate the loss landscape. Until these steps are written out, the quantitative law reduces to an empirically supported scaling plus an unproven lower bound.
Authors: We acknowledge that the present manuscript presents the bounds at a summary level. In the revised version we will expand the dedicated proof section to include: (i) the explicit Lyapunov function V(θ) = ½‖θ‖² together with the full contraction inequality ‖θ_{t+1}‖² ≤ (1 − 2ηλ + O(η²L))‖θ_t‖² + η²‖∇L‖² under standard smoothness assumptions, yielding the O(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) upper bound with explicit remainder terms; (ii) the matching lower-bound argument showing that any first-order trajectory must take at least Ω(γ_eff^{-1} log(ratio)) steps for the smaller-norm interpolator to dominate, because the loss gap between the two competing solutions closes at a rate bounded by the same contraction factor and cannot be accelerated beyond it while both remain interpolators. These additions will render the Θ statement fully rigorous while preserving the main-text summary.
revision: yes
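To see where the logarithm in the upper bound comes from, the recursion in item (i) of the authors' response can be unrolled. The display below is a sketch that drops the O(η²L) and η²‖∇L‖² remainder terms, an assumption made here for readability that holds only when those terms are negligible. It writes γ = ηλ and assumes 0 < 2γ < 1.

```latex
\begin{align*}
\|\theta_{t+1}\|^2 &\le (1 - 2\gamma)\,\|\theta_t\|^2
  && \text{(one step, remainders dropped)} \\
\|\theta_{T_{\mathrm{mem}}+k}\|^2 &\le (1 - 2\gamma)^k\,\|\theta_{\mathrm{mem}}\|^2
  && \text{(unrolled over $k$ steps)} \\
(1 - 2\gamma)^k\,\|\theta_{\mathrm{mem}}\|^2 \le \|\theta_{\mathrm{post}}\|^2
  &\iff k \ge \frac{\log\!\left(\|\theta_{\mathrm{mem}}\|^2 / \|\theta_{\mathrm{post}}\|^2\right)}{-\log(1 - 2\gamma)}
  \approx \frac{1}{2\gamma}\,\log\frac{\|\theta_{\mathrm{mem}}\|^2}{\|\theta_{\mathrm{post}}\|^2}
\end{align*}
```

So on the order of γ^{-1} log(‖θ_mem‖² / ‖θ_post‖²) steps suffice for the norm bound to reach the post-grokking level, which has the shape of the claimed upper bound. The lower bound needs the separate argument promised in item (ii).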
Circularity Check
Delay law uses measured norm ratio at memorization as direct input to quantitative prediction
specific steps
-
fitted input called prediction
[Abstract (Norm-Separation Delay Law statement)]
"T_grok - T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)), where γ_eff is the optimiser's effective contraction rate... We further derive a practical three-input algorithm that predicts grokking delay at memorisation time with 34.6% mean absolute error"
The delay is expressed directly in terms of the norm ratio measured at T_mem; the 'prediction' algorithm therefore takes that observed ratio as an input rather than deriving the full delay length from hyperparameters and initial conditions alone. The Θ scaling is then fitted to data that already encodes the same norm separation.
full rationale
The central claim presents the Norm-Separation Delay Law as derived from a discrete Lyapunov contraction (upper bound) plus dynamical constraints (lower bound). However, the explicit formula for the delay directly incorporates the observed ‖θ_mem‖² / ‖θ_post‖² ratio measured at T_mem, and the practical three-input prediction algorithm is evaluated on that same measured ratio. This makes the quantitative output statistically dependent on post-memorization observations rather than a parameter-free derivation from hyperparameters alone. The empirical R² and Pearson correlations are reported on the same runs, but no self-citation chain or self-definitional loop is present; the derivation steps themselves are not shown to collapse by construction. Overall partial circularity from fitted-input usage.
Axiom & Free-Parameter Ledger
free parameters (1)
- γ_eff
axioms (2)
- domain assumption: Discrete Lyapunov contraction governs the upper bound on delay under regularized first-order optimization.
- domain assumption: Grokking arises as a representational phase transition driven by norm separation between memorizing and generalizing interpolators.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel (echoes)
Echoes: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
T_grok − T_mem = Θ(γ_eff^{-1} log(‖θ_mem‖² / ‖θ_post‖²)) ... upper bound follows from a discrete Lyapunov contraction argument
-
IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean: costAlphaLog_fourth_deriv_at_zero (unclear)
Unclear: relation between the paper passage and the cited Recognition theorem.
exponential contraction of parameter norms ... rate 1−ηλ
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
The Spectral Lifecycle of Transformer Training: Transient Compression Waves, Persistent Spectral Gradients, and the Q/K--V Asymmetry
Transformer weight spectra exhibit transient compression waves that propagate layer-wise, persistent non-monotonic depth gradients in power-law exponents, and Q/K-V asymmetry, with the spectral exponent alpha predicti...