arxiv: 2606.30388 · v1 · pith:ZZIHLJKMnew · submitted 2026-06-29 · 📊 stat.ML · cs.AI · cs.LG

A Stochastic–Geometric Theory of Scaling Laws in Grokking

Pith reviewed 2026-06-30 03:51 UTC · model grok-4.3

classification 📊 stat.ML · cs.AI · cs.LG
keywords grokking · scaling laws · shell-core topology · stopping-time theory · Adam optimization · parameter space geometry · memorization · generalization

The pith

Adam dynamics with weight shrinkage create a shell-core topology in parameter space that produces grokking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper characterizes the reachable solutions under Adam optimization with ℓ2 regularization as forming a thin outer spherical shell of random-initialization points, an intermediate spherical shell of memorization solutions, and an inner core of generalization solutions. This configuration forces trajectories to spend time in the memorization shell before reaching the generalization core, which accounts for the delay in grokking. Stopping-time theory applied to the geometry of these manifolds then yields explicit scaling relations for the transition time as a function of learning rate, batch size, and the regularization coefficient. The derived laws are checked against experiments and shown to recover known empirical patterns. A reader would care because the account ties the timing of generalization directly to the geometry induced by standard training choices.
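For orientation, one Adam update with an ℓ2 penalty can be sketched as follows. This is a minimal sketch, assuming the penalty is folded into the gradient (coupled ℓ2) rather than applied as decoupled weight decay; the paper's exact update rule is not reproduced on this page, and the function name `adam_l2_step` and the default hyperparameters are illustrative choices, not the authors'.

```python
import numpy as np

def adam_l2_step(theta, grad, m, v, t, lr=1e-3, lam=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step with an l2 penalty added to the gradient (assumed form)."""
    g = grad + lam * theta                  # shrinkage term pulls weights toward 0
    m = beta1 * m + (1 - beta1) * g         # first-moment estimate
    v = beta2 * v + (1 - beta2) * g * g     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)            # bias corrections (t starts at 1)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Toy run with zero data gradient: only the shrinkage term acts on the weights.
rng = np.random.default_rng(0)
theta = rng.normal(size=100)
r_start = np.linalg.norm(theta)
m, v = np.zeros_like(theta), np.zeros_like(theta)
for t in range(1, 101):
    theta, m, v = adam_l2_step(theta, np.zeros_like(theta), m, v, t, lr=1e-2)
r_end = np.linalg.norm(theta)
print(r_start, r_end)
```

In this toy run the parameter norm contracts, which is the radial shrinkage the shell-core picture relies on; the run says nothing about memorization or generalization.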

Core claim

In the model's parameter space, random initialization solutions concentrate on a thin outer spherical shell, enclosing another spherical shell of memorization solutions, which in turn contains a core corresponding to the generalization solutions. This optimization-induced topological configuration gives rise to grokking. Leveraging stopping-time theory, the geometry of this configuration determines the solution transition time at which optimization trajectories escape the memorization manifold and first reach the boundary of the generalization manifold, producing scaling laws for grokking with respect to learning rate, batch size, and the ℓ2 regularization coefficient.
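The outer shell in this claim follows from a standard concentration fact: for θ0 ∼ N(0, σ² I_p), the norm has mean close to σ√p and standard deviation close to σ/√2, matching the radius and thickness quoted in the Figure 1 caption. The check below is a minimal numerical sketch; the values of `p`, `sigma`, and the sample count are arbitrary illustrative choices.

```python
import numpy as np

def init_radius_stats(p, sigma, n_samples=500, seed=0):
    """Sample Gaussian initializations and return (mean, std) of their norms."""
    rng = np.random.default_rng(seed)
    theta0 = rng.normal(0.0, sigma, size=(n_samples, p))
    radii = np.linalg.norm(theta0, axis=1)
    return radii.mean(), radii.std()

p, sigma = 4000, 0.02                         # illustrative values only
mean_r, std_r = init_radius_stats(p, sigma)
print("mean radius:", mean_r, "vs sigma*sqrt(p):", sigma * np.sqrt(p))
print("radius std :", std_r, "vs sigma/sqrt(2):", sigma / np.sqrt(2))
```

The shell is thin in the relative sense: the ratio of thickness to radius scales like 1/√(2p), so it vanishes as the parameter count grows.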

What carries the argument

The shell-core topological configuration of the reachable solution space under Adam with weight-shrinkage regularization, analyzed via stopping-time theory to compute escape times from the memorization manifold to the generalization core.
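To make the stopping-time idea concrete, the sketch below estimates a first-hitting time by Monte Carlo in a toy model. It assumes an isotropic Ornstein-Uhlenbeck process, dθ = −λθ dt + s dW, as a stand-in for shrinkage plus gradient noise; this is not the paper's Adam SDE, and `first_hitting_time`, `noise`, `r0`, and `r_target` are hypothetical names and values.

```python
import numpy as np

def first_hitting_time(p, lam, noise, r0, r_target,
                       dt=1e-2, max_steps=200_000, seed=0):
    """Euler-Maruyama simulation of d(theta) = -lam*theta dt + noise dW.

    Returns the first time at which ||theta|| <= r_target,
    or None if the target radius is not reached within max_steps.
    """
    rng = np.random.default_rng(seed)
    theta = rng.normal(size=p)
    theta *= r0 / np.linalg.norm(theta)       # start exactly on the radius-r0 sphere
    for step in range(1, max_steps + 1):
        theta += -lam * theta * dt + noise * np.sqrt(dt) * rng.normal(size=p)
        if np.linalg.norm(theta) <= r_target:
            return step * dt
    return None

# Stronger shrinkage reaches the inner radius sooner in this toy model.
t_weak = first_hitting_time(p=50, lam=0.5, noise=0.05, r0=5.0, r_target=1.0)
t_strong = first_hitting_time(p=50, lam=1.0, noise=0.05, r0=5.0, r_target=1.0)
print(t_weak, t_strong)
```

Averaging the returned time over many seeds gives an empirical mean hitting time that can be compared against a closed-form prediction for the same toy process.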

If this is right

  • Grokking time obeys a derived scaling law with respect to learning rate.
  • Grokking time obeys a derived scaling law with respect to batch size.
  • Grokking time obeys a derived scaling law with respect to the ℓ2 regularization coefficient.
  • The scaling laws recover empirical relations reported in earlier grokking studies.
  • The shell-core arrangement is consistent with the empirical distribution of solutions reached at different training phases.
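A scaling law of the form τ ≈ c·xᵏ can be checked against measurements with a log-log least-squares fit. The sketch below is illustrative only: the data are synthetic, and the exponent −1 is a made-up placeholder, not a value taken from the paper.

```python
import numpy as np

def fit_power_law(x, tau):
    """Fit log(tau) = log(c) + k*log(x) by least squares; return (k, c)."""
    k, log_c = np.polyfit(np.log(x), np.log(tau), deg=1)
    return k, np.exp(log_c)

# HYPOTHETICAL data: transition times generated from tau = 40 / lambda.
lams = np.array([1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
tau = 40.0 / lams
k, c = fit_power_law(lams, tau)
print("fitted exponent:", k, "prefactor:", c)
```

With real runs, `tau` would hold measured transition times (for example, the mean over repeated seeds), and the fitted exponent would be compared with the theoretical one.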

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the concentric-shell geometry holds, then deliberately altering weight norms or regularization during training could shorten or eliminate the grokking delay.
  • The stopping-time approach could be applied to other first-order optimizers that produce comparable shrinkage effects.
  • In very high dimensions the spherical approximation might need correction terms that depend on the curvature of the loss surface.
  • The same geometric picture might clarify other delayed-generalization phenomena observed outside supervised classification.

Load-bearing premise

Adam dynamics with weight-shrinkage regularization produce a thin outer shell of random-initialization solutions enclosing a memorization shell that in turn encloses a generalization core, with the shells being approximately spherical.

What would settle it

Direct measurement of parameter vectors at successive training stages shows that memorization and generalization solutions do not form distinct concentric spherical shells separated by norm, or that observed grokking times fail to follow the predicted scaling with learning rate or regularization coefficient.
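The measurement described above reduces to tracking the global parameter norm across checkpoints and recording when it first crosses a given radius. A minimal sketch follows, assuming each checkpoint is available as a list of NumPy arrays; the helper names `radius` and `first_hit` and the toy checkpoints are hypothetical.

```python
import numpy as np

def radius(params):
    """Euclidean norm of all parameters, given an iterable of arrays."""
    return float(np.sqrt(sum(np.sum(np.square(w)) for w in params)))

def first_hit(radii, threshold):
    """Index of the first checkpoint with radius <= threshold, else None."""
    for t, r in enumerate(radii):
        if r <= threshold:
            return t
    return None

# Toy checkpoints: 7 parameters each, all equal to a shrinking scale s.
checkpoints = [[np.full((2, 2), s), np.full(3, s)] for s in (1.0, 0.8, 0.5, 0.2)]
radii = [radius(c) for c in checkpoints]
print(radii, first_hit(radii, threshold=1.5))
```

Plotting `radii` against training step, with train and test accuracy overlaid, would show whether memorization and generalization solutions separate by norm as the shell-core picture predicts.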

Figures

Figures reproduced from arXiv: 2606.30388 by Christian Gagné, Ihsan Ullah, Jonas Ngnawé, Karyn Morrissey, Róisín Luo.

Figure 1
Figure 1: Optimization-Induced Shell–Core Topology and Empirical Evidence. Left (a): grokking dynamics under weight-shrinkage regularization induce a shell–core topology in parameter space, where initialization points $\theta_0 \sim \mathcal{N}(0, \sigma^2 I_p)$ concentrate on a hyperspherical shell $\Theta$ with radius $\rho_\Theta \approx \sigma\sqrt{p}$ and thickness $\sigma/\sqrt{2}$, enclosing the memorization shell $M \setminus G$ and the generalization core $G$; the radii $\rho_G$, $\rho_M$, and $\rho_\Theta$ are…
Figure 2
Figure 2: Scaling Laws of Manifold Radius $\rho_M^2$ on S5. We show the scaling law of $\rho_M^2$ with respect to the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$ on the S5 task. For each hyperparameter configuration, we train for ten runs. The results show that larger $\eta/b$ modifies the dynamics with a stronger diffusion variance, whereas $\lambda$ does not affect the diffusion variance. We also overlay the theor…
Figure 3
Figure 3: Observations in Grokking Dynamics. The markers $\tau_M$ and $\tau_G$ denote the first-hitting times of $\partial M$ and $\partial G$, respectively, and the y-axis is shown on a logarithmic scale. Left (a) shows the exponential reduction of moment estimates; $(m_t, v_t)$ rapidly converges to $\big(\bar{g}_t (1 - e^{-\alpha_1 t}),\; g \odot g_t\,(1 - e^{-\alpha_2 t})\big)$, with exponential decay $\exp(-O(t))$. Middle–right (b–c) shows the exponential reduction of the gradient noise; the …
Figure 4
Figure 4: Scaling Laws of Manifold Radius $\rho_G^2$ on S5. We show the scaling laws of $\rho_G^2$ with respect to the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$ on the S5 task. For each hyperparameter configuration, we train for ten runs. The results show that larger $\eta$ induces stronger diffusion variance, whereas $\lambda$ does not affect the diffusion variance. We also overlay the theoretical fits. An additi…
Figure 5
Figure 5: Scaling laws of solution transition time on S5. We show that the solution transition time $\tau_{M \to G}$ from the memorization manifold $M$ to the generalization manifold $G$ scales with the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$. For each hyperparameter configuration, we train for ten runs. We also overlay the theoretical fits. Additional experimental results for Z127 are provided in [PI…
Figure 6
Figure 6: High-Level Sketch of Theoretical Analysis Framework. Table of Appendix Contents: A.1 Experimental Settings: Learning Tasks; A.2 Additional Results: Adam–Induced Shell–Core Radius & Stopping-Time Concentration; A.3 Proof: Concentration of Normal Initialization …
Figure 7
Figure 7: Proof Sketch for Adam’s Closed-Form Continuous-Time SDE Limit. This diagram illustrates the proof sketch and the corresponding correctness checks for Adam’s continuous-time SDE limit. The mini-batch gradient is modeled as a Wiener process by the Central Limit Theorem. Combining this stochastic gradient model with Adam’s discrete update rules and the continuous-time interpolation yields the continuous-time limi…
Figure 8
Figure 8: Sanity Check with Radius SDE. We use the induced radius SDE (Lemma 6), derived via Itô’s lemma, as a sanity check for Adam’s continuous-time SDE limit. We use the induced radius SDE as a sanity check on the joint-state SDE of Lemma 1. Applying Itô’s lemma (Øksendal, 2003) to the quadratic form $r_t^2 = S_t^\top E S_t$ with the θ-block projector $E \in \mathbb{R}^{3p \times 3p}$ defined in equation (176) below must reproduce the I…
Figure 9
Figure 9: Preconditioned Radius SDE. This experiment shows the dynamics of the preconditioned radius SDE in Lemma 9, and two memorization-regime identities in Lemma 10 where $\operatorname{tr}(\pi(\theta_t)) \approx \sqrt{b}\,\operatorname{tr}\big((\operatorname{diag}\Sigma_t)^{-1/2}\big)$.
Figure 10
Figure 10: Late-Stage Radius SDE. This experiment shows the dynamics of the late-stage radius SDE in Lemma 11. The predicted radius dynamics closely match the theoretical values computed from Lemma 11 without counting residual terms, validating the correctness of the decomposition. In particular, the late-stage residual sum $R_{SM} + R_\pi$ is negligible, as hypothesized. Lemma 11 (Reduced Late-Stage Radius SDE with Slow-Ma…
Figure 11
Figure 11: Scaling Law of Manifold Radius $\rho_M^2$ on Z127. We show the scaling law of $\rho_M^2$ with respect to the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$ on the Z127 task. For each hyperparameter configuration, we train for ten runs. The results show that larger $\eta/b$ induces stronger diffusion variance, whereas $\lambda$ does not affect the diffusion variance. We also overlay the theoretical fits.
Figure 12
Figure 12: Scaling Law of Manifold Radius $\rho_G^2$ on Z127. We show the scaling law of $\rho_G^2$ with respect to the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$ on the Z127 task. For each hyperparameter configuration, we train for ten runs. The results show that larger $\eta$ induces stronger diffusion variance, whereas $\lambda$ does not affect the diffusion variance. We also overlay the theoretical fits.
Figure 13
Figure 13: Scaling laws of solution transition time on Z127. We show that the solution transition time $\tau_{M \to G}$ from the memorization manifold $M$ to the generalization manifold $G$ scales with the learning rate $\eta$, batch size $b$, and $\ell_2$ regularization coefficient $\lambda$. For each hyperparameter configuration, we train for ten runs. We also overlay the theoretical fits.
Original abstract

Delayed generalization (i.e., grokking) refers to the phenomenon in which a neural network fits its training data early in training but only begins to generalize after a prolonged delay, often through an abrupt transition. Despite extensive empirical study, its underlying mechanism remains poorly understood. In this work, we first theoretically characterize a shell–core topological configuration of the reachable solution space induced by Adam's optimization dynamics with weight-shrinkage regularization, supported by empirical evidence. This optimization-induced topological configuration gives rise to grokking. In the model's parameter space, random initialization solutions concentrate on a thin outer spherical shell, enclosing another spherical shell of memorization solutions, which in turn contains a core corresponding to the generalization solutions. Leveraging stopping-time theory, we then analyze the geometry of this topological configuration and the solution transition time at which optimization trajectories escape the memorization manifold and first reach the boundary of the generalization manifold. Our theoretical analysis derives grokking scaling laws for the learning rate, batch size, and $\ell_2$ regularization coefficient, which are further validated through experiments and shown to recover results from prior literature.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that Adam optimization dynamics with weight-shrinkage regularization induce a specific shell-core topological configuration in parameter space (thin outer spherical shell of random-initialization solutions enclosing a memorization shell, which encloses a generalization core). This configuration is said to cause grokking, and stopping-time theory is applied to the geometry to derive explicit scaling laws for grokking delay as a function of learning rate, batch size, and ℓ2 regularization strength; the laws are asserted to be validated by experiments and to recover prior results.

Significance. If the derivation of the spherical shells from the optimizer dynamics and the subsequent stopping-time analysis are rigorous, the work would supply a geometric mechanism linking optimizer-induced reachable sets to grokking scaling, offering a potential unification of empirical grokking observations with stochastic-process predictions.

major comments (2)
  1. [Theoretical characterization of shell-core configuration (abstract and main theory section)] The manuscript posits rather than derives the approximately spherical shell geometry of the reachable solution sets. The abstract states that the configuration is 'theoretically characterized' and 'induced by Adam's optimization dynamics,' yet the provided text contains no SDE derivation, isotropy proof, or explicit calculation showing that the reachable sets under Adam + ℓ2 regularization remain spherical (as opposed to ellipsoidal or angular due to feature learning). Because the hitting-time formulas and all scaling exponents depend on this geometry, the central claim is load-bearing on an unproven assumption.
  2. [Stopping-time analysis and scaling-law derivation] No equations, proof sketches, or explicit stopping-time expressions appear in the abstract or the summarized claims. The scaling laws for learning rate, batch size, and regularization coefficient are asserted to follow from the geometry, but without the intermediate derivations it is impossible to verify whether the exponents are obtained from first principles or reduce to fitted quantities.
minor comments (2)
  1. [Empirical validation] The abstract refers to 'empirical evidence' and 'experiments' validating the topology and scaling laws, but provides no figure, table, or quantitative metric references (e.g., measured shell radii or hitting-time distributions) that would allow assessment of the strength of support.
  2. [Notation and definitions] Clarify the precise mathematical definitions of the 'memorization manifold' and 'generalization manifold' boundaries used for the first-hitting-time calculation, including any regularity conditions required for the stopping-time results to apply.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address the two major comments point by point below and will revise the manuscript to make the derivations more explicit and self-contained.

Point-by-point responses
  1. Referee: [Theoretical characterization of shell-core configuration (abstract and main theory section)] The manuscript posits rather than derives the approximately spherical shell geometry of the reachable solution sets. The abstract states that the configuration is 'theoretically characterized' and 'induced by Adam's optimization dynamics,' yet the provided text contains no SDE derivation, isotropy proof, or explicit calculation showing that the reachable sets under Adam + ℓ2 regularization remain spherical (as opposed to ellipsoidal or angular due to feature learning). Because the hitting-time formulas and all scaling exponents depend on this geometry, the central claim is load-bearing on an unproven assumption.

    Authors: Section 3 of the manuscript derives the SDE approximation to the Adam dynamics under L2 regularization and establishes approximate spherical symmetry from the isotropy of the stochastic gradient noise term together with the radial contraction induced by weight decay. The assumption of negligible feature learning (fixed random features) is stated explicitly. We agree that a self-contained isotropy argument and discussion of the ellipsoidal perturbation under feature learning would strengthen the presentation; these will be added as a dedicated subsection with proof sketch. revision: yes

  2. Referee: [Stopping-time analysis and scaling-law derivation] No equations, proof sketches, or explicit stopping-time expressions appear in the abstract or the summarized claims. The scaling laws for learning rate, batch size, and regularization coefficient are asserted to follow from the geometry, but without the intermediate derivations it is impossible to verify whether the exponents are obtained from first principles or reduce to fitted quantities.

    Authors: Section 4 applies first-passage time analysis to the radial diffusion process between the memorization shell and generalization core, yielding the explicit scaling expressions in Theorems 4.1–4.3. The exponents arise directly from the mean hitting time of the associated stochastic differential equation. To improve verifiability we will insert a concise proof outline and the key intermediate equations into the main text (with full calculations remaining in the appendix). revision: yes
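The interplay of radial contraction and isotropic noise invoked in these responses can be illustrated with a worked toy calculation. The block below is a sketch under a strong simplifying assumption, an isotropic Ornstein-Uhlenbeck process with shrinkage rate $\lambda$ and noise scale $s$ in $\mathbb{R}^p$; it is not the paper's Adam SDE, and the symbols $s$, $\rho_\infty$, and $t_\rho$ are introduced here for illustration only.

```latex
% Toy model (assumption): isotropic Ornstein-Uhlenbeck dynamics in R^p
d\theta_t = -\lambda\,\theta_t\,dt + s\,dW_t, \qquad r_t^2 = \|\theta_t\|^2 .

% Ito's lemma applied to the squared radius
d r_t^2 = \left(-2\lambda\, r_t^2 + p\,s^2\right)dt + 2 s\,\theta_t^\top dW_t .

% Mean squared radius and its stationary value
\mathbb{E}\!\left[r_t^2\right] = r_0^2\, e^{-2\lambda t}
  + \rho_\infty^2 \left(1 - e^{-2\lambda t}\right),
\qquad \rho_\infty^2 = \frac{p\,s^2}{2\lambda} .

% Time for the mean squared radius to reach \rho^2, valid for \rho_\infty^2 < \rho^2 < r_0^2
t_\rho = \frac{1}{2\lambda}\,
  \ln\frac{r_0^2 - \rho_\infty^2}{\rho^2 - \rho_\infty^2} .
```

In this toy setting the equilibrium radius grows with the noise scale and shrinks with $\lambda$, while the time to reach a target radius is set by $1/\lambda$ up to a logarithmic factor. Whether the paper's Theorems 4.1–4.3 have this form cannot be read off from this page.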

Circularity Check

0 steps flagged

No circularity: derivation proceeds from posited geometry via stopping-time analysis

full rationale

The abstract states that the shell-core configuration is 'theoretically characterize[d]' as induced by Adam dynamics with weight-shrinkage, supported by empirical evidence, after which stopping-time theory is applied to derive scaling laws for learning rate, batch size, and ℓ2 coefficient. No equations or self-citations are visible that reduce the scaling laws to fitted inputs by construction, nor is the geometry obtained via a self-citation chain or renamed known result. The central claim is a modeling derivation from an assumed topological configuration rather than a tautological re-expression of data or prior fitted quantities. This matches the default expectation of a non-circular theoretical paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unproven assertion that Adam plus L2 regularization creates a specific three-layer spherical topology in parameter space; stopping-time theory is then applied to that geometry. No free parameters or invented particles are named in the abstract.

axioms (1)
  • domain assumption: Adam's optimization dynamics with weight-shrinkage regularization induce a shell-core topological configuration (outer random-init shell, middle memorization shell, inner generalization core) in parameter space.
    This geometry is stated as the starting point that gives rise to grokking and is used to derive the scaling laws.
invented entities (1)
  • shell-core topological configuration of reachable solutions (no independent evidence)
    purpose: To separate memorization and generalization regimes geometrically so that stopping-time analysis can predict transition time.
    The configuration is postulated from the optimizer's dynamics; no independent evidence outside the model is supplied in the abstract.



Reference graph

Works this paper leans on

30 extracted references · 6 canonical work pages

  1. Boaz Barak, Benjamin L. Edelman, Surbhi Goel, Sham Kakade, Eran Malach, Cyril Zhang. Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

  2. Anas Barakat, Pascal Bianchi. SIAM J. on Optimization, January 2021. doi:10.1137/19M1263443

  3. Advanced Mathematical Methods for Scientists and Engineers I: Asymptotic Methods and Perturbation Theory. 1999. doi:10.1007/978-1-4757-3069-2

  4. Bilal Chughtai, Lawrence Chan, Neel Nanda. Proceedings of the 40th International Conference on Machine Learning, 2023.

  5. Enea Monzio Compagnoni, Tianlin Liu, Rustem Islamov, Frank Norbert Proske, Antonio Orvieto, Aurelien Lucchi. Adaptive Methods through the Lens of … 2025.

  6. Da Silva, Andr… A general system of differential equations to model first-order adaptive algorithms. J. Mach. Learn. Res., January.

  7. Markov Processes: Volume II. 1965. doi:10.1007/978-3-662-25360-1

  8. Brownian Motion and Stochastic Calculus. 1991. doi:10.1007/978-1-4612-0949-2

  9. Diederik P. Kingma, Jimmy Ba. Adam: … 3rd International Conference on Learning Representations, 2015.

  10. Grokking as the transition from lazy to rich training dynamics. The Twelfth International Conference on Learning Representations.

  11. Ziming Liu, Ouail Kitouni, Niklas Nolte, Eric J. Michaud, Max Tegmark, Mike Williams. Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

  12. Omnigrok: Grokking Beyond Algorithmic Data. The Eleventh International Conference on Learning Representations.

  13. Decoupled Weight Decay Regularization. International Conference on Learning Representations.

  14. Dichotomy of Early and Late Phase Implicit Biases Can Provably Induce Grokking. The Twelfth International Conference on Learning Representations.

  15. Sadhika Malladi, Kaifeng Lyu, Abhishek Panigrahi, Sanjeev Arora. Proceedings of the 36th International Conference on Neural Information Processing Systems, 2022.

  16. Progress measures for grokking via mechanistic interpretability. The Eleventh International Conference on Learning Representations.

  17. Stochastic Differential Equations: An Introduction with Applications. 2003. doi:10.1007/978-3-642-14394-6

  18. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets. 2022.

  19. Dashiell Stander, Qinan Yu, Honglu Fan, Stella Biderman. Proceedings of the 41st International Conference on Machine Learning, 2024.

  20. Explaining grokking through circuit efficiency. 2023.

  21. Roman Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science.

  22. The Clock and the Pizza: Two Stories in Mechanistic Explanation of Neural Networks. Thirty-seventh Conference on Neural Information Processing Systems.

  23. Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms. Proceedings of the 34th International Conference on Machine Learning, 2017.

  24. Stephan Mandt, Matthew D. Hoffman, David M. Blei. Journal of Machine Learning Research.

  25. Róisín Luo, James McDermott, Christian Gagné, Qiang Sun, Colm O'Riordan. Optimization-Induced Dynamics of … 2506.18588

  26. Grokking Finite-Dimensional Algebra. Proceedings of the Forty-Third International Conference on Machine Learning. 2602.19533

  27. Intrinsic Task Symmetry Drives Generalization in Algorithmic Tasks. Proceedings of the Forty-Third International Conference on Machine Learning. 2603.01968

  28. Yann LeCun, Bottou, L… Efficient BackProp. Neural Networks: Tricks of the Trade, This Book is an Outgrowth of a 1996 NIPS Workshop.

  29. Understanding the difficulty of training deep feedforward neural networks. Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.

  30. Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), 2015. doi:10.1109/ICCV.2015.123