pith. sign in

arxiv: 2606.01521 · v1 · pith:ODZJI73Mnew · submitted 2026-06-01 · 💻 cs.LG · stat.ML

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

Pith reviewed 2026-06-28 15:54 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords grokkinginterpolationcritically damped momentumnorm minimizationgeneralizationbiphasic optimizationoverparameterized models
0
0 comments X

The pith

GROKtimizer switches to critically damped momentum after interpolation to find low-norm solutions with quadratic speedup over gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GROKtimizer as a biphasic optimizer that first drives training loss to zero at interpolation and then applies critically damped momentum to minimize solution norm. This follows from the two-regime structure of risk minimization, where loss reduction precedes complexity reduction near the interpolation threshold. Under a local quadratic model of the post-interpolation basin, the method delivers quadratic speedup over classical gradient descent and is optimal among first-order methods. Experiments on synthetic grokking benchmarks and real datasets illustrate faster selection of generalizing solutions. The approach is reconciled with the flat-minima hypothesis by stressing the role of post-interpolation dynamics.

Core claim

GROKtimizer combines rapid convergence to interpolation with Critically Damped Momentum-based post-interpolation norm minimization, offering a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers.

What carries the argument

Critically Damped Momentum (CDM) applied after interpolation for norm minimization within a biphasic schedule.

If this is right

  • Post-interpolation dynamics determine which interpolating solution is reached and thus control generalization.
  • The biphasic schedule separates loss minimization from subsequent norm minimization.
  • Quadratic speedup holds and optimality among first-order methods is provable under the local quadratic model.
  • Reconciliation with the flat-minima hypothesis occurs by focusing on dynamics after the interpolation threshold.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit detection of the interpolation point could let other first-order methods adopt similar phase switches.
  • The quadratic speedup may serve as a benchmark for testing whether real loss surfaces near interpolation are locally quadratic.
  • The method suggests that generalization in overparameterized models depends more on post-interpolation behavior than on the path taken before interpolation.

Load-bearing premise

The post-interpolation loss surface can be modeled as a local quadratic basin whose minimum is the low-norm solution that generalizes well.

What would settle it

No quadratic speedup or generalization gain appears when GROKtimizer is run against gradient descent on a problem whose post-interpolation basin is exactly quadratic.

Figures

Figures reproduced from arXiv: 2606.01521 by Luca Muscarnera, Mihaela van der Schaar, Silas Ruhrberg Est\'evez, Yuanzhang Xiao.

Figure 1
Figure 1. Figure 1: Loss-Norm scaling laws. Test accuracy (indicated by point brightness and inline labels) reveals shifts in the classification boundary across different norms. Qualitative and quantitative differences in fitting and com￾plexity minimization Recent literature [23], [24] identifies the onset of grokking as a dynamical transition between a regime of data memorization and norm (complexity) minimiza￾tion. In part… view at source ↗
Figure 2
Figure 2. Figure 2: GROKtimizer dynamics before and after interpolation. function can be approximated as Lreg(w) ≈ Lunreg(w⋆ ) + 1 2 (w − w⋆ ) ⊤H(w − w⋆ ) + 1 2 λ∥w∥ 2 2 (2) with L(w⋆ ) being a constant, H being the Hessian matrix evaluated in w⋆ , and L(w⋆ ) = 0 in case of Mean Squared Error Loss. This assumption is motivated by the proximity to a local minimum; when the optimizer is captured in an attraction basin, the geom… view at source ↗
Figure 3
Figure 3. Figure 3: Optimal learning lies at the edge of criticality between convergence and instability. From the perspective of practical application, η of the CDM phase can be treated as a hyperparameter of our method. Alternatively, a practitioner may be interested in an efficient way to estimate λmax, compatibly with the computational intractability of the Hessian matrix in neural networks, in order to approximate the op… view at source ↗
Figure 4
Figure 4. Figure 4: MNIST sample efficiency under data scarcity. Test accuracy as the number of training examples per class increases from 10 to 50. Results averaged across n = 10 seeds. Sample efficiency in low-data regimes. We next test whether the post-interpolation norm￾minimization phase improves sample efficiency. We construct low-data MNIST classification tasks by varying the number of training exam￾ples per class from… view at source ↗
Figure 5
Figure 5. Figure 5: Language-model pretraining under data constraints. Validation-loss and parameter norm trajectories for Adam and GROKtimizer. This work studies model training in high￾dimensional, low-sample regimes, where in￾terpolation is often easy but generalization re￾mains difficult. Motivated by the two-phase structure observed in grokking, we proposed GROKtimizer, a biphasic optimization sched￾ule that separates rap… view at source ↗
Figure 6
Figure 6. Figure 6: Example training run on QM9. Training dynamics of GROKtimizer and baseline optimizers on the QM9 dataset. B.2 Detailed dynamics on the synthetic Gaussian benchmark We further analyze the GROKtimizer trajectory on the synthetic Gaussian benchmark, where the relationship between interpolation, parameter norm, and generalization can be inspected directly [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Detailed dynamics on the synthetic Gaussian benchmark. GROKtimizer initially follows a training-loss trajectory comparable to the baseline optimizers, but after interpolation it rapidly reduces parameter norm. This post-interpolation norm reduction is accompanied by lower test loss, consistent with the proposed complexity-minimization mechanism. becomes highly unstable under this setting and fails to achie… view at source ↗
Figure 8
Figure 8. Figure 8: Learning-rate comparison after interpolation. Performance on Binary Addition and Modular Addition when using an increased post-interpolation learning rate. While GROKtimizer remains stable and benefits from the second-phase schedule, Adam becomes unstable when trained with the same elevated learning rate. This shows that the observed speedup is not explained by learning rate alone, but by the phase-specifi… view at source ↗
Figure 9
Figure 9. Figure 9: Representation evolution during post-interpolation optimization. We visualize NTK￾based representations computed from the first output logit at different stages of the GROKtimizer trajectory. Immediately after interpolation, train and test samples from different classes remain partially overlapping, indicating that fitting the training data has not yet produced a representation that separates the test dist… view at source ↗
Figure 10
Figure 10. Figure 10: WeightWatcher analysis during post￾interpolation optimization. Evolution of the fit￾ted WeightWatcher exponent α on Modular Ad￾dition. During the complexity-minimization phase of GROKtimizer, α moves toward 2, consistent with the emergence of a more strongly regularized spectral structure and with the SETOL [56] theory of generalization quality. We further examine the post-interpolation dy￾namics using th… view at source ↗
Figure 11
Figure 11. Figure 11: MNIST sample-efficiency scaling. Log–log plot of test error, defined as 1 − test accuracy, against the number of training ex￾amples. We additionally analyze sample efficiency on MNIST using a log–log scaling plot of test error, defined as 1 − test accuracy, against the number of training examples. This view separates two effects: the slope captures how quickly error decreases as additional data are added,… view at source ↗
read the original abstract

A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes GROKtimizer, a biphasic optimizer consisting of rapid convergence to the interpolation threshold followed by Critically Damped Momentum (CDM) for post-interpolation norm minimization. It claims that the two-regime structure of risk minimization motivates this schedule, and that under a local quadratic model of the post-interpolation basin, GROKtimizer yields a quadratic speedup over gradient descent together with provable optimality among first-order methods. The method is evaluated on synthetic grokking benchmarks and real-world datasets, and the results are reconciled with the flat-minima hypothesis.

Significance. If the local quadratic model is shown to be accurate and the derived rates hold, the work would supply both a practical biphasic schedule for selecting generalizing interpolators and a theoretical account of why post-interpolation dynamics matter, strengthening the link between optimization trajectories and generalization in overparameterized regimes.

major comments (2)
  1. [theoretical analysis (post-interpolation phase)] Abstract and theoretical claims section: the quadratic speedup and optimality statements are obtained by solving the dynamics under an explicit local quadratic model whose minimum is identified with the low-norm generalizing solution; no a-priori bound on the Taylor remainder or empirical verification that cubic or higher terms remain negligible throughout the CDM phase is supplied, so the claimed rates do not transfer if the loss deviates from quadratic.
  2. [experimental evaluation] Empirical section: the reported experiments on synthetic and real-world data do not include direct checks (e.g., curvature monitoring or comparison against non-quadratic models) that would confirm the quadratic approximation remains valid on the evaluated benchmarks, leaving the connection between theory and observed generalization gains unverified.
minor comments (2)
  1. [method description] Notation for the CDM update rule and the biphasic switching criterion should be defined explicitly with equation numbers rather than described only in prose.
  2. [discussion] The reconciliation paragraph with the flat-minima hypothesis would benefit from a short quantitative comparison (e.g., Hessian trace or sharpness metrics) between GROKtimizer and baseline trajectories.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the assumptions underlying our theoretical analysis and the need for stronger empirical validation of the local quadratic model. We address each major comment below, indicating the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [theoretical analysis (post-interpolation phase)] Abstract and theoretical claims section: the quadratic speedup and optimality statements are obtained by solving the dynamics under an explicit local quadratic model whose minimum is identified with the low-norm generalizing solution; no a-priori bound on the Taylor remainder or empirical verification that cubic or higher terms remain negligible throughout the CDM phase is supplied, so the claimed rates do not transfer if the loss deviates from quadratic.

    Authors: The quadratic speedup and optimality results are derived explicitly under the local quadratic model of the post-interpolation basin, as stated in the abstract and theoretical section. This is a deliberate modeling choice that permits closed-form dynamics and a proof of optimality among first-order methods within the model. We agree that the rates are conditional on the approximation and do not automatically extend to losses with significant higher-order terms. A general a-priori bound on the Taylor remainder would require additional global assumptions on the loss that lie outside the local analysis. In revision we will add a clarifying paragraph in the theoretical claims section that (i) reiterates the modeling assumptions, (ii) states the conditions under which higher-order terms are expected to remain negligible, and (iii) notes the conditional nature of the claimed rates. revision: partial

  2. Referee: [experimental evaluation] Empirical section: the reported experiments on synthetic and real-world data do not include direct checks (e.g., curvature monitoring or comparison against non-quadratic models) that would confirm the quadratic approximation remains valid on the evaluated benchmarks, leaving the connection between theory and observed generalization gains unverified.

    Authors: We agree that explicit verification would strengthen the link between the local quadratic analysis and the observed generalization improvements. In the revised manuscript we will augment the experimental section with direct checks: for both the synthetic grokking benchmarks and selected real-world datasets we will report curvature diagnostics (finite-difference estimates of the Hessian trace and leading eigenvalues) sampled throughout the CDM phase. These additions will confirm that the quadratic regime is a reasonable local description on the evaluated tasks and will make the connection to the theoretical rates more transparent. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims derived conditionally from explicit quadratic model assumption

full rationale

The paper's central claims of quadratic speedup and optimality are explicitly conditioned on an upfront modeling assumption of a local quadratic post-interpolation basin, as stated in the abstract. This assumption supplies the dynamics for the closed-form analysis rather than being fitted to or redefined by the derived results. No self-citations, self-definitional structures, fitted parameters renamed as predictions, or ansatzes smuggled via prior work appear in the provided text. The derivation chain is therefore self-contained: the speedup follows mathematically from the stated quadratic model without reducing to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Ledger constructed from abstract alone; full paper may introduce additional parameters or assumptions.

axioms (2)
  • domain assumption Risk minimization exhibits a two-regime structure consisting of loss minimization followed by complexity minimization.
    Explicitly invoked to motivate the biphasic GROKtimizer schedule.
  • domain assumption The post-interpolation loss surface admits a local quadratic model whose basin minimum corresponds to the desired low-norm generalizing solution.
    Used to derive the quadratic speedup and optimality among first-order methods.
invented entities (1)
  • GROKtimizer no independent evidence
    purpose: Biphasic optimizer that reaches interpolation rapidly then applies critically damped momentum for norm minimization
    New method introduced in the paper; no independent evidence supplied in abstract.

pith-pipeline@v0.9.1-grok · 5763 in / 1437 out tokens · 25014 ms · 2026-06-28T15:54:14.264924+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

59 extracted references · 3 canonical work pages

  1. [1]

    Understanding deep learning requires rethinking generalization,

    C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals, “Understanding deep learning requires rethinking generalization,” inInternational Conference on Learning Representations, 2017

  2. [2]

    Dealing with dimensionality: The application of machine learning to multi-omics data,

    D. Feldner-Busztin et al., “Dealing with dimensionality: The application of machine learning to multi-omics data,”Bioinformatics, vol. 39, no. 2, J. Wren, Ed., Jan. 2023,ISSN: 1367-4811

  3. [3]

    Scalable and accurate deep learning with electronic health records,

    A. Rajkomar et al., “Scalable and accurate deep learning with electronic health records,”npj Digital Medicine, vol. 1, no. 1, May 2018,ISSN: 2398-6352

  4. [4]

    Sparse signals in the cross-section of returns,

    A. Chinco, A. D. Clark-Joseph, and M. Ye, “Sparse signals in the cross-section of returns,” The Journal of Finance, vol. 74, no. 1, pp. 449–492, Nov. 2018,ISSN: 1540-6261

  5. [5]

    Small data machine learning in materials science,

    P. Xu, X. Ji, M. Li, and W. Lu, “Small data machine learning in materials science,”npj Computational Materials, vol. 9, no. 1, Mar. 2023,ISSN: 2057-3960

  6. [6]

    Machine learning methods for small data challenges in molecular science,

    B. Dou et al., “Machine learning methods for small data challenges in molecular science,” Chemical Reviews, vol. 123, no. 13, pp. 8736–8780, 2023,ISSN: 1520-6890

  7. [7]

    Loss surfaces, mode connectivity, and fast ensembling of dnns,

    T. Garipov, P. Izmailov, D. Podoprikhin, D. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc., 2018

  8. [8]

    Essentially no barriers in neural network energy landscape,

    F. Draxler, K. Veschgini, M. Salmhofer, and F. Hamprecht, “Essentially no barriers in neural network energy landscape,” inInternational conference on machine learning, PMLR, 2018, pp. 1309–1318

  9. [9]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off,

    M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,”Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019,ISSN: 1091-6490

  10. [10]

    The implicit bias of gradient descent on separable data,

    D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro, “The implicit bias of gradient descent on separable data,”Journal of Machine Learning Research, vol. 19, no. 70, pp. 1–57, 2018

  11. [11]

    A simple weight decay can improve generalization,

    A. Krogh and J. Hertz, “A simple weight decay can improve generalization,” inAdvances in Neural Information Processing Systems, J. Moody, S. Hanson, and R. Lippmann, Eds., vol. 4, Morgan-Kaufmann, 1991

  12. [12]

    C. E. Rasmussen and C. K. I. Williams,Gaussian Processes for Machine Learning. The MIT Press, Nov. 2005,ISBN: 9780262256834

  13. [13]

    Statistical inference, occam’s razor, and statistical mechanics on the space of probability distributions,

    V . Balasubramanian, “Statistical inference, occam’s razor, and statistical mechanics on the space of probability distributions,”Neural computation, vol. 9, no. 2, pp. 349–368, 1997

  14. [14]

    Surprises in high-dimensional ridgeless least squares interpolation,

    T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,”Annals of statistics, vol. 50, no. 2, p. 949, 2022

  15. [15]

    Power, Y

    A. Power, Y . Burda, H. Edwards, I. Babuschkin, and V . Misra,Grokking: Generalization beyond overfitting on small algorithmic datasets, 2022

  16. [16]

    Mechanisms of grokking: Loss landscape separation and negative learning momentum,

    T. Kumar et al., “Mechanisms of grokking: Loss landscape separation and negative learning momentum,”arXiv preprint, 2023

  17. [17]

    Towards understanding grokking: An effective theory of representation learning,

    Z. Liu, O. Kitouni, N. S. Nolte, E. Michaud, M. Tegmark, and M. Williams, “Towards understanding grokking: An effective theory of representation learning,”Advances in Neural Information Processing Systems, vol. 35, pp. 34 651–34 663, 2022

  18. [18]

    Not all explanations for deep learning phenomena are equally valuable,

    A. Jeffares and M. van der Schaar, “Not all explanations for deep learning phenomena are equally valuable,”arXiv preprint arXiv:2506.23286, 2025

  19. [19]

    Flat minima,

    S. Hochreiter and J. Schmidhuber, “Flat minima,”Neural Computation, vol. 9, no. 1, pp. 1–42, Jan. 1997,ISSN: 1530-888X

  20. [20]

    Flat minima,

    S. Hochreiter and J. Schmidhuber, “Flat minima,”Neural computation, vol. 9, no. 1, pp. 1–42, 1997

  21. [21]

    Neural tangent kernel: Convergence and generalization in neural networks,

    A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural networks,”Advances in neural information processing systems, vol. 31, 2018

  22. [22]

    Benign overfitting in linear regression,

    P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, “Benign overfitting in linear regression,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 063–30 070, 2020

  23. [23]

    Musat,The geometry of grokking: Norm minimization on the zero-loss manifold, 2025

    T. Musat,The geometry of grokking: Norm minimization on the zero-loss manifold, 2025. 10

  24. [24]

    A theoretical framework for grokking: Interpola- tion followed by riemannian norm minimisation,

    E. Boursier, S. Pesme, and R.-A. Dragomir, “A theoretical framework for grokking: Interpola- tion followed by riemannian norm minimisation,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  25. [25]

    Wide neural networks of any depth evolve as linear models under gradient descent,

    J. Lee et al., “Wide neural networks of any depth evolve as linear models under gradient descent,”Advances in neural information processing systems, vol. 32, 2019

  26. [26]

    A convergence theory for deep learning via over- parameterization,

    Z. Allen-Zhu, Y . Li, and Z. Song, “A convergence theory for deep learning via over- parameterization,” inInternational conference on machine learning, PMLR, 2019, pp. 242– 252

  27. [27]

    Loss surfaces, mode connectivity, and fast ensembling of dnns,

    T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov, and A. G. Wilson, “Loss surfaces, mode connectivity, and fast ensembling of dnns,”Advances in neural information processing systems, vol. 31, 2018

  28. [28]

    Loss surface simplexes for mode connecting volumes and fast ensembling,

    G. Benton, W. Maddox, S. Lotfi, and A. G. G. Wilson, “Loss surface simplexes for mode connecting volumes and fast ensembling,” inInternational Conference on Machine Learning, PMLR, 2021, pp. 769–779

  29. [29]

    A method for solving the convex programming problem with convergence rate O(1/k2),

    Y . Nesterov, “A method for solving the convex programming problem with convergence rate O(1/k2),”Proceedings of the USSR Academy of Sciences, vol. 269, pp. 543–547, 1983

  30. [30]

    Nesterov,Introductory lectures on convex optimization: A basic course

    Y . Nesterov,Introductory lectures on convex optimization: A basic course. Springer Science & Business Media, 2013, vol. 87

  31. [31]

    Some methods of speeding up the convergence of iteration methods,

    B. Polyak, “Some methods of speeding up the convergence of iteration methods,”USSR Computational Mathematics and Mathematical Physics, vol. 4, no. 5, pp. 1–17, Jan. 1964, ISSN: 0041-5553

  32. [32]

    On the importance of initialization and momentum in deep learning,

    I. Sutskever, J. Martens, G. Dahl, and G. Hinton, “On the importance of initialization and momentum in deep learning,” inProceedings of the 30th International Conference on Ma- chine Learning, S. Dasgupta and D. McAllester, Eds., ser. Proceedings of Machine Learning Research, vol. 28, Atlanta, Georgia, USA: PMLR, 17–19 Jun 2013, pp. 1139–1147

  33. [33]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” inInternational Conference on Learning Representations (ICLR), 2015

  34. [34]

    Decoupled weight decay regularization,

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Confer- ence on Learning Representations, 2017

  35. [35]

    Ogata,Modern Control Engineering, 4th

    K. Ogata,Modern Control Engineering, 4th. USA: Prentice Hall PTR, 2001,ISBN: 0130609072

  36. [36]

    The implicit bias of gradient descent on separable data,

    D. Soudry, E. Hoffer, and N. Srebro, “The implicit bias of gradient descent on separable data,” inInternational Conference on Learning Representations, 2018

  37. [37]

    Implicit bias of gradient descent on linear convolutional networks,

    S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro, “Implicit bias of gradient descent on linear convolutional networks,” inAdvances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc., 2018

  38. [38]

    Neyshabur,Implicit regularization in deep learning, 2017

    B. Neyshabur,Implicit regularization in deep learning, 2017

  39. [39]

    Grokking as the transition from lazy to rich training dynamics,

    T. Kumar, B. Bordelon, S. J. Gershman, and C. Pehlevan, “Grokking as the transition from lazy to rich training dynamics,” inThe Twelfth International Conference on Learning Repre- sentations, 2024

  40. [40]

    Varma, R

    V . Varma, R. Shah, Z. Kenton, J. Kramár, and R. Kumar,Explaining grokking through circuit efficiency, 2023

  41. [41]

    Thilak, E

    V . Thilak, E. Littwin, S. Zhai, O. Saremi, R. Paiss, and J. Susskind,The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon, 2022

  42. [42]

    Progress measures for grokking via mechanistic interpretability,

    N. Nanda, L. Chan, T. Lieberum, J. Smith, and J. Steinhardt, “Progress measures for grokking via mechanistic interpretability,” inThe Eleventh International Conference on Learning Representations, 2023

  43. [43]

    J. Lee, B. G. Kang, K. Kim, and K. M. Lee,Grokfast: Accelerated grokking by amplifying slow gradients, 2024

  44. [44]

    X. Zhou, S. Fan, M. Jaggi, and J. Fu,Neuralgrok: Accelerate grokking by neural gradient transformation, 2025

  45. [45]

    Tveit, B

    A. Tveit, B. Remseth, and A. Skogvold,Muon optimizer accelerates grokking, 2025

  46. [46]

    Acceleration of grokking in learning arithmetic operations via kolmogorov–arnold representation,

    Y . Park, M. Kim, and Y . Kim, “Acceleration of grokking in learning arithmetic operations via kolmogorov–arnold representation,”Neurocomputing, vol. 640, p. 130 347, Aug. 2025,ISSN: 0925-2312. 11

  47. [47]

    Grokking in the wild: Data augmentation for real-world multi-hop reasoning with transformers,

    R. Abramov, F. Steinbauer, and G. Kasneci, “Grokking in the wild: Data augmentation for real-world multi-hop reasoning with transformers,” inForty-second International Conference on Machine Learning, 2025

  48. [48]

    When do flat minima optimizers work?

    J. Kaddour, L. Liu, R. Silva, and M. J. Kusner, “When do flat minima optimizers work?” Advances in Neural Information Processing Systems, vol. 35, pp. 16 577–16 595, 2022

  49. [49]

    Sharp minima can generalize for deep nets,

    L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio, “Sharp minima can generalize for deep nets,” inInternational Conference on Machine Learning, PMLR, 2017, pp. 1019–1028

  50. [50]

    J. M. Lee,Riemannian manifolds: an introduction to curvature. Springer Science & Business Media, 2006

  51. [51]

    Adaptive subgradient methods for online learning and stochastic optimization,

    J. Duchi, E. Hazan, and Y . Singer, “Adaptive subgradient methods for online learning and stochastic optimization,”Journal of Machine Learning Research, vol. 12, no. 61, pp. 2121– 2159, 2011

  52. [52]

    Dropout: A simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,”Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014

  53. [53]

    Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,

    R. Caruana, S. Lawrence, and C. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” inAdvances in Neural Information Processing Systems, T. Leen, T. Dietterich, and V . Tresp, Eds., vol. 13, MIT Press, 2000

  54. [54]

    On early stopping in gradient descent learning,

    Y . Yao, L. Rosasco, and A. Caponnetto, “On early stopping in gradient descent learning,” Constructive Approximation, vol. 26, no. 2, pp. 289–315, Apr. 2007,ISSN: 1432-0940

  55. [55]

    Sharp minima can generalize for deep nets,

    L. Dinh, R. Pascanu, S. Bengio, and Y . Bengio, “Sharp minima can generalize for deep nets,” inProceedings of the 34th International Conference on Machine Learning - Volume 70, ser. ICML’17, Sydney, NSW, Australia: JMLR.org, 2017, pp. 1019–1028

  56. [56]

    Martin, Christopher Hinrichs,SETOL: A Semi-Empirical Theory of (Deep) Learning

    C. H. Martin and C. Hinrichs, “Setol: A semi-empirical theory of (deep) learning,”arXiv preprint arXiv:2507.17912, 2025

  57. [57]

    Prakash, Charles H

    H. K. Prakash and C. H. Martin, “Grokking and generalization collapse: Insights from htsr theory,”arXiv preprint arXiv:2506.04434, 2025

  58. [58]

    Comprehensive molecular characterization of clear cell renal cell carcinoma,

    T. R. Network, “Comprehensive molecular characterization of clear cell renal cell carcinoma,” Nature, vol. 499, no. 7456, pp. 43–49, 2013

  59. [59]

    Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,

    T. R. Golub et al., “Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring,”science, vol. 286, no. 5439, pp. 531–537, 1999. 12 SUPPLEMENTARY INFORMATION A Extended Related Work Optimizer families in deep learning.Deep-learning optimization is dominated by several families of first-order methods. Stochastic ...