pith. machine review for the scientific record.

arxiv: 2601.03162 · v2 · submitted 2026-01-06 · 💻 cs.LG

Recognition: 2 theorem links


On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords: preconditioned gradient descent · spectral bias · grokking · NTK regime · rich learning regime · neural network optimization · lazy regime

The pith

Preconditioned gradient descent mitigates spectral bias to enable uniform parameter exploration in the NTK regime and accelerate transition to the rich learning regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how preconditioned gradient descent methods such as Gauss-Newton alter the learning dynamics of neural networks by reducing spectral bias, the tendency to fit low frequencies before high ones. It builds on the idea that grokking occurs as training moves from the lazy NTK regime to a feature-rich regime and conjectures that removing spectral bias lets PGD explore the full parameter space uniformly while still in the NTK phase. If this holds, grokking becomes a controllable transition rather than an unavoidable delay, and optimizer choice alone can shorten the time to rich behavior without changing the network architecture. This matters for tasks that need rapid capture of fine details, where standard gradient descent's bias slows progress.

Core claim

The paper claims that preconditioned gradient descent mitigates spectral bias, allowing uniform exploration of the parameter space during the lazy NTK regime. Building on the hypothesis that grokking marks the transition from the lazy NTK regime to the feature-rich regime, its theoretical and experimental results indicate that PGD shortens grokking delays by removing the bias toward low frequencies that otherwise impedes even exploration.

What carries the argument

Preconditioned gradient descent (PGD), such as Gauss-Newton, counters the spectral bias of ordinary gradient descent and promotes uniform parameter-space exploration inside the NTK regime.
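For concreteness, here is a minimal sketch, not drawn from the paper's code, of the damped Gauss-Newton (Levenberg-Marquardt) update on a tiny one-hidden-layer tanh network fitting a high-frequency 1D target. The network size, damping µ, and fixed step size are placeholders chosen for illustration; only the update direction (JᵀJ + µI)⁻¹Jᵀr is the standard LM form.

# Illustrative sketch only: one damped Gauss-Newton / Levenberg-Marquardt loop
# for a one-hidden-layer tanh network on 1D regression. Hyperparameters are
# placeholders, not values taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, h = 100, 16                       # samples, hidden width
x = np.linspace(-1, 1, n)[:, None]
y = np.sin(8 * np.pi * x[:, 0])      # high-frequency target

W1, b1 = rng.standard_normal((1, h)), np.zeros(h)
w2, b2 = rng.standard_normal(h), 0.0

def forward(x):
    z = x @ W1 + b1                  # (n, h) pre-activations
    a = np.tanh(z)
    return a @ w2 + b2, z, a

def jacobian(x):
    # Jacobian of the network outputs w.r.t. all parameters, shape (n, p).
    _, z, a = forward(x)
    s = 1.0 - np.tanh(z) ** 2        # tanh'
    dW1 = (s * w2) * x               # d out / d W1[0, j]
    db1 = s * w2                     # d out / d b1[j]
    dw2 = a                          # d out / d w2[j]
    db2 = np.ones((x.shape[0], 1))   # d out / d b2
    return np.hstack([dW1, db1, dw2, db2])

mu, lr = 1e-3, 1e-1                  # damping and fixed step size (illustrative)
for step in range(200):
    pred, _, _ = forward(x)
    r = pred - y                     # residuals
    J = jacobian(x)
    # Preconditioned direction: (J^T J + mu I)^{-1} J^T r  (damped Gauss-Newton)
    d = np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ r)
    W1 -= lr * d[:h].reshape(1, h)
    b1 -= lr * d[h:2 * h]
    w2 -= lr * d[2 * h:3 * h]
    b2 -= lr * d[-1]

print("final MSE:", np.mean((forward(x)[0] - y) ** 2))

Plain gradient descent would instead step along the raw gradient Jᵀr and inherit the ill-conditioning of the NTK, which is the contrast the mode-wise FFT error plots (Figures 3 and 9) are tracking.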

If this is right

  • Grokking delays shrink when spectral bias is removed, turning the phenomenon into a shorter transitional phase (one way to quantify the delay is sketched after this list).
  • Tasks that require high-frequency or fine-scale structures converge faster under PGD than under standard gradient descent.
  • The NTK regime can support uniform exploration once the optimizer is preconditioned, rather than being inherently biased.
  • Optimizer choice directly controls the length of the lazy-to-rich transition without altering model size or data.
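As referenced in the first bullet above, one hedged way to operationalize "grokking delay", chosen here for illustration rather than taken from the paper, is the gap between the step at which training accuracy crosses a threshold and the step at which test accuracy does.

# Illustrative only: grokking delay as the gap between threshold crossings of
# the train and test accuracy curves. The 0.95 threshold is an assumption.
def grokking_delay(train_acc, test_acc, threshold=0.95):
    def first_crossing(curve):
        for step, acc in enumerate(curve):
            if acc >= threshold:
                return step
        return None

    t_train = first_crossing(train_acc)
    t_test = first_crossing(test_acc)
    if t_train is None or t_test is None:
        return None                  # one of the curves never crossed
    return t_test - t_train

# Synthetic curves standing in for logged accuracies: train saturates early,
# test lags far behind, as in the grokking setting.
train = [min(1.0, 0.05 * s) for s in range(1000)]
test = [min(1.0, 0.002 * s) for s in range(1000)]
print(grokking_delay(train, test))   # 456 for these synthetic curves

Under this metric, the bullet's prediction is simply that the delay measured for PGD runs is much smaller than for standard GD runs at matched settings.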

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preconditioners or second-order methods may produce comparable reductions in grokking on other architectures.
  • The same mechanism could be tested on scientific modeling tasks where high-frequency content must be learned early.
  • If the uniform-exploration claim holds, it suggests a route to design optimizers that deliberately shorten the NTK phase.

Load-bearing premise

The premise that grokking arises specifically from the shift out of the NTK lazy regime and that PGD removes spectral bias without introducing new confounding dynamics.

What would settle it

An experiment in which spectral bias is demonstrably reduced by PGD yet grokking delays remain unchanged, or in which parameter-space exploration stays non-uniform throughout the NTK phase despite preconditioning.
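A minimal sketch of the measurement such an experiment would need, tracking the residual frequency by frequency as the paper's FFT figures do; the helper below and its synthetic inputs are assumptions made here for illustration, not the authors' protocol.

# Illustrative sketch: per-frequency magnitude of the residual on a uniform grid.
# Comparing these curves over training for GD vs. PGD is one way to check whether
# preconditioning actually equalizes progress across frequencies.
import numpy as np

def modewise_error(pred, target, n_modes=10):
    residual = pred - target
    spectrum = np.abs(np.fft.rfft(residual)) / len(residual)
    return spectrum[:n_modes]        # magnitudes of the first n_modes modes

# Synthetic stand-ins for model output vs. target: the low frequency is fit,
# the high frequency is not, so mode 9 carries nearly all the residual energy.
x = np.linspace(0.0, 1.0, 256, endpoint=False)
target = np.sin(2 * np.pi * x) + 0.3 * np.sin(2 * np.pi * 9 * x)
pred = np.sin(2 * np.pi * x)
print(modewise_error(pred, target))

If curves like these flatten under PGD yet grokking delays stay put, or stay skewed toward low modes despite preconditioning, the paper's central conjecture would be in trouble.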

Figures

Figures reproduced from arXiv: 2601.03162 by Alexey Voronin, Ben Southworth, Eric Cyr, Shuai Jiang.

Figure 1. MNIST grokking induced by multiplying the initialization by …
Figure 2. Top-left: With SGD, spectral bias reflects ill-conditioned NTK curvature, resulting in the trajectory from w0 to w*_NTK that bends along level sets, so progress differs across directions. Top-middle: Preconditioning (LM, µ > 0) uses curvature/Hessian information (Gauss-Newton) to rescale directions, producing a more direct path. Top-right: As µ → 0 (GN), updates nearly equalize progress across directions …
Figure 3. Mode-wise FFT error (first 10 frequencies) under SGD, LM (…)
Figure 4. PINNs training loss SGD, Adam and LM dynamics with …
Figure 5. Accuracy of the modulo task trained using SGD and LM with similar initialization. The …
Figure 6. Polynomial regression grokking induced by output scaling …
Figure 7. Levenberg-Marquardt reduces grokking but doesn’t generalize alone.
Figure 8. Classical modular addition example using a transformer. Solid lines indicate train and …
Figure 9. Plot of the residuals in the first ten frequencies of the FFT of residuals for SGD, and …
Figure 10. Plot of loss resulting from MNIST data; AdamW on the left and PGD on the right. The …
Figure 11. Grokking behavior on MNIST using cross-entropy under different optimizers. Grokking …
Figure 12. Accuracy and loss from the modular addition using transformer task where “Continue” …
Figure 13. Plots of Figures 1 and 7 with additional seeds. Top plot corresponds to Figure 1 and …
read the original abstract

Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that preconditioned gradient descent methods such as Gauss-Newton mitigate spectral bias in neural networks, thereby enabling uniform parameter-space exploration within the NTK regime and shortening the grokking delay that arises during the transition from the lazy NTK regime to the rich feature-learning regime. This is asserted via a central conjecture linking bias removal to uniform exploration, supported by theoretical arguments and empirical results that are presented as confirmation of the same conjecture.

Significance. If the central conjecture is rigorously established, the work would clarify how optimization choices interact with the lazy-to-rich transition, offering a potential route to accelerate generalization in tasks that require high-frequency features without relying on prolonged training.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (conjecture statement): the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step the attribution of observed acceleration to bias removal rather than curvature rescaling remains unverified.
  2. [§4] §4 (experimental confirmation): the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.
  3. [§2] §2 (spectral-bias premise): the assumption that spectral bias is the dominant cause of delayed feature learning in standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss landscape geometry) are negligible under the paper’s training regimes.
minor comments (2)
  1. [§3] Notation for the preconditioned kernel is introduced without an explicit equation relating it to the standard NTK; adding this relation would improve readability.
  2. [Figures] Figure captions should state the precise hyper-parameters (learning rate, preconditioner damping, network width) used in each panel to allow direct replication.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major concerns point by point below, proposing specific revisions where appropriate to strengthen the presentation and rigor of our claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (conjecture statement): the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step the attribution of observed acceleration to bias removal rather than curvature rescaling remains unverified.

    Authors: We agree that providing an explicit derivation would enhance the clarity of our argument. Although §3 presents theoretical reasoning connecting the preconditioner to the removal of spectral bias through the dynamics of the preconditioned gradient flow, we will add a dedicated subsection deriving the form of the preconditioned NTK operator and analyzing its eigenvalues to demonstrate the frequency-independent convergence rates. This will more rigorously support the attribution to bias removal. revision: yes

  2. Referee: [§4] §4 (experimental confirmation): the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.

    Authors: We acknowledge this concern regarding potential confounding factors. Our experiments maintain consistent learning rates and initializations across methods, but to better isolate the preconditioning effect, we will include new controls such as rescaling the effective step size for standard GD to match the preconditioned updates and additional baselines. These additions will provide clearer separation between the effects of preconditioning and other optimization parameters. revision: yes

  3. Referee: [§2] §2 (spectral-bias premise): the assumption that spectral bias is the dominant cause of delayed feature learning in standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss landscape geometry) are negligible under the paper’s training regimes.

    Authors: While the spectral bias phenomenon is extensively documented in prior work, we recognize the value of direct evidence in our experimental setup. In the revision, we will incorporate quantitative comparisons by varying initialization scales and examining the geometry of the loss landscape (e.g., via Hessian analysis) to confirm that spectral bias is indeed the primary driver of the observed delays in our training regimes. revision: yes

Circularity Check

1 step flagged

The conjecture that PGD removes spectral bias to enable uniform NTK-regime exploration is restated as a 'prediction' and then confirmed by experiments, without an independent derivation of the modified kernel.

specific steps
  1. fitted input called prediction [Abstract]
    "Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime."

    The text explicitly equates the conjecture with a 'prediction' whose confirmation is then offered as evidence for the same conjecture. Because the paper supplies no separate derivation of the claimed NTK modification under PGD, the experimental confirmation is statistically forced by the premise it is said to test.

full rationale

The paper states a conjecture about PGD enabling uniform parameter exploration in the NTK regime by removing spectral bias, then immediately presents experimental results as confirmation of 'this prediction' and as evidence for the grokking transition hypothesis. No derivation is provided showing how preconditioning alters the NTK operator's frequency spectrum; the confirmation loop therefore reduces the central claim to the initial assumption plus data selected to match it. This matches the fitted-input-called-prediction pattern at the level of the core narrative, though the paper remains self-contained on other empirical observations.
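For reference, a minimal sketch of the kind of derivation whose absence is flagged here, using only the SVD identity quoted in the appendix excerpt under item 23 of the reference graph below; the linearized residual update and the step-size convention are standard NTK-regime assumptions made for this sketch, not steps taken verbatim from the paper.

% Sketch under assumed linearized (NTK-regime) dynamics; notation follows the quoted excerpt.
\[
  \Delta w_t = -\eta\,(J_t^{\top} J_t + \mu I)^{-1} J_t^{\top} r_t,
  \qquad
  r_{t+1} \approx r_t + J_t\,\Delta w_t
          = \bigl(I - \eta\, J_t (J_t^{\top} J_t + \mu I)^{-1} J_t^{\top}\bigr)\, r_t .
\]
% With the SVD J_t = U_t \Lambda_t^{1/2} V_t^{\top}, as in the quoted proof of Lemma 3.2,
\[
  J_t (J_t^{\top} J_t + \mu I)^{-1} J_t^{\top}
    = U_t\, \Lambda_t (\Lambda_t + \mu I)^{-1}\, U_t^{\top},
\]
% so mode i of the residual contracts by a factor 1 - \eta \lambda_i / (\lambda_i + \mu),
% which approaches the frequency-independent factor 1 - \eta as \mu \to 0, in contrast
% to the per-mode gradient-descent factor 1 - \eta \lambda_i.

If the paper spelled out this step and then showed the measured per-mode error decay matching 1 − ηλ/(λ + µ) rather than 1 − ηλ, the experimental confirmation would no longer rest on the conjecture alone.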

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that grokking is caused by the NTK-to-rich transition and that spectral bias is the main impediment to uniform exploration; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Grokking arises as learning transitions from the NTK lazy regime to the feature-rich regime
    Explicitly stated as the hypothesis being tested in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1235 out tokens · 25453 ms · 2026-05-16T17:18:43.571450+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 3 internal anchors

  1. [1]

    When does preconditioning help or hurt generalization?

    Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, and Ji Xu. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.

  2. [2]

    Simplicity bias in transformers and their ability to learn sparse boolean functions.

    Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom. Simplicity bias in transformers and their ability to learn sparse boolean functions. arXiv preprint arXiv:2211.12316.

  3. [3]

    Exact, tractable gauss-newton optimization in deep reversible architectures reveal poor generalization.

    Davide Buffelli, Jamie McGowan, Wangkun Xu, Alexandru Cioba, Da-shan Shiu, Guillaume Hennequin, and Alberto Bernacchia. Exact, tractable gauss-newton optimization in deep reversible architectures reveal poor generalization. arXiv preprint arXiv:2411.07979.

  4. [4]

    Gram-gauss-newton method: Learning overparameterized neural networks for regression problems

    Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. Gram-gauss-newton method: Learning overparameterized neural networks for regression problems. arXiv preprint arXiv:1905.11675.

  5. [5]

    A randomised subspace gauss-newton method for nonlinear least-squares.

    Coralia Cartis, Jaroslav Fowkes, and Zhen Shao. A randomised subspace gauss-newton method for nonlinear least-squares. arXiv preprint arXiv:2211.05727.

  6. [6]

    Gauss-newton dynamics for neural networks: A riemannian optimization perspective

    Semih Cayci. Gauss-newton dynamics for neural networks: A riemannian optimization perspective. arXiv preprint arXiv:2412.14031.

  7. [7]

    Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks.

    Wenqian Chen, Amanda A Howard, and Panos Stinis. Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks. arXiv preprint arXiv:2407.01613.

  8. [8]

    On the promise of the stochastic generalized gauss-newton method for training dnns.

    Matilde Gargiani, Andrea Zanelli, Moritz Diehl, and Frank Hutter. On the promise of the stochastic generalized gauss-newton method for training dnns. arXiv preprint arXiv:2006.02409.

  9. [9]

    On the activation function dependence of the spectral bias of neural networks.

    Qingguo Hong, Jonathan W Siegel, Qinyang Tan, and Jinchao Xu. On the activation function dependence of the spectral bias of neural networks. arXiv preprint arXiv:2208.04924.

  10. [10]

    Gauss-newton natural gradient descent for physics-informed computational fluid dynamics.

    Anas Jnini, Flavio Vella, and Marius Zeinhofer. Gauss-newton natural gradient descent for physics-informed computational fluid dynamics. arXiv preprint arXiv:2402.10680.

  11. [11]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  12. [12]

    Exact gauss-newton optimization for training deep neural networks.

    Mikalai Korbit, Adeyemi D Adeoye, Alberto Bemporad, and Mario Zanon. Exact gauss-newton optimization for training deep neural networks. arXiv preprint arXiv:2405.14402.

  13. [13]

    URL https://cs231n.github.io. Stanford University, Course Notes. Accessed: 2025-06-17. Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022a. Ziming Liu, Eric J Michaud, ...

  14. [14]

    The Levenberg-Marquardt algorithm: implementation and theory

    Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28–July 1, 1977, pages 105–116. Springer.

  15. [15]

    Characterizing the spectrum of the NTK via a power series expansion.

    Michael Murray, Hui Jin, Benjamin Bowman, and Guido Montufar. Characterizing the spectrum of the NTK via a power series expansion. arXiv preprint arXiv:2211.07844.

  16. [16]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

  17. [17]

    Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks

    Yi Ren and Donald Goldfarb. Efficient subsampled gauss-newton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353.

  18. [18]

    The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon

    Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817.

  19. [19]

    Explaining grokking through circuit efficiency

    Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

  20. [20]

    Training behavior of deep neural network in frequency domain

    Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part I 26, pages 264–274. Springer.

  21. [21]

    On understanding and overcoming spectral biases of deep neural network learning methods for solving pdes.

    Zhi-Qin John Xu, Lulu Zhang, and Wei Cai. On understanding and overcoming spectral biases of deep neural network learning methods for solving pdes. arXiv preprint arXiv:2501.09987.

  22. [22]

    A rationale from frequency perspective for grokking in training neural network.

    Zhangchen Zhou, Yaoyu Zhang, and Zhi-Qin John Xu. A rationale from frequency perspective for grokking in training neural network. arXiv preprint arXiv:2405.17479.

  23. [23]

    Proof of Lemma 3.2

    A Proofs in Section 3.1. We first present the proofs for the results in Section 3.1. Proof of Lemma 3.2. Consider the SVD $J_t = U_t \Lambda_t^{1/2} V_t^{\top}$, where for notational purposes we represent the singular values in their square-root form $\Lambda_t^{1/2}$. Then by direct calculation $J_t(\mu I + J_t^{\top} J_t)^{-1} J_t^{\top} = U_t \Lambda_t^{1/2} V_t^{\top} (\mu V_t V_t^{\top} + V_t \Lambda_t V_t^{\top})^{-1} V_t \Lambda_t^{1/2} U_t^{\top} = U_t \ldots$

  24. [24]

    We chose LM with fixed learning rate (rather than dynamic line-search) to better emulate continuous time dynamics

    Table 1: Hyperparameters for regression problems. Number of Layers: 2; Hidden Dimension: 80; Kernel Initialization: Kaiming Uniform; Bias Initialization: Zeros; Activation Function: tanh; Output Dimension: 1; Learning Rate: 1×10⁻²; Batch Size (1D): 100 (full batch); Batch Size (2D): 400. B.2 PINNs hyperparameters. The usage of PGD in PINNs is not new, and on...

  25. [25]

    The remaining details are shown in Table

    Table 2: Hyperparameters for PINNs. Number of Hidden Layers: 1; Hidden Dimension: 256; Kernel Initialization: Kaiming Uniform; Bias Initialization: Zeros; Activation Function: tanh; Learning Rate: 1×10⁻³, 1×10⁻², 1×10⁻¹; Total interior points: 640; Total boundary points: 40; Batch Size: 200. B.3 Grokking hyperparameters. B.3.1 Modular Arithmetic. We follow...

  26. [26]

    The remaining hyperparameters are shown in Table

    B.3.2 Polynomial Regression. The main implementations are based on [Kumar et al., 2024, §5], which we refer the reader to for model definition and exact data generation formulation. The remaining hyperparameters are shown in Table

  27. [27]

    Interestingly, the loss values are generally lower for PGD and are clearly decreasing even as the classification error fails to noticeably change

    Table 4: Hyperparameters for Transformer Modular Addition. Hidden Dimensions: 128; Layers: 2; Heads: 4; Modulo Parameter p: 97; Training Percentage of Dataset: 50%; Learning Rate (Adam): 10⁻³; Learning Rate (LM): 1; Weight Decay (All): 0; Batch Size: 512; Conjugate Gradient: max iters 150, residual threshold 10⁻⁶; Line Search: c = 10⁻⁴, τ = 0.5, max iters 10. Table 5: Polynomial...