On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-16 17:18 UTC · model grok-4.3
The pith
Preconditioned gradient descent mitigates spectral bias to enable uniform parameter exploration in the NTK regime and accelerates the transition to the rich learning regime.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that preconditioned gradient descent (PGD) mitigates spectral bias, allowing uniform exploration of the parameter space during the lazy NTK regime. Building on the hypothesis that grokking marks the transition from the lazy NTK regime to the feature-rich regime, the theoretical and experimental results indicate that PGD shortens grokking delays by removing the low-frequency bias that otherwise impedes even exploration.
What carries the argument
Preconditioned gradient descent (PGD) methods such as Gauss-Newton, which counter the spectral bias of ordinary gradient descent and thereby promote uniform parameter-space exploration inside the NTK regime.
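To make the machinery concrete, here is a minimal NumPy sketch (not the paper's code) contrasting a plain gradient step with a damped Gauss-Newton (Levenberg-Marquardt) step for a generic least-squares model; the model f, the damping value mu, and the finite-difference Jacobian are illustrative assumptions rather than the paper's setup.

# Minimal sketch, assuming a least-squares model y ~ f(x; w) with parameter vector w.
import numpy as np

def jacobian(f, w, x, eps=1e-6):
    """Finite-difference Jacobian J[i, j] = d f(x_i; w) / d w_j."""
    base = f(x, w)
    J = np.zeros((base.size, w.size))
    for j in range(w.size):
        dw = np.zeros_like(w)
        dw[j] = eps
        J[:, j] = (f(x, w + dw) - base) / eps
    return J

def gd_step(f, w, x, y, lr=1e-2):
    r = f(x, w) - y                      # residual vector
    J = jacobian(f, w, x)
    return w - lr * (J.T @ r)            # plain GD on 0.5 * ||r||^2

def lm_step(f, w, x, y, mu=1e-3):
    r = f(x, w) - y
    J = jacobian(f, w, x)
    # Damped Gauss-Newton (Levenberg-Marquardt) preconditioning: the step
    # (J^T J + mu I)^{-1} J^T r rescales each curvature direction, pushing
    # all error modes toward a uniform decay rate instead of the
    # eigenvalue-ordered decay of plain GD.
    H = J.T @ J + mu * np.eye(w.size)
    return w - np.linalg.solve(H, J.T @ r)

Iterating lm_step with a small damping value drives every error mode toward zero at roughly the same rate, which is the uniform-exploration behavior referred to above.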
If this is right
- Grokking delays shrink when spectral bias is removed, turning the phenomenon into a shorter transitional phase.
- Tasks that require high-frequency or fine-scale structures converge faster under PGD than under standard gradient descent.
- The NTK regime can support uniform exploration once the optimizer is preconditioned, rather than being inherently biased.
- Optimizer choice directly controls the length of the lazy-to-rich transition without altering model size or data.
Where Pith is reading between the lines
- Similar preconditioners or second-order methods may produce comparable reductions in grokking on other architectures.
- The same mechanism could be tested on scientific modeling tasks where high-frequency content must be learned early.
- If the uniform-exploration claim holds, it suggests a route to design optimizers that deliberately shorten the NTK phase.
Load-bearing premise
The premise that grokking arises specifically from the shift out of the NTK lazy regime and that PGD removes spectral bias without introducing new confounding dynamics.
What would settle it
An experiment in which spectral bias is demonstrably reduced by PGD yet grokking delays remain unchanged, or in which parameter-space exploration stays non-uniform throughout the NTK phase despite preconditioning.
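For concreteness, one common way to operationalize the grokking delay in such a test (the paper is not quoted here as fixing this exact definition) is the gap, in optimization steps, between training and test accuracy first crossing a threshold; a minimal sketch, assuming accuracy curves logged once per step:

# Hypothetical measurement of grokking delay from logged accuracy curves.
import numpy as np

def grokking_delay(train_acc, test_acc, threshold=0.99):
    """Steps between train and test accuracy first exceeding `threshold`."""
    train_acc = np.asarray(train_acc)
    test_acc = np.asarray(test_acc)
    t_train = np.argmax(train_acc >= threshold)   # index of first crossing
    t_test = np.argmax(test_acc >= threshold)
    if train_acc[t_train] < threshold or test_acc[t_test] < threshold:
        return None  # one of the curves never crosses the threshold
    return int(t_test - t_train)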
Original abstract
Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that preconditioned gradient descent methods such as Gauss-Newton mitigate spectral bias in neural networks, thereby enabling uniform parameter-space exploration within the NTK regime and shortening the grokking delay that arises during the transition from the lazy NTK regime to the rich feature-learning regime. This is asserted via a central conjecture linking bias removal to uniform exploration, supported by theoretical arguments and empirical results that are presented as confirmation of the same conjecture.
Significance. If the central conjecture is rigorously established, the work would clarify how optimization choices interact with the lazy-to-rich transition, offering a potential route to accelerate generalization in tasks that require high-frequency features without relying on prolonged training.
major comments (3)
- [Abstract, §3] Conjecture statement: the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step, the attribution of the observed acceleration to bias removal rather than to curvature rescaling remains unverified.
- [§4] Experimental confirmation: the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.
- [§2] Spectral-bias premise: the assumption that spectral bias is the dominant cause of delayed feature learning under standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss-landscape geometry) are negligible under the paper's training regimes.
minor comments (2)
- [§3] Notation for the preconditioned kernel is introduced without an explicit equation relating it to the standard NTK; adding this relation would improve readability.
- [Figures] Figure captions should state the precise hyper-parameters (learning rate, preconditioner damping, network width) used in each panel to allow direct replication.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our manuscript. We address each of the major concerns point by point below, proposing specific revisions where appropriate to strengthen the presentation and rigor of our claims.
Point-by-point responses
-
Referee: [Abstract, §3] Conjecture statement: the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step, the attribution of the observed acceleration to bias removal rather than to curvature rescaling remains unverified.
Authors: We agree that providing an explicit derivation would enhance the clarity of our argument. Although §3 presents theoretical reasoning connecting the preconditioner to the removal of spectral bias through the dynamics of the preconditioned gradient flow, we will add a dedicated subsection deriving the form of the preconditioned NTK operator and analyzing its eigenvalues to demonstrate the frequency-independent convergence rates. This will more rigorously support the attribution to bias removal. revision: yes
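A sketch of what such a derivation could look like, written to match the quantities quoted from Lemmas 3.2 and 3.3 (Jacobian J, kernel eigenvalues λ_i, damping μ, with ê_i the component of the training error e along the i-th kernel eigenvector); the exact statement in the revised manuscript may of course differ.

\begin{align*}
  \text{Plain GD (NTK regime):}\quad
    \dot e = -K e = -J J^{\top} e
    \;&\Longrightarrow\;
    \frac{\partial}{\partial t}\hat e_i = -\lambda_i\,\hat e_i,\\
  \text{Damped Gauss-Newton (LM):}\quad
    \dot e = -J\,(J^{\top} J + \mu I)^{-1} J^{\top} e
    \;&\Longrightarrow\;
    \frac{\partial}{\partial t}\hat e_i = -\frac{\lambda_i}{\mu + \lambda_i}\,\hat e_i.
\end{align*}

Writing the SVD $J = U \Lambda^{1/2} V^{\top}$ gives $J (J^{\top} J + \mu I)^{-1} J^{\top} = U\,\Lambda(\mu I + \Lambda)^{-1}\,U^{\top}$, so every mode with $\lambda_i \gg \mu$ decays at a rate close to 1: the convergence rate becomes (nearly) independent of the eigenvalue ordering, which is the claimed removal of spectral bias.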
-
Referee: [§4] Experimental confirmation: the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.
Authors: We acknowledge this concern regarding potential confounding factors. Our experiments maintain consistent learning rates and initializations across methods, but to better isolate the preconditioning effect, we will include new controls, such as rescaling the effective step size for standard GD to match the preconditioned updates, as well as additional baselines. These additions will provide clearer separation between the effects of preconditioning and other optimization parameters. revision: yes
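One concrete form such a control could take (a hypothetical design, not the paper's protocol) is a norm-matched baseline: plain GD is forced to take updates of the same magnitude as the preconditioned step, so any remaining difference in grokking delay cannot be attributed to effective step size alone.

# Hypothetical norm-matched control; `grad` and `pgd_update` are assumed to be
# parameter-shaped NumPy arrays produced by the two optimizers at one step.
import numpy as np

def norm_matched_gd_update(grad, pgd_update, eps=1e-12):
    """Return a GD update whose norm equals that of the PGD update."""
    g_norm = np.linalg.norm(grad)
    target = np.linalg.norm(pgd_update)
    return -(target / (g_norm + eps)) * grad   # same direction as -grad, matched magnitude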
-
Referee: [§2] Spectral-bias premise: the assumption that spectral bias is the dominant cause of delayed feature learning under standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss-landscape geometry) are negligible under the paper's training regimes.
Authors: While the spectral bias phenomenon is extensively documented in prior work, we recognize the value of direct evidence in our experimental setup. In the revision, we will incorporate quantitative comparisons by varying initialization scales and examining the geometry of the loss landscape (e.g., via Hessian analysis) to confirm that spectral bias is indeed the primary driver of the observed delays in our training regimes. revision: yes
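A minimal sketch of one such diagnostic (an illustrative assumption, not the paper's protocol): project the training residual onto Fourier modes at each checkpoint and compare how quickly each frequency's error decays across optimizers and initialization scales. The names predict, y, and checkpoints in the usage comment are placeholders.

# Hypothetical spectral-bias diagnostic for a model evaluated on a uniform 1D grid.
import numpy as np

def per_frequency_error(residual):
    """Magnitude of each Fourier mode of the residual (lowest frequency first)."""
    return np.abs(np.fft.rfft(residual)) / residual.size

# Usage sketch: log once per checkpoint and stack into a (steps, frequencies) array.
# history = np.stack([per_frequency_error(predict(w) - y) for w in checkpoints])
# Under plain GD the low-frequency columns should shrink first (spectral bias);
# if PGD removes the bias, all columns should shrink at comparable rates.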
Circularity Check
The conjecture that PGD removes spectral bias and enables uniform NTK-regime exploration is restated as a 'prediction', then confirmed by experiments, without an independent derivation of the modified kernel.
specific steps
-
fitted input called prediction
[Abstract]
"Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime."
The text explicitly equates the conjecture with a 'prediction' whose confirmation is then offered as evidence for the same conjecture. Because the paper supplies no separate derivation of the claimed NTK modification under PGD, the experimental confirmation is largely predetermined by the premise it is said to test.
full rationale
The paper states a conjecture about PGD enabling uniform parameter exploration in the NTK regime by removing spectral bias, then immediately presents experimental results as confirmation of 'this prediction' and as evidence for the grokking transition hypothesis. No derivation is provided showing how preconditioning alters the NTK operator's frequency spectrum; the confirmation loop therefore reduces the central claim to the initial assumption plus data selected to match it. This matches the fitted-input-called-prediction pattern at the level of the core narrative, though the paper remains self-contained on other empirical observations.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: grokking arises as learning transitions from the lazy NTK regime to the feature-rich regime.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tag: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: Lemma 3.2 … ∂/∂t ê_i = −λ_i(e)/(μ+λ_i(e)) ê_i … κ_LM ≪ κ_GD
-
IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection (tag: unclear)
Unclear: the relation between the paper passage and the cited Recognition theorem is ambiguous.
Paper passage: Lemma 3.3 … ∂/∂t ê_i = −1 for λ_i(e) > ε
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.