pith. machine review for the scientific record.

arxiv: 2601.03162 · v2 · submitted 2026-01-06 · 💻 cs.LG

Recognition: 2 theorem links


On the Convergence Behavior of Preconditioned Gradient Descent Toward the Rich Learning Regime

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 17:18 UTC · model grok-4.3

classification 💻 cs.LG
keywords: preconditioned gradient descent · spectral bias · grokking · NTK regime · rich learning regime · neural network optimization · lazy regime

The pith

Preconditioned gradient descent mitigates spectral bias to enable uniform parameter exploration in the NTK regime and accelerate transition to the rich learning regime.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies how preconditioned gradient descent methods such as Gauss-Newton alter the learning dynamics of neural networks by reducing spectral bias, the tendency to fit low frequencies before high ones. It builds on the idea that grokking occurs as training moves from the lazy NTK regime to a feature-rich regime and conjectures that removing spectral bias lets PGD explore the full parameter space uniformly while still in the NTK phase. If this holds, grokking becomes a controllable transition rather than an unavoidable delay, and optimizer choice alone can shorten the time to rich behavior without changing the network architecture. This matters for tasks that need rapid capture of fine details, where standard gradient descent's bias slows progress.

Core claim

The paper claims that preconditioned gradient descent mitigates spectral bias, allowing uniform exploration of the parameter space during the lazy NTK regime. Building on the hypothesis that grokking marks the transition from the lazy NTK regime to the feature-rich regime, its theoretical and experimental results indicate that PGD shortens grokking delays by removing the bias toward low frequencies that otherwise impedes even exploration.

What carries the argument

Preconditioned gradient descent (PGD), such as Gauss-Newton, counters the spectral bias of ordinary gradient descent and promotes uniform parameter-space exploration inside the NTK regime.
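For concreteness, here is a minimal sketch, not drawn from the paper's code, of the damped Gauss-Newton (Levenberg-Marquardt) update on a tiny one-hidden-layer tanh network fitting a high-frequency 1D target. The network size, damping µ, and fixed step size are placeholders chosen for illustration; only the update direction (JᵀJ + µI)⁻¹Jᵀr is the standard LM form.

# Illustrative sketch only: one damped Gauss-Newton / Levenberg-Marquardt loop
# for a one-hidden-layer tanh network on 1D regression. Hyperparameters are
# placeholders, not values taken from the paper.
import numpy as np

rng = np.random.default_rng(0)
n, h = 100, 16                       # samples, hidden width
x = np.linspace(-1, 1, n)[:, None]
y = np.sin(8 * np.pi * x[:, 0])      # high-frequency target

W1, b1 = rng.standard_normal((1, h)), np.zeros(h)
w2, b2 = rng.standard_normal(h), 0.0

def forward(x):
    z = x @ W1 + b1                  # (n, h) pre-activations
    a = np.tanh(z)
    return a @ w2 + b2, z, a

def jacobian(x):
    # Jacobian of the network outputs w.r.t. all parameters, shape (n, p).
    _, z, a = forward(x)
    s = 1.0 - np.tanh(z) ** 2        # tanh'
    dW1 = (s * w2) * x               # d out / d W1[0, j]
    db1 = s * w2                     # d out / d b1[j]
    dw2 = a                          # d out / d w2[j]
    db2 = np.ones((x.shape[0], 1))   # d out / d b2
    return np.hstack([dW1, db1, dw2, db2])

mu, lr = 1e-3, 1e-1                  # damping and fixed step size (illustrative)
for step in range(200):
    pred, _, _ = forward(x)
    r = pred - y                     # residuals
    J = jacobian(x)
    # Preconditioned direction: (J^T J + mu I)^{-1} J^T r  (damped Gauss-Newton)
    d = np.linalg.solve(J.T @ J + mu * np.eye(J.shape[1]), J.T @ r)
    W1 -= lr * d[:h].reshape(1, h)
    b1 -= lr * d[h:2 * h]
    w2 -= lr * d[2 * h:3 * h]
    b2 -= lr * d[-1]

print("final MSE:", np.mean((forward(x)[0] - y) ** 2))

Plain gradient descent would instead step along the raw gradient Jᵀr and inherit the ill-conditioning of the NTK, which is the contrast the mode-wise FFT error plots (Figures 3 and 9) are tracking.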

If this is right

  • Grokking delays shrink when spectral bias is removed, turning the phenomenon into a shorter transitional phase (one way to quantify the delay is sketched after this list).
  • Tasks that require high-frequency or fine-scale structures converge faster under PGD than under standard gradient descent.
  • The NTK regime can support uniform exploration once the optimizer is preconditioned, rather than being inherently biased.
  • Optimizer choice directly controls the length of the lazy-to-rich transition without altering model size or data.
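As referenced in the first bullet above, one hedged way to operationalize "grokking delay", chosen here for illustration rather than taken from the paper, is the gap between the step at which training accuracy crosses a threshold and the step at which test accuracy does.

# Illustrative only: grokking delay as the gap between threshold crossings of
# the train and test accuracy curves. The 0.95 threshold is an assumption.
def grokking_delay(train_acc, test_acc, threshold=0.95):
    def first_crossing(curve):
        for step, acc in enumerate(curve):
            if acc >= threshold:
                return step
        return None

    t_train = first_crossing(train_acc)
    t_test = first_crossing(test_acc)
    if t_train is None or t_test is None:
        return None                  # one of the curves never crossed
    return t_test - t_train

# Synthetic curves standing in for logged accuracies: train saturates early,
# test lags far behind, as in the grokking setting.
train = [min(1.0, 0.05 * s) for s in range(1000)]
test = [min(1.0, 0.002 * s) for s in range(1000)]
print(grokking_delay(train, test))   # 456 for these synthetic curves

Under this metric, the bullet's prediction is simply that the delay measured for PGD runs is much smaller than for standard GD runs at matched settings.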

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar preconditioners or second-order methods may produce comparable reductions in grokking on other architectures.
  • The same mechanism could be tested on scientific modeling tasks where high-frequency content must be learned early.
  • If the uniform-exploration claim holds, it suggests a route to design optimizers that deliberately shorten the NTK phase.

Load-bearing premise

The premise that grokking arises specifically from the shift out of the NTK lazy regime and that PGD removes spectral bias without introducing new confounding dynamics.

What would settle it

An experiment in which spectral bias is demonstrably reduced by PGD yet grokking delays remain unchanged, or in which parameter-space exploration stays non-uniform throughout the NTK phase despite preconditioning.
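A minimal sketch of the measurement such an experiment would need, tracking the residual frequency by frequency as the paper's FFT figures do; the helper below and its synthetic inputs are assumptions made here for illustration, not the authors' protocol.

# Illustrative sketch: per-frequency magnitude of the residual on a uniform grid.
# Comparing these curves over training for GD vs. PGD is one way to check whether
# preconditioning actually equalizes progress across frequencies.
import numpy as np

def modewise_error(pred, target, n_modes=10):
    residual = pred - target
    spectrum = np.abs(np.fft.rfft(residual)) / len(residual)
    return spectrum[:n_modes]        # magnitudes of the first n_modes modes

# Synthetic stand-ins for model output vs. target: the low frequency is fit,
# the high frequency is not, so mode 9 carries nearly all the residual energy.
x = np.linspace(0.0, 1.0, 256, endpoint=False)
target = np.sin(2 * np.pi * x) + 0.3 * np.sin(2 * np.pi * 9 * x)
pred = np.sin(2 * np.pi * x)
print(modewise_error(pred, target))

If curves like these flatten under PGD yet grokking delays stay put, or stay skewed toward low modes despite preconditioning, the paper's central conjecture would be in trouble.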

Figures

Figures reproduced from arXiv: 2601.03162 by Alexey Voronin, Ben Southworth, Eric Cyr, Shuai Jiang.

Figure 1. MNIST grokking induced by multiplying the initialization by …
Figure 2. Top-left: With SGD, spectral bias reflects ill-conditioned NTK curvature, resulting in the trajectory from w0 to w*_NTK that bends along level sets, so progress differs across directions. Top-middle: Preconditioning (LM, µ > 0) uses curvature/Hessian information (Gauss-Newton) to rescale directions, producing a more direct path. Top-right: As µ → 0 (GN), updates nearly equalize progress across directions …
Figure 3. Mode-wise FFT error (first 10 frequencies) under SGD, LM (…)
Figure 4. PINNs training loss SGD, Adam and LM dynamics with …
Figure 5. Accuracy of the modulo task trained using SGD and LM with similar initialization. The …
Figure 6. Polynomial regression grokking induced by output scaling …
Figure 7. Levenberg-Marquardt reduces grokking but doesn’t generalize alone.
Figure 8. Classical modular addition example using a transformer. Solid lines indicate train and …
Figure 9. Plot of the residuals in the first ten frequencies of the FFT of residuals for SGD, and …
Figure 10. Plot of loss resulting from MNIST data; AdamW on the left and PGD on the right. The …
Figure 11. Grokking behavior on MNIST using cross-entropy under different optimizers. Grokking …
Figure 12. Accuracy and loss from the modular addition using transformer task where “Continue” …
Figure 13. Plots of Figures 1 and 7 with additional seeds. Top plot corresponds to Figure 1 and …
read the original abstract

Spectral bias, the tendency of neural networks to learn low frequencies first, can be both a blessing and a curse. While it enhances the generalization capabilities by suppressing high-frequency noise, it can be a limitation in scientific tasks that require capturing fine-scale structures. The delayed generalization phenomenon known as grokking is another barrier to rapid training of neural networks. Grokking has been hypothesized to arise as learning transitions from the NTK to the feature-rich regime. This paper explores the impact of preconditioned gradient descent (PGD), such as Gauss-Newton, on spectral bias and grokking phenomena. We demonstrate through theoretical and empirical results how PGD can mitigate issues associated with spectral bias. Additionally, building on the rich learning regime grokking hypothesis, we study how PGD can be used to reduce delays associated with grokking. Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime. These findings deepen our understanding of the interplay between optimization dynamics, spectral bias, and the phases of neural network learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that preconditioned gradient descent methods such as Gauss-Newton mitigate spectral bias in neural networks, thereby enabling uniform parameter-space exploration within the NTK regime and shortening the grokking delay that arises during the transition from the lazy NTK regime to the rich feature-learning regime. This is asserted via a central conjecture linking bias removal to uniform exploration, supported by theoretical arguments and empirical results that are presented as confirmation of the same conjecture.

Significance. If the central conjecture is rigorously established, the work would clarify how optimization choices interact with the lazy-to-rich transition, offering a potential route to accelerate generalization in tasks that require high-frequency features without relying on prolonged training.

major comments (3)
  1. [Abstract and §3] Abstract and §3 (conjecture statement): the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step the attribution of observed acceleration to bias removal rather than curvature rescaling remains unverified.
  2. [§4] §4 (experimental confirmation): the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.
  3. [§2] §2 (spectral-bias premise): the assumption that spectral bias is the dominant cause of delayed feature learning in standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss landscape geometry) are negligible under the paper’s training regimes.
minor comments (2)
  1. [§3] Notation for the preconditioned kernel is introduced without an explicit equation relating it to the standard NTK; adding this relation would improve readability.
  2. [Figures] Figure captions should state the precise hyper-parameters (learning rate, preconditioner damping, network width) used in each panel to allow direct replication.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript. We address each of the major concerns point by point below, proposing specific revisions where appropriate to strengthen the presentation and rigor of our claims.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (conjecture statement): the claim that PGD removes spectral bias to produce frequency-independent convergence rates inside the NTK approximation is not accompanied by a derivation of the preconditioned NTK operator or its eigenvalues; without this step the attribution of observed acceleration to bias removal rather than curvature rescaling remains unverified.

    Authors: We agree that providing an explicit derivation would enhance the clarity of our argument. Although §3 presents theoretical reasoning connecting the preconditioner to the removal of spectral bias through the dynamics of the preconditioned gradient flow, we will add a dedicated subsection deriving the form of the preconditioned NTK operator and analyzing its eigenvalues to demonstrate the frequency-independent convergence rates. This will more rigorously support the attribution to bias removal. revision: yes

  2. Referee: [§4] §4 (experimental confirmation): the reported results are described as direct confirmation of the uniform-exploration conjecture, yet no controls isolate preconditioning effects from changes in effective step size or curvature; this creates a circular loop in which the same hypothesis both generates the prediction and interprets the data.

    Authors: We acknowledge this concern regarding potential confounding factors. Our experiments maintain consistent learning rates and initializations across methods, but to better isolate the preconditioning effect, we will include new controls such as rescaling the effective step size for standard GD to match the preconditioned updates and additional baselines. These additions will provide clearer separation between the effects of preconditioning and other optimization parameters. revision: yes

  3. Referee: [§2] §2 (spectral-bias premise): the assumption that spectral bias is the dominant cause of delayed feature learning in standard GD is taken as given, but no quantitative comparison shows that other factors (e.g., initialization scale or loss landscape geometry) are negligible under the paper’s training regimes.

    Authors: While the spectral bias phenomenon is extensively documented in prior work, we recognize the value of direct evidence in our experimental setup. In the revision, we will incorporate quantitative comparisons by varying initialization scales and examining the geometry of the loss landscape (e.g., via Hessian analysis) to confirm that spectral bias is indeed the primary driver of the observed delays in our training regimes. revision: yes

Circularity Check

1 step flagged

The conjecture that PGD removes spectral bias to enable uniform NTK-regime exploration is restated as a 'prediction' and then confirmed by experiments, without an independent derivation of the modified kernel.

specific steps
  1. fitted input called prediction [Abstract]
    "Our conjecture is that PGD, without the impediment of spectral bias, enables uniform exploration of the parameter space in the NTK regime. Our experimental results confirm this prediction, providing strong evidence that grokking represents a transitional behavior between the lazy regime characterized by the NTK and the rich regime."

    The text explicitly equates the conjecture with a 'prediction' whose confirmation is then offered as evidence for the same conjecture. Because the paper supplies no separate derivation of the claimed NTK modification under PGD, the experimental confirmation is statistically forced by the premise it is said to test.

full rationale

The paper states a conjecture about PGD enabling uniform parameter exploration in the NTK regime by removing spectral bias, then immediately presents experimental results as confirmation of 'this prediction' and as evidence for the grokking transition hypothesis. No derivation is provided showing how preconditioning alters the NTK operator's frequency spectrum; the confirmation loop therefore reduces the central claim to the initial assumption plus data selected to match it. This matches the fitted-input-called-prediction pattern at the level of the core narrative, though the paper remains self-contained on other empirical observations.
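For reference, a minimal sketch of the kind of derivation whose absence is flagged here, using only the SVD identity quoted in the appendix excerpt under item 23 of the reference graph below; the linearized residual update and the step-size convention are standard NTK-regime assumptions made for this sketch, not steps taken verbatim from the paper.

% Sketch under assumed linearized (NTK-regime) dynamics; notation follows the quoted excerpt.
\[
  \Delta w_t = -\eta\,(J_t^{\top} J_t + \mu I)^{-1} J_t^{\top} r_t,
  \qquad
  r_{t+1} \approx r_t + J_t\,\Delta w_t
          = \bigl(I - \eta\, J_t (J_t^{\top} J_t + \mu I)^{-1} J_t^{\top}\bigr)\, r_t .
\]
% With the SVD J_t = U_t \Lambda_t^{1/2} V_t^{\top}, as in the quoted proof of Lemma 3.2,
\[
  J_t (J_t^{\top} J_t + \mu I)^{-1} J_t^{\top}
    = U_t\, \Lambda_t (\Lambda_t + \mu I)^{-1}\, U_t^{\top},
\]
% so mode i of the residual contracts by a factor 1 - \eta \lambda_i / (\lambda_i + \mu),
% which approaches the frequency-independent factor 1 - \eta as \mu \to 0, in contrast
% to the per-mode gradient-descent factor 1 - \eta \lambda_i.

If the paper spelled out this step and then showed the measured per-mode error decay matching 1 − ηλ/(λ + µ) rather than 1 − ηλ, the experimental confirmation would no longer rest on the conjecture alone.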

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that grokking is caused by the NTK-to-rich transition and that spectral bias is the main impediment to uniform exploration; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Grokking arises as learning transitions from the NTK lazy regime to the feature-rich regime
    Explicitly stated as the hypothesis being tested in the abstract.

pith-pipeline@v0.9.0 · 5527 in / 1235 out tokens · 25453 ms · 2026-05-16T17:18:43.571450+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 3 internal anchors

  1. [1]

    When does preconditioning help or hurt generalization?

    Shun-ichi Amari, Jimmy Ba, Roger Grosse, Xuechen Li, Atsushi Nitanda, Taiji Suzuki, Denny Wu, and Ji Xu. When does preconditioning help or hurt generalization? arXiv preprint arXiv:2006.10732.

  2. [2]

    Simplicity bias in transformers and their ability to learn sparse boolean functions.

    Satwik Bhattamishra, Arkil Patel, Varun Kanade, and Phil Blunsom. Simplicity bias in transformers and their ability to learn sparse boolean functions. arXiv preprint arXiv:2211.12316.

  3. [3]

    Exact, tractable gauss-newton optimization in deep reversible architectures reveal poor generalization.

    Davide Buffelli, Jamie McGowan, Wangkun Xu, Alexandru Cioba, Da-shan Shiu, Guillaume Hennequin, and Alberto Bernacchia. Exact, tractable gauss-newton optimization in deep reversible architectures reveal poor generalization. arXiv preprint arXiv:2411.07979.

  4. [4]

    Gram-gauss-newton method: Learning overparameterized neural networks for regression problems

    Tianle Cai, Ruiqi Gao, Jikai Hou, Siyu Chen, Dong Wang, Di He, Zhihua Zhang, and Liwei Wang. Gram-gauss-newton method: Learning overparameterized neural networks for regression problems. arXiv preprint arXiv:1905.11675.

  5. [5]

    A randomised subspace gauss-newton method for nonlinear least-squares.

    Coralia Cartis, Jaroslav Fowkes, and Zhen Shao. A randomised subspace gauss-newton method for nonlinear least-squares. arXiv preprint arXiv:2211.05727.

  6. [6]

    Gauss-newton dynamics for neural networks: A riemannian optimization perspective

    Semih Cayci. Gauss-newton dynamics for neural networks: A riemannian optimization perspective. arXiv preprint arXiv:2412.14031.

  7. [7]

    Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks.

    Wenqian Chen, Amanda A Howard, and Panos Stinis. Self-adaptive weights based on balanced residual decay rate for physics-informed neural networks and deep operator networks. arXiv preprint arXiv:2407.01613.

  8. [8]

    On the promise of the stochastic generalized gauss-newton method for training dnns.

    Matilde Gargiani, Andrea Zanelli, Moritz Diehl, and Frank Hutter. On the promise of the stochastic generalized gauss-newton method for training dnns. arXiv preprint arXiv:2006.02409.

  9. [9]

    On the activation function dependence of the spectral bias of neural networks.

    Qingguo Hong, Jonathan W Siegel, Qinyang Tan, and Jinchao Xu. On the activation function dependence of the spectral bias of neural networks. arXiv preprint arXiv:2208.04924.

  10. [10]

    Gauss-newton natural gradient descent for physics-informed computational fluid dynamics.

    Anas Jnini, Flavio Vella, and Marius Zeinhofer. Gauss-newton natural gradient descent for physics-informed computational fluid dynamics. arXiv preprint arXiv:2402.10680.

  11. [11]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

  12. [12]

    Exact gauss-newton optimization for training deep neural networks.

    Mikalai Korbit, Adeyemi D Adeoye, Alberto Bemporad, and Mario Zanon. Exact gauss-newton optimization for training deep neural networks. arXiv preprint arXiv:2405.14402.

  13. [13]

    URL https://cs231n.github.io. Stanford University, Course Notes. Accessed: 2025-06-17. Ziming Liu, Ouail Kitouni, Niklas S Nolte, Eric Michaud, Max Tegmark, and Mike Williams. Towards understanding grokking: An effective theory of representation learning. Advances in Neural Information Processing Systems, 35:34651–34663, 2022a. Ziming Liu, Eric J Michaud, ...

  14. [14]

    The Levenberg-Marquardt algorithm: implementation and theory

    Jorge J Moré. The Levenberg-Marquardt algorithm: implementation and theory. In Numerical analysis: proceedings of the biennial Conference held at Dundee, June 28–July 1, 1977, pages 105–116. Springer.

  15. [15]

    Characterizing the spectrum of the NTK via a power series expansion.

    Michael Murray, Hui Jin, Benjamin Bowman, and Guido Montufar. Characterizing the spectrum of the NTK via a power series expansion. arXiv preprint arXiv:2211.07844.

  16. [16]

    Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177.

  17. [17]

    Efficient Subsampled Gauss-Newton and Natural Gradient Methods for Training Neural Networks

    Yi Ren and Donald Goldfarb. Efficient subsampled gauss-newton and natural gradient methods for training neural networks. arXiv preprint arXiv:1906.02353.

  18. [18]

    The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon

    Vimal Thilak, Etai Littwin, Shuangfei Zhai, Omid Saremi, Roni Paiss, and Joshua Susskind. The slingshot mechanism: An empirical study of adaptive optimizers and the grokking phenomenon. arXiv preprint arXiv:2206.04817.

  19. [19]

    Explaining grokking through circuit efficiency

    Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390.

  20. [20]

    Training behavior of deep neural network in frequency domain

    Zhi-Qin John Xu, Yaoyu Zhang, and Yanyang Xiao. Training behavior of deep neural network in frequency domain. In Neural Information Processing: 26th International Conference, ICONIP 2019, Sydney, NSW, Australia, December 12–15, 2019, Proceedings, Part I 26, pages 264–274. Springer.

  21. [21]

    On understanding and overcoming spectral biases of deep neural network learning methods for solving pdes.

    Zhi-Qin John Xu, Lulu Zhang, and Wei Cai. On understanding and overcoming spectral biases of deep neural network learning methods for solving pdes. arXiv preprint arXiv:2501.09987.

  22. [22]

    A rationale from frequency perspective for grokking in training neural network.

    Zhangchen Zhou, Yaoyu Zhang, and Zhi-Qin John Xu. A rationale from frequency perspective for grokking in training neural network. arXiv preprint arXiv:2405.17479.

  23. [23]

    Proof of Lemma 3.2

    A Proofs in Section 3.1. We first present the proofs for the results in Section 3.1. Proof of Lemma 3.2. Consider the SVD $J_t = U_t \Lambda_t^{1/2} V_t^{\top}$, where for notational purposes we represent the singular values in their square-root form $\Lambda_t^{1/2}$. Then by direct calculation $J_t(\mu I + J_t^{\top} J_t)^{-1} J_t^{\top} = U_t \Lambda_t^{1/2} V_t^{\top} (\mu V_t V_t^{\top} + V_t \Lambda_t V_t^{\top})^{-1} V_t \Lambda_t^{1/2} U_t^{\top} = U_t \ldots$

  24. [24]

    We chose LM with fixed learning rate (rather than dynamic line-search) to better emulate continuous time dynamics

    Table 1: Hyperparameters for regression problems. Number of Layers: 2; Hidden Dimension: 80; Kernel Initialization: Kaiming Uniform; Bias Initialization: Zeros; Activation Function: tanh; Output Dimension: 1; Learning Rate: 1×10⁻²; Batch Size (1D): 100 (full batch); Batch Size (2D): 400. B.2 PINNs hyperparameters. The usage of PGD in PINNs is not new, and on...

  25. [25]

    The remaining details are shown in Table

    Table 2: Hyperparameters for PINNs. Number of Hidden Layers: 1; Hidden Dimension: 256; Kernel Initialization: Kaiming Uniform; Bias Initialization: Zeros; Activation Function: tanh; Learning Rate: 1×10⁻³, 1×10⁻², 1×10⁻¹; Total interior points: 640; Total boundary points: 40; Batch Size: 200. B.3 Grokking hyperparameters. B.3.1 Modular Arithmetic. We follow...

  26. [26]

    The remaining hyperparameters are shown in Table

    B.3.2 Polynomial Regression. The main implementations are based on [Kumar et al., 2024, §5], which we refer the reader to for model definition and exact data generation formulation. The remaining hyperparameters are shown in Table

  27. [27]

    Interestingly, the loss values are generally lower for PGD and are clearly decreasing even as the classification error fails to noticeably change

    Table 4: Hyperparameters for Transformer Modular Addition. Hidden Dimensions: 128; Layers: 2; Heads: 4; Modulo Parameter p: 97; Training Percentage of Dataset: 50%; Learning Rate (Adam): 10⁻³; Learning Rate (LM): 1; Weight Decay (All): 0; Batch Size: 512; Conjugate Gradient: max iters 150, residual threshold 10⁻⁶; Line Search: c = 10⁻⁴, τ = 0.5, max iters 10. Table 5: Polynomial...