Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

Justin Sirignano; Konstantinos Spiliopoulos; Konstantin Riedl

arXiv: 2605.08352 · v2 · pith:VYTKFDUTnew · submitted 2026-05-08 · 💻 cs.LG · math.PR · stat.ML

Pith reviewed 2026-05-21 07:51 UTC · model grok-4.3

classification 💻 cs.LG · math.PR · stat.ML

keywords Newton method · neural tangent kernel · overparameterized neural networks · convergence analysis · spectral bias · infinite width limit · regularized Hessian

The pith

In the infinite-width limit, regularized Newton's method trains neural networks to global minimizers with zero loss at an exponential rate uniform across frequencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a convergence analysis for the regularized Newton method applied to neural network training as the number of hidden units tends to infinity. It establishes that the finite-width dynamics converge in probability to a deterministic limit equation driven by a Newton neural tangent kernel. In this limit the network reaches the target data exponentially fast and attains zero loss, with the rate holding uniformly over the frequency spectrum when the regularization parameter is scaled to vanish appropriately with width. This stands in contrast to gradient descent, whose neural tangent kernel eigenvalues accumulate near zero and produce slow convergence on high-frequency targets.

Core claim

As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a Newton neural tangent kernel. In the infinite-width limit the NN converges exponentially fast to the target data, i.e., a global minimizer with zero loss, and this convergence is uniform across the frequency spectrum. The analysis identifies a scaling formula for the regularization parameter that vanishes at a suitable rate with growing width, under which the regularized Hessian remains positive definite during training for all sufficiently large finite widths and the Newton updates for individual parameters converge to zero, so the model behaves as a linearization around the initialization.
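
A minimal sketch of one regularized Newton step may help fix ideas. It assumes the standard damped form, in which the update solves (H + γI) d = ∇L for the Hessian H and gradient ∇L; the exact update analyzed in the paper (step size, loss normalization) is not given on this page. The function name `regularized_newton_step` and the toy numbers are hypothetical.

```python
import numpy as np

def regularized_newton_step(grad, hess, gamma, step_size=1.0):
    """Return d = -step_size * (H + gamma*I)^{-1} grad.

    Assumed standard damped-Newton form, not taken from the paper.
    Raises np.linalg.LinAlgError when H + gamma*I is not positive definite.
    """
    reg_hess = hess + gamma * np.eye(hess.shape[0])
    # Cholesky succeeds only for positive definite input, so it doubles
    # as the positive-definiteness check on the regularized Hessian.
    chol = np.linalg.cholesky(reg_hess)
    # Solve (L L^T) d = grad in two stages (generic solver used for brevity).
    y = np.linalg.solve(chol, grad)
    d = np.linalg.solve(chol.T, y)
    return -step_size * d

# Toy indefinite Hessian with one negative curvature direction.
hess = np.diag([2.0, -0.5])
grad = np.array([1.0, 1.0])
step = regularized_newton_step(grad, hess, gamma=1.0)
```

With `gamma=1.0` the regularized matrix is diag(3, 0.5) and the step is well defined; with `gamma=0.1` it is diag(2.1, -0.4) and the Cholesky factorization fails, which is the situation the scaling of the regularization parameter is meant to rule out.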

What carries the argument

The Newton neural tangent kernel (NNTK), the limiting operator that governs the regularized Newton updates and whose eigenvalues remain uniformly bounded away from zero under the derived scaling of the regularization parameter.

Load-bearing premise

There exists a scaling formula for the regularization parameter that vanishes at a suitable rate with growing width such that the regularized Hessian remains positive definite for all sufficiently large finite widths during training.

What would settle it

A numerical check, for successively larger finite widths and target functions containing high-frequency components, that the training loss decays exponentially to machine precision at a rate independent of frequency content while the smallest eigenvalue of the regularized effective kernel stays bounded below by a positive constant.
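
Two small diagnostics of the kind such a check would need can be sketched as follows. This is an illustration under stated assumptions, not the paper's experimental code: the helper names `fit_exponential_rate` and `smallest_eigenvalue` are hypothetical, and the loss curve below is synthetic.

```python
import numpy as np

def fit_exponential_rate(losses, floor=1e-14):
    """Estimate r in loss_k ~ C * exp(-r * k) by a least-squares line
    through log(loss) against the iteration index.

    Values at or below `floor` are dropped so that a curve that has
    already hit machine precision does not distort the fit.
    """
    losses = np.asarray(losses, dtype=float)
    steps = np.arange(len(losses))
    keep = losses > floor
    slope, _intercept = np.polyfit(steps[keep], np.log(losses[keep]), 1)
    return -slope

def smallest_eigenvalue(sym_matrix):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(sym_matrix)[0])

# Synthetic loss curve with a known decay rate of 0.5 per iteration.
synthetic_losses = np.exp(-0.5 * np.arange(20))
rate = fit_exponential_rate(synthetic_losses)
```

In an actual experiment one would compare the fitted rate across targets of increasing frequency content and across widths, and track `smallest_eigenvalue` of the regularized kernel over the same runs.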

Figures

Figures reproduced from arXiv: 2605.08352 by Justin Sirignano, Konstantinos Spiliopoulos, Konstantin Riedl.

**Figure 2.** Empirical validation of Lemma 9 and Theorem 7 for a shallow …

**Figure 3.** The Jacobian $J^N_\theta \in \mathbb{R}^{M \times N(d+2)}$ and its different slicings. Hessian of the loss. The Hessian $\nabla^2_\theta L^N(\theta)$ of the loss $L^N(\theta)$ from (2) is given by $\nabla^2_\theta L^N(\theta) = \frac{1}{M} \sum_{m=1}^{M}$ …

Original abstract

A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a "Newton neural tangent kernel" (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.
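
The spectral-bias mechanism described in the abstract can be illustrated with a toy calculation. The sketch assumes linearized continuous-time dynamics in which the residual along an eigenvector with eigenvalue λ decays like exp(-λt). The spectrum 1/k² is hypothetical, and the damped kernel K(K + γI)^{-1} is a Gauss-Newton style surrogate chosen for illustration; it is not the paper's NNTK, whose form is not given on this page.

```python
import numpy as np

def time_to_tolerance(rates, tol=1e-6):
    """Time t at which exp(-rate * t) first reaches tol, for each mode."""
    return np.log(1.0 / tol) / rates

# Hypothetical kernel spectrum accumulating at zero: mode k stands in
# for the k-th frequency, with eigenvalue 1 / k^2 (an assumption).
modes = np.arange(1, 6)
ntk_eigs = 1.0 / modes**2

# Surrogate damped kernel K (K + gamma I)^{-1}: eigenvalues
# lambda / (lambda + gamma), close to 1 whenever gamma << lambda.
gamma = 1e-3
damped_eigs = ntk_eigs / (ntk_eigs + gamma)

gd_times = time_to_tolerance(ntk_eigs)        # grows like k^2
damped_times = time_to_tolerance(damped_eigs)  # nearly flat in k
```

In this toy setting the fifth mode takes 25 times longer than the first under the gradient-descent kernel, while under the damped surrogate the two times differ by under 3 percent. Note the role of γ: the surrogate's eigenvalues stay bounded below only for modes with λ well above γ, which is one way to see why the choice of regularization matters.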

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Newton's method reaches zero loss exponentially fast in the infinite-width limit with uniform rates across frequencies via a new NNTK, but the regularization must control negative curvature along the full trajectory.

The main thing here is that regularized Newton's method for overparameterized networks converges exponentially to zero loss in the infinite-width limit, and the rate stays uniform over frequencies because the Newton neural tangent kernel has eigenvalues bounded away from zero. This is the direct contrast they draw with gradient descent, whose NTK eigenvalues accumulate at zero and produce spectral bias on high-frequency targets.

Referee Report

2 major / 2 minor

Summary. The manuscript develops a convergence analysis for the regularized Newton method applied to overparameterized neural networks. As width tends to infinity, finite-width training dynamics converge in probability to a deterministic limit equation involving a Newton neural tangent kernel (NNTK). Explicit rates are derived for this convergence, and in the infinite-width limit the network is shown to converge exponentially to a global minimizer with zero loss. The convergence is claimed to be uniform across frequencies, in contrast to the spectral bias of gradient descent whose NTK eigenvalues accumulate at zero. The analysis identifies a scaling for the regularization parameter that vanishes with width while ensuring the regularized Hessian remains positive definite for all sufficiently large finite widths throughout training, allowing the model to behave as a linearization around initialization.

Significance. If the central claims hold, the work supplies a rigorous justification for the faster and spectrally uniform convergence of Newton-type methods relative to gradient descent in the overparameterized regime. The introduction of the NNTK, the explicit rates, and the uniform lower bound on its eigenvalues constitute a clear theoretical advance. The manuscript also ships a concrete scaling formula for regularization that vanishes with width, which is a strength when the accompanying positive-definiteness proof is complete.

major comments (2)

[§4] §4 (or the section deriving the infinite-width limit): the argument that the regularized Hessian remains positive definite for all sufficiently large widths along the entire Newton trajectory relies on a scaling formula identified inside the analysis. The bound on the most negative eigenvalue of the unregularized Hessian must be shown to be uniform over the path, not merely at initialization; otherwise the vanishing regularization term may fail to dominate for some finite widths.
[limit derivation section] The treatment of the infinite-dimensional linear system arising from the Newton update: the passage from the finite-width regularized dynamics to the deterministic NNTK limit equation requires controlling the implicit parameter updates. The current sketch does not yet make explicit how the convergence in probability is obtained when the system dimension grows with width.
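
The first major comment can be made concrete with a small check. The sketch below is an illustration only: `worst_case_margin` is a hypothetical helper, and the two diagonal matrices are made-up stand-ins for Hessians at initialization and at a later point on the trajectory.

```python
import numpy as np

def worst_case_margin(hessians, gamma):
    """Smallest eigenvalue of H_k + gamma*I, minimized over the path.

    A positive return value means every regularized Hessian in the
    sequence is positive definite.
    """
    return min(float(np.linalg.eigvalsh(h)[0]) + gamma for h in hessians)

# Made-up path: negative curvature grows from -0.01 to -0.05.
hess_init = np.diag([1.0, -0.01])
hess_later = np.diag([1.0, -0.05])
gamma = 0.02

margin_at_init = worst_case_margin([hess_init], gamma)
margin_on_path = worst_case_margin([hess_init, hess_later], gamma)
```

Here the margin is positive at initialization (0.01) but negative over the path (-0.03), so a bound checked only at initialization would not certify positive definiteness throughout training.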

minor comments (2)

Notation for the NNTK and its eigenvalues should be introduced with a clear comparison table to the standard NTK to aid readability.
A short remark on how the post-hoc scaling formula can be computed in practice (or bounded without knowledge of the full trajectory) would strengthen the presentation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We respond to each major comment point by point and indicate where revisions will be made to the manuscript.

Referee: §4 (or the section deriving the infinite-width limit): the argument that the regularized Hessian remains positive definite for all sufficiently large widths along the entire Newton trajectory relies on a scaling formula identified inside the analysis. The bound on the most negative eigenvalue of the unregularized Hessian must be shown to be uniform over the path, not merely at initialization; otherwise the vanishing regularization term may fail to dominate for some finite widths.

Authors: We agree that a uniform bound on the most negative eigenvalue of the unregularized Hessian along the full trajectory is necessary to guarantee the regularization dominates for all large finite widths. The manuscript establishes positive definiteness for sufficiently large widths and shows that Newton updates converge to zero, which keeps parameters close to initialization. To close this gap rigorously, we will insert a new lemma bounding the eigenvalue deviation from initialization values via the derived exponential convergence rate. The revised §4 will then verify that the chosen scaling of the regularization parameter works uniformly over the path.
Revision: yes
Referee: The treatment of the infinite-dimensional linear system arising from the Newton update: the passage from the finite-width regularized dynamics to the deterministic NNTK limit equation requires controlling the implicit parameter updates. The current sketch does not yet make explicit how the convergence in probability is obtained when the system dimension grows with width.

Authors: We thank the referee for highlighting the need for greater explicitness in the limit derivation. The current argument relies on concentration inequalities for the empirical NNTK and its derivatives together with the conditioning provided by regularization to obtain convergence in probability of the finite-width Newton step to the deterministic limit. We will expand the relevant section with a step-by-step outline that isolates the operator-norm difference between the finite- and infinite-width linear systems, controls the implicit parameter updates via the vanishing regularization, and invokes standard results on convergence of random operators in growing dimension. This will render the passage to the NNTK limit fully rigorous.
Revision: yes
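
The kind of operator-norm concentration the simulated response refers to can be illustrated on a stand-in kernel. The sketch uses random Fourier features, whose infinite-width limit is the Gaussian kernel exp(-(x - y)²/2) in closed form. This is an assumption made for illustration: it is not the empirical NNTK, and the function names are hypothetical.

```python
import numpy as np

def empirical_kernel(x, width, rng):
    """Random-feature Gram matrix (1/N) * Phi Phi^T with N = width."""
    w = rng.standard_normal(width)              # frequencies ~ N(0, 1)
    b = rng.uniform(0.0, 2.0 * np.pi, width)    # phases ~ U[0, 2*pi]
    phi = np.sqrt(2.0) * np.cos(np.outer(x, w) + b)  # shape (M, N)
    return phi @ phi.T / width

def limit_kernel(x):
    """Infinite-width limit of the features above: a Gaussian kernel."""
    diff = x[:, None] - x[None, :]
    return np.exp(-0.5 * diff**2)

def operator_norm_error(x, width, rng):
    """Spectral norm of the gap between finite-width and limit kernels."""
    gap = empirical_kernel(x, width, rng) - limit_kernel(x)
    return np.linalg.norm(gap, ord=2)

rng = np.random.default_rng(0)
x = np.linspace(-1.0, 1.0, 8)   # M = 8 data points
errors = {n: operator_norm_error(x, n, rng) for n in (100, 10_000)}
```

For a fixed number of data points the gap is expected to shrink roughly like N^{-1/2}. The harder part flagged by the referee, where the dimension of the linear system itself grows with width, is not captured by this fixed-size toy.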

Circularity Check

0 steps flagged

No circularity: limit derivation and regularization scaling are proven, not assumed by construction

The paper derives the infinite-width limit equation from finite-width regularized Newton dynamics and supplies an explicit scaling for the regularization parameter. It then proves that this scaling ensures the regularized Hessian remains positive definite for all sufficiently large finite widths throughout the trajectory, yielding exponential convergence uniform in frequency. No step reduces to a self-definition, a fitted input relabeled as prediction, or a load-bearing self-citation; the positive-definiteness claim is established by direct analysis rather than imposed to force the result. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on the infinite-width limit, a specific regularization scaling chosen to guarantee positive definiteness, and standard probabilistic convergence arguments; the NNTK is introduced as the limiting object rather than postulated independently.

free parameters (1)

regularization parameter scaling
A scaling formula is identified inside the analysis that lets the parameter vanish at a suitable rate with network width while keeping the regularized Hessian positive definite.
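
The scaling quoted elsewhere on this page, γ_N = γ / N^{2β−1}, can be tabulated directly. The sketch below assumes nothing beyond that formula; the values γ = 1 and β = 0.75 are hypothetical, and the meaning and admissible range of β are not given on this page. From the formula alone, γ_N vanishes as N grows exactly when β > 1/2.

```python
def regularization_parameter(width, gamma, beta):
    """gamma_N = gamma / N**(2*beta - 1), with N = width."""
    return gamma / width ** (2.0 * beta - 1.0)

# Hypothetical constants; beta = 0.75 gives gamma_N = 1 / sqrt(N).
for width in (10, 100, 1000, 10000):
    print(width, regularization_parameter(width, gamma=1.0, beta=0.75))
```

With these illustrative constants the regularization decays like N^{-1/2}, for example to 0.1 at width 100 and 0.01 at width 10000.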

axioms (2)

[domain assumption] Finite-width Newton dynamics converge in probability to a deterministic limit equation as hidden-unit count tends to infinity.
Invoked to obtain the NNTK description of the training trajectory.
[ad hoc to paper] The regularized Hessian remains positive definite for all sufficiently large finite widths under the identified scaling.
This is proved in the paper but functions as a load-bearing premise for the method to be well-defined throughout training.

invented entities (1)

Newton neural tangent kernel (NNTK) [no independent evidence]
purpose: Characterizes the deterministic limit dynamics of the regularized Newton updates.
Defined via the infinite-width limit; no independent external evidence or falsifiable prediction outside the derivation is provided.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

regularizer $\gamma_N$ scaling as $\gamma_N = \gamma / N^{2\beta - 1}$ ... prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 1 internal anchor

[1]

A. D. Adeoye, P. C. Petersen, and A. Bemporad. Regularized Gauss-Newton for optimizing overparameterized neural networks.arXiv preprint arXiv:2404.14875, 2024

work page arXiv 2024
[2]

R. Anil, V . Gupta, T. Koren, K. Regan, and Y . Singer. Scalable second order optimization for deep learning.arXiv preprint arXiv:2002.09018, 2020

work page arXiv 2002
[3]

Arbel, R

M. Arbel, R. Menegaux, and P. Wolinski. Rethinking Gauss-Newton for learning over- parameterized models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Confer- ence on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...

work page 2023
[4]

Q. Bai, S. Rosenberg, and W. Xu. Generalized tangent kernel: A unified geometric foundation for natural gradient and standard gradient.Trans. Mach. Learn. Res., 2025, 2025

work page 2025
[5]

Bietti and J

A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages ...

work page 2019
[6]

Bonfanti, G

A. Bonfanti, G. Bruno, and C. Cipriani. The challenges of the nonlinear regime for physics- informed neural networks. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouve...

work page 2024
[7]

Boyd and L

S. Boyd and L. Vandenberghe.Convex optimization. Cambridge University Press, Cambridge, 2004

work page 2004
[8]

T. Cai, R. Gao, J. Hou, S. Chen, D. Wang, D. He, Z. Zhang, and L. Wang. Gram-Gauss-Newton method: Learning overparameterized neural networks for regression problems.arXiv preprint arXiv:1905.11675, 2019

work page arXiv 1905
[9]

Y . Cao, Z. Fang, Y . Wu, D. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning. In Z. Zhou, editor,Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 2205–2211, 2021

work page 2021
[10]

Carvalho, J

L. Carvalho, J. a. L. Costa, J. Mour ao, and G. c. Oliveira. The positivity of the neural tangent kernel.SIAM J. Math. Data Sci., 7(2):495–515, 2025

work page 2025
[11]

S. Cayci. A Riemannian optimization perspective of the Gauss-Newton method for feedforward neural networks.arXiv preprint arXiv:2412.14031, 2024

work page arXiv 2024
[12]

Chizat and F

L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.Advances in neural information processing systems, 31, 2018

work page 2018
[13]

Chizat, M

L. Chizat, M. Colombo, X. Fernández-Real, and A. Figalli. Infinite-width limit of deep linear neural networks.Comm. Pure Appl. Math., 77(10):3958–4007, 2024

work page 2024
[14]

Chizat, E

L. Chizat, E. Oyallon, and F. R. Bach. On lazy training in differentiable programming. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, C...

work page 2019
[15]

A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust-region methods. MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2000

work page 2000
[16]

G. Cybenko. Approximation by superpositions of a sigmoidal function.Math. Control. Signals Syst., 2(4):303–314, 1989

work page 1989
[17]

Dangel, J

F. Dangel, J. Müller, and M. Zeinhofer. Kronecker-factored approximate curvature for physics- informed neural networks. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancou...

work page 2024
[18]

N. A. Daryakenari, K. Shukla, and G. E. Karniadakis. Representation meets optimization: Training PINNs and PIKANs for gray-box discovery in systems pharmacology.Computers in Biology and Medicine, 201:111393, 2026

work page 2026
[19]

Fletcher.Practical methods of optimization

R. Fletcher.Practical methods of optimization. Wiley-Interscience [John Wiley & Sons], New York, second edition, 2001. 35

work page 2001
[20]

D. M. Gomes, Y . Zhang, E. Belilovsky, G. Wolf, and M. S. Hosseini. Adafisher: Adaptive second order optimization via Fisher information.arXiv preprint arXiv:2405.16397, 2024

work page arXiv 2024
[21]

Gupta, T

V . Gupta, T. Koren, and Y . Singer. Shampoo: Preconditioned stochastic tensor optimization. In J. G. Dy and A. Krause, editors,Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings of Machine Learning Research, pages 1837–1845. PMLR, 2018

work page 2018
[22]

K. Hornik. Approximation capabilities of multilayer feedforward networks.Neural Networks, 4(2):251–257, 1991

work page 1991
[23]

Ishikawa and R

S. Ishikawa and R. Karakida. On the parameterization of second-order optimization effective towards the infinite width. InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024

work page 2024
[24]

Y . Ito. Nonlinearity creates linear independence.Adv. Comput. Math., 5(2-3):189–203, 1996

work page 1996
[25]

Jacot, C

A. Jacot, C. Hongler, and F. Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors,Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2...

work page 2018
[26]

Curvature-Aware Optimization for High-Accuracy Physics-Informed Neural Networks

A. Jnini, E. Kiyani, K. Shukla, J. F. Urban, N. A. Daryakenari, J. Muller, M. Zeinhofer, and G. E. Karniadakis. Curvature-aware optimization for high-accuracy physics-informed neural networks.arXiv preprint arXiv:2604.05230, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[27]

Jnini, F

A. Jnini, F. Vella, and M. Zeinhofer. Gauss-Newton natural gradient descent for physics- informed computational fluid dynamics.Computers & Fluids, page 106955, 2025

work page 2025
[28]

Jordan, Y

K. Jordan, Y . Jin, V . Boza, Y . Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024

work page 2024
[29]

Karakida and K

R. Karakida and K. Osawa. Understanding approximate fisher information for fast convergence of natural gradient descent in wide neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors,Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decemb...

work page 2020
[30]

Karhadkar, M

K. Karhadkar, M. Murray, and G. F. Montúfar. Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 20...

work page 2024
[31]

Kiyani, K

E. Kiyani, K. Shukla, J. F. Urbán, J. Darbon, and G. E. Karniadakis. Optimizing the optimizer for physics-informed neural networks and kolmogorov-arnold networks.Computer Methods in Applied Mechanics and Engineering, 446:118308, 2025

work page 2025
[32]

D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989

work page 1989
[33]

Martens.Second-order optimization for neural networks

J. Martens.Second-order optimization for neural networks. University of Toronto (Canada), 2016

work page 2016
[34]

Martens and R

J. Martens and R. B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In F. R. Bach and D. M. Blei, editors,Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, JMLR Workshop and Conference Proceedings, pages 2408–2417, 2015

work page 2015
[35]

S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks.Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018. 36

work page 2018
[36]

Mishchenko

K. Mishchenko. Regularized Newton method with global O(1/k2) convergence.SIAM J. Optim., 33(3):1440–1462, 2023

work page 2023
[37]

Müller and M

J. Müller and M. Zeinhofer. Achieving high accuracy with pinns via energy natural gradient descent. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 ofProceedings of Machine Learning Research, pages 25471–25485...

work page 2023
[38]

Nocedal and S

J. Nocedal and S. J. Wright.Numerical optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006

work page 2006
[39]

Riedl, J

K. Riedl, J. A. Sirignano, and K. Spiliopoulos. Global convergence of adjoint-optimized neural PDEs.J. Mach. Learn. Res., 26:295:1–295:94, 2025

work page 2025
[40]

G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: an interacting particle system approach.Comm. Pure Appl. Math., 75(9):1889–1935, 2022

work page 1935
[41]

Sirignano, J

J. Sirignano, J. MacArt, and K. Spiliopoulos. PDE-constrained models with neural network terms: optimization and global convergence.J. Comput. Phys., 481:Paper No. 112016, 35, 2023

work page 2023
[42]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Scaling limit of neural networks with the Xavier initialization and convergence to a global minimum.arXiv preprint arXiv:1907.04108, 2019

work page arXiv 1907
[43]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: a central limit theorem. Stochastic Process. Appl., 130(3):1820–1852, 2020

work page 2020
[44]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math., 80(2):725–752, 2020

work page 2020
[45]

Sirignano and K

J. Sirignano and K. Spiliopoulos. Mean field analysis of deep neural networks.Math. Oper. Res., 47(1):120–152, 2022

work page 2022
[46]

Spiliopoulos, R

K. Spiliopoulos, R. Sowers, and J. Sirignano.Mathematical Foundations of Deep Learning Models and Algorithms. American Mathematical Society, 2025

work page 2025
[47]

J. A. Tropp. An introduction to matrix concentration inequalities.Found. Trends Mach. Learn., 8(1-2):1–230, 2015

work page 2015
[48]

N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: improving and stabilizing shampoo using adam for language modeling. InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025

work page 2025
[49]

Y . Wang, M. Bennani, J. Martens, S. Racanière, S. Blackwell, A. Matthews, S. Nikolov, G. Cao- Labora, D. S. Park, M. Arjovsky, et al. Discovery of unstable singularities.arXiv preprint arXiv:2509.14185, 2025

work page arXiv 2025
[50]

Zhang, J

G. Zhang, J. Martens, and R. B. Grosse. Fast convergence of natural gradient descent for over- parameterized neural networks. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché- Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 201...

work page 2019

[1] [1]

A. D. Adeoye, P. C. Petersen, and A. Bemporad. Regularized Gauss-Newton for optimizing overparameterized neural networks.arXiv preprint arXiv:2404.14875, 2024

work page arXiv 2024

[2] [2]

R. Anil, V . Gupta, T. Koren, K. Regan, and Y . Singer. Scalable second order optimization for deep learning.arXiv preprint arXiv:2002.09018, 2020

work page arXiv 2002

[3] [3]

Arbel, R

M. Arbel, R. Menegaux, and P. Wolinski. Rethinking Gauss-Newton for learning over- parameterized models. In A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, editors,Advances in Neural Information Processing Systems 36: Annual Confer- ence on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...

work page 2023

[4] [4]

Q. Bai, S. Rosenberg, and W. Xu. Generalized tangent kernel: A unified geometric foundation for natural gradient and standard gradient.Trans. Mach. Learn. Res., 2025, 2025

work page 2025

[5] [5]

Bietti and J

A. Bietti and J. Mairal. On the inductive bias of neural tangent kernels. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, Canada, pages ...

work page 2019

[6] [6]

Bonfanti, G

A. Bonfanti, G. Bruno, and C. Cipriani. The challenges of the nonlinear regime for physics- informed neural networks. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouve...

work page 2024

[7] [7]

Boyd and L

S. Boyd and L. Vandenberghe.Convex optimization. Cambridge University Press, Cambridge, 2004

work page 2004

[8] [8]

T. Cai, R. Gao, J. Hou, S. Chen, D. Wang, D. He, Z. Zhang, and L. Wang. Gram-Gauss-Newton method: Learning overparameterized neural networks for regression problems.arXiv preprint arXiv:1905.11675, 2019

work page arXiv 1905

[9] [9]

Y . Cao, Z. Fang, Y . Wu, D. Zhou, and Q. Gu. Towards understanding the spectral bias of deep learning. In Z. Zhou, editor,Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, pages 2205–2211, 2021

work page 2021

[10] [10]

Carvalho, J

L. Carvalho, J. a. L. Costa, J. Mour ao, and G. c. Oliveira. The positivity of the neural tangent kernel.SIAM J. Math. Data Sci., 7(2):495–515, 2025

work page 2025

[11] [11]

S. Cayci. A Riemannian optimization perspective of the Gauss-Newton method for feedforward neural networks.arXiv preprint arXiv:2412.14031, 2024

work page arXiv 2024

[12] [12]

Chizat and F

L. Chizat and F. Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport.Advances in neural information processing systems, 31, 2018

work page 2018

[13] [13]

Chizat, M

L. Chizat, M. Colombo, X. Fernández-Real, and A. Figalli. Infinite-width limit of deep linear neural networks.Comm. Pure Appl. Math., 77(10):3958–4007, 2024

work page 2024

[14] [14]

Chizat, E

L. Chizat, E. Oyallon, and F. R. Bach. On lazy training in differentiable programming. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors,Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 2019, December 8-14, 2019, Vancouver, BC, C...

work page 2019

[15] [15]

A. R. Conn, N. I. M. Gould, and P. L. Toint.Trust-region methods. MPS/SIAM Series on Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA; Mathematical Programming Society (MPS), Philadelphia, PA, 2000

work page 2000

[16] [16]

G. Cybenko. Approximation by superpositions of a sigmoidal function.Math. Control. Signals Syst., 2(4):303–314, 1989

work page 1989

[17] [17]

Dangel, J

F. Dangel, J. Müller, and M. Zeinhofer. Kronecker-factored approximate curvature for physics- informed neural networks. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancou...

[18] N. A. Daryakenari, K. Shukla, and G. E. Karniadakis. Representation meets optimization: Training PINNs and PIKANs for gray-box discovery in systems pharmacology. Computers in Biology and Medicine, 201:111393, 2026

[19] R. Fletcher. Practical methods of optimization. Wiley-Interscience [John Wiley & Sons], New York, second edition, 2001

[20] D. M. Gomes, Y. Zhang, E. Belilovsky, G. Wolf, and M. S. Hosseini. Adafisher: Adaptive second order optimization via Fisher information. arXiv preprint arXiv:2405.16397, 2024

[21] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In J. G. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, Proceedings of Machine Learning Research, pages 1837–1845. PMLR, 2018

[22] K. Hornik. Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2):251–257, 1991

[23] S. Ishikawa and R. Karakida. On the parameterization of second-order optimization effective towards the infinite width. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024, 2024

[24] Y. Ito. Nonlinearity creates linear independence. Adv. Comput. Math., 5(2-3):189–203, 1996

[25] A. Jacot, C. Hongler, and F. Gabriel. Neural tangent kernel: Convergence and generalization in neural networks. In S. Bengio, H. M. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2...

[26] A. Jnini, E. Kiyani, K. Shukla, J. F. Urban, N. A. Daryakenari, J. Muller, M. Zeinhofer, and G. E. Karniadakis. Curvature-aware optimization for high-accuracy physics-informed neural networks. arXiv preprint arXiv:2604.05230, 2026

[27] A. Jnini, F. Vella, and M. Zeinhofer. Gauss-Newton natural gradient descent for physics-informed computational fluid dynamics. Computers & Fluids, page 106955, 2025

[28] K. Jordan, Y. Jin, V. Boza, Y. Jiacheng, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024

[29] R. Karakida and K. Osawa. Understanding approximate fisher information for fast convergence of natural gradient descent in wide neural networks. In H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, Decemb...

[30] K. Karhadkar, M. Murray, and G. F. Montúfar. Bounds for the smallest eigenvalue of the NTK for arbitrary spherical data of arbitrary dimension. In A. Globersons, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. M. Tomczak, and C. Zhang, editors, Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 20...

[31] E. Kiyani, K. Shukla, J. F. Urbán, J. Darbon, and G. E. Karniadakis. Optimizing the optimizer for physics-informed neural networks and kolmogorov-arnold networks. Computer Methods in Applied Mechanics and Engineering, 446:118308, 2025

[32] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989

[33] J. Martens. Second-order optimization for neural networks. University of Toronto (Canada), 2016

[34] J. Martens and R. B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In F. R. Bach and D. M. Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, JMLR Workshop and Conference Proceedings, pages 2408–2417, 2015

[35] S. Mei, A. Montanari, and P.-M. Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018

[36] K. Mishchenko. Regularized Newton method with global $O(1/k^2)$ convergence. SIAM J. Optim., 33(3):1440–1462, 2023

[37] J. Müller and M. Zeinhofer. Achieving high accuracy with pinns via energy natural gradient descent. In A. Krause, E. Brunskill, K. Cho, B. Engelhardt, S. Sabato, and J. Scarlett, editors, International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pages 25471–25485...

[38] J. Nocedal and S. J. Wright. Numerical optimization. Springer Series in Operations Research and Financial Engineering. Springer, New York, second edition, 2006

[39] K. Riedl, J. A. Sirignano, and K. Spiliopoulos. Global convergence of adjoint-optimized neural PDEs. J. Mach. Learn. Res., 26:295:1–295:94, 2025

[40] G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: an interacting particle system approach. Comm. Pure Appl. Math., 75(9):1889–1935, 2022

[41] J. Sirignano, J. MacArt, and K. Spiliopoulos. PDE-constrained models with neural network terms: optimization and global convergence. J. Comput. Phys., 481:Paper No. 112016, 35, 2023

[42] J. Sirignano and K. Spiliopoulos. Scaling limit of neural networks with the Xavier initialization and convergence to a global minimum. arXiv preprint arXiv:1907.04108, 2019

[43] J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: a central limit theorem. Stochastic Process. Appl., 130(3):1820–1852, 2020

[44] J. Sirignano and K. Spiliopoulos. Mean field analysis of neural networks: a law of large numbers. SIAM J. Appl. Math., 80(2):725–752, 2020

[45] J. Sirignano and K. Spiliopoulos. Mean field analysis of deep neural networks. Math. Oper. Res., 47(1):120–152, 2022

[46] K. Spiliopoulos, R. Sowers, and J. Sirignano. Mathematical Foundations of Deep Learning Models and Algorithms. American Mathematical Society, 2025

[47] J. A. Tropp. An introduction to matrix concentration inequalities. Found. Trends Mach. Learn., 8(1-2):1–230, 2015

[48] N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: improving and stabilizing shampoo using adam for language modeling. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025, 2025

[49] Y. Wang, M. Bennani, J. Martens, S. Racanière, S. Blackwell, A. Matthews, S. Nikolov, G. Cao-Labora, D. S. Park, M. Arjovsky, et al. Discovery of unstable singularities. arXiv preprint arXiv:2509.14185, 2025

[50] G. Zhang, J. Martens, and R. B. Grosse. Fast convergence of natural gradient descent for over-parameterized neural networks. In H. M. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. B. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS 201...
