pith. machine review for the scientific record.

arxiv: 2605.08352 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.PR · stat.ML

Recognition: no theorem link

Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.LG · math.PR · stat.ML
keywords neural networks · overparameterized limit · Newton method · convergence analysis · neural tangent kernel · spectral bias · infinite width · regularization

The pith

Regularized Newton's method for neural networks converges exponentially to zero loss in the infinite-width limit uniformly across frequencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the training dynamics of neural networks trained with a regularized version of Newton's method as the number of hidden units tends to infinity. It establishes that these dynamics converge in probability to a deterministic limit governed by a Newton neural tangent kernel. In this limit the network reaches a global minimizer of the training loss at an exponential rate that does not degrade for high-frequency components of the target data. This stands in contrast to gradient descent, whose convergence slows on high-frequency targets because the eigenvalues of its kernel accumulate at zero.
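
To see the mechanism concretely, consider the linearized (lazy-training) picture: the residual of kernel eigenmode j contracts per step by a factor 1 − ησ_j under gradient descent and by λ/(σ_j + λ) under regularized Newton. A minimal numerical sketch of this contrast (the power-law spectrum, step size, and regularization value are illustrative assumptions, not taken from the paper):

    import numpy as np

    # Illustrative NTK-like spectrum whose eigenvalues accumulate at zero;
    # a higher mode index stands in for higher target frequency.
    sigma = 1.0 / np.arange(1, 11) ** 4
    eta, lam, k = 0.5, 1e-4, 50            # GD step size, Newton regularization, step count

    gd = (1 - eta * sigma) ** k            # per-mode residual after k gradient-descent steps
    newton = (lam / (sigma + lam)) ** k    # per-mode residual after k regularized Newton steps
    for j, (g, n) in enumerate(zip(gd, newton), start=1):
        print(f"mode {j:2d}:  GD {g:.2e}   Newton {n:.2e}")

High-frequency modes barely move under gradient descent, while the Newton contraction factor λ/(σ_j + λ) stays bounded away from one for every mode once λ is chosen at or below the smallest relevant eigenvalue, which is precisely the role of the paper's regularization scaling.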

Core claim

As the network width tends to infinity, the regularized Newton updates converge in probability to the solution of a deterministic evolution equation driven by the Newton neural tangent kernel. When the regularization parameter is scaled to vanish at a suitable rate with width, the kernel eigenvalues remain uniformly bounded away from zero, and the training error therefore decays exponentially to zero for any target function in the data space.
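
In schematic form (our notation and time discretization; the paper states the limit equation and explicit rates in terms of the NNTK):

    f_{k+1} \;=\; f_k \;-\; K^{\mathrm{NNTK}}\,(f_k - y),
    \qquad
    \|f_k - y\| \;\le\; (1 - c)^k\,\|f_0 - y\|,
    \qquad
    c \;=\; \lambda_{\min}\!\big(K^{\mathrm{NNTK}}\big) \;\in\; (0, 1],

so the claim reduces to keeping c bounded away from zero uniformly in the frequency content of y.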

What carries the argument

The Newton neural tangent kernel (NNTK): the linearized kernel that appears in the infinite-width limit of the regularized Newton updates, and whose eigenvalues stay bounded below once the regularization parameter is chosen to vanish appropriately with width.
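
A finite-dimensional stand-in makes this concrete. In the linearized (Gauss–Newton) picture, a simplifying assumption rather than the paper's full implicit update, the push-through identity J(JᵀJ + λI)⁻¹Jᵀ = K(K + λI)⁻¹ shows that the function-space step acts through K(K + λI)⁻¹, whose eigenvalues are σ/(σ + λ):

    import numpy as np

    rng = np.random.default_rng(1)
    M, lam = 64, 1e-8                      # data points and regularization (illustrative)
    # Synthesize an NTK-like Gram matrix with a polynomially decaying spectrum.
    Q, _ = np.linalg.qr(rng.normal(size=(M, M)))
    sigma = 1.0 / np.arange(1, M + 1) ** 4
    K = (Q * sigma) @ Q.T                  # K = Q diag(sigma) Q^T

    # Newton-kernel Gram via the push-through identity: K (K + lam I)^{-1}.
    newton_gram = K @ np.linalg.solve(K + lam * np.eye(M), np.eye(M))
    w = np.linalg.eigvalsh(newton_gram)    # eigenvalues sigma / (sigma + lam)
    print(f"NTK spectrum:           {sigma.min():.1e} .. {sigma.max():.1e}")
    print(f"Newton-kernel spectrum: {w.min():.3f} .. {w.max():.3f}")

With λ below the smallest eigenvalue, every Newton-kernel eigenvalue σ/(σ + λ) sits near one; this is the uniform lower bound that the paper engineers through its λ_m schedule.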

If this is right

  • Training error decays exponentially with a rate independent of the frequency content of the target data.
  • For all sufficiently large widths the regularized Hessian stays positive definite throughout the entire training trajectory.
  • Individual parameter updates under the Newton step converge to zero, so that the network remains a linearization around its random initialization; the sketch after this list monitors both this and the previous point on a toy model.
  • The method overcomes the spectral bias of gradient descent by keeping all kernel eigenvalues uniformly bounded away from zero.
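
A toy empirical check of the second and third bullets (everything here is an illustrative assumption: a shallow tanh network, a Gauss–Newton surrogate for the Hessian, arbitrary sizes and λ; the surrogate is positive semidefinite by construction, whereas the paper's full Hessian carries a residual-weighted second-order term that can be indefinite and that the λ_m scaling must control):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, lam = 512, 32, 1e-3              # hidden units, data points, regularization (assumed)
    X = np.linspace(-1.0, 1.0, M)
    y = np.sin(8 * np.pi * X)              # high-frequency target
    W, b = rng.normal(size=N), rng.normal(size=N)
    a = rng.normal(size=N) / np.sqrt(N)    # NTK-style output scaling
    theta0 = np.concatenate([a, W, b])

    def model_and_jacobian(a, W, b):
        # f(x) = sum_i a_i tanh(W_i x + b_i); Jacobian columns are df/da, df/dW, df/db.
        Z = np.outer(X, W) + b
        T = np.tanh(Z)
        S = (1.0 - T ** 2) * a             # a_i * tanh'(z_i), shared by the dW and db blocks
        return T @ a, np.hstack([T, S * X[:, None], S])

    for step in range(15):
        f, J = model_and_jacobian(a, W, b)
        r = f - y
        g = J.T @ r / M                    # gradient of (1/2M) ||f - y||^2
        H_reg = J.T @ J / M + lam * np.eye(3 * N)   # regularized Gauss-Newton Hessian
        min_eig = np.linalg.eigvalsh(H_reg)[0]      # bullet 2: positive along the trajectory?
        delta = np.linalg.solve(H_reg, g)           # regularized Newton step
        a, W, b = a - delta[:N], W - delta[N:2 * N], b - delta[2 * N:]
        drift = np.linalg.norm(np.concatenate([a, W, b]) - theta0)  # bullet 3: lazy regime?
        print(f"step {step:2d}  loss {0.5 * np.mean(r ** 2):.3e}  "
              f"min eig {min_eig:.3e}  drift {drift:.3e}")

On this toy the minimum eigenvalue never falls below λ, and the parameter drift stays small while the loss collapses, consistent with the linearization picture the bullets describe.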

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization scaling may allow other second-order optimizers to inherit uniform convergence rates in the infinite-width regime.
  • Practical implementations could pre-compute or approximate the required regularization schedule from width alone without per-iteration tuning.
  • The uniform eigenvalue bound suggests that second-order methods might reduce the need for frequency-specific architecture choices when modeling complex data.

Load-bearing premise

The regularization parameter must be chosen according to a scaling formula that vanishes at a suitable rate as width grows, so that the regularized Hessian remains positive definite for all sufficiently large networks during training.
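
In symbols, with a power-law placeholder for the schedule (the paper derives its own scaling formula, which this summary does not reproduce):

    \lambda_m \;=\; C\, m^{-\alpha} \;\longrightarrow\; 0
    \qquad \text{while} \qquad
    \inf_{t \le T} \lambda_{\min}\!\Big( \nabla^2_\theta L(\theta^m_t) + \lambda_m I \Big) \;>\; 0
    \quad \text{for all sufficiently large } m.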

What would settle it

For a sequence of increasing widths, the observed convergence rate for a high-frequency target function fails to remain exponential, or the smallest eigenvalue of the regularized Hessian drops below zero at some point in training.
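
One way to formalize the test (our notation; κ indexes target frequency and m_ℓ the width sequence):

    \exists\, \kappa,\; m_\ell \to \infty :\quad
    \limsup_{k \to \infty} \tfrac{1}{k} \log \big\| f^{m_\ell}_k - y_\kappa \big\| \;=\; 0
    \qquad \text{or} \qquad
    \inf_{t}\, \lambda_{\min}\!\Big( \nabla^2_\theta L(\theta^{m_\ell}_t) + \lambda_{m_\ell} I \Big) \;<\; 0,

i.e., either the decay is sub-exponential for some frequency along a width sequence, or positive definiteness fails somewhere on the trajectory.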

Figures

Figures reproduced from arXiv: 2605.08352 by Justin Sirignano, Konstantinos Spiliopoulos, Konstantin Riedl.

Figure 1. Empirical validation of Remarks 5 and 6 for a shallow … [figure: figures/full_fig_p008_1.png]
Figure 2. Empirical validation of Lemma 9 and Theorem 7 for a shallow … [figure: figures/full_fig_p009_2.png]
Figure 3. The Jacobian J^N_θ ∈ ℝ^{M×N(d+2)} and its different slicings. [figure: figures/full_fig_p013_3.png]
Original abstract

A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a "Newton neural tangent kernel" (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a convergence analysis for the regularized Newton method applied to overparameterized neural networks. As width m tends to infinity, the finite-width training dynamics are shown to converge in probability to a deterministic limit equation governed by a Newton neural tangent kernel (NNTK). In this limit the network converges exponentially fast to a global minimizer of zero loss, with the rate uniform across the frequency spectrum. This is contrasted with the spectral bias of gradient descent, whose NTK eigenvalues accumulate at zero. A scaling formula for the regularization parameter λ_m → 0 is identified that is claimed to keep the regularized Hessian positive definite for all sufficiently large m throughout training, allowing the NNTK spectrum to be bounded away from zero.

Significance. If the central claims hold, the work would constitute a substantial theoretical contribution to the analysis of second-order optimization for neural networks. It supplies explicit convergence rates, derives a new NNTK limit object from the implicit Newton update, and offers a mechanism for overcoming spectral bias via uniform eigenvalue control. The technical handling of infinite-dimensional linear systems and the vanishing regularization scaling are non-trivial and, if rigorously closed, would be of broad interest in the overparameterized regime.

major comments (3)
  1. [section deriving the NNTK and proving positive definiteness during training] The claim that the regularized Hessian remains positive definite with eigenvalues bounded below uniformly in m and throughout the entire training trajectory (abstract and the section deriving the NNTK limit equation) is load-bearing for both the well-posedness of the implicit update and the exponential convergence rate. The provided scaling formula for λ_m appears to be justified by concentration at initialization; however, for high-frequency targets the Newton steps induce parameter deviations whose size is controlled only after invoking the very convergence rate that presupposes the eigenvalue lower bound. A bootstrap or a priori estimate that bounds the trajectory deviation independently of the rate is required to close the argument.
  2. [analysis of the infinite-dimensional linear system and convergence to the limit equation] The convergence in probability of the finite-width Newton iterates to the infinite-width limit equation (the central technical result) rests on error controls for the implicit linear solve whose dimension diverges with m. The manuscript flags this as a mathematical challenge but does not supply the explicit quantitative bounds on the residual or the invertibility error that would justify passing to the limit while preserving the uniform spectral gap of the NNTK.
  3. [proof of exponential convergence uniform across frequencies] The uniformity of the exponential convergence rate across the frequency spectrum is asserted once the NNTK eigenvalues are bounded away from zero. The proof must still verify that the finite-width approximation error does not re-introduce frequency-dependent slowdowns before the infinite-width limit is taken; without an explicit rate that is uniform in frequency for the finite-m dynamics, the claim that Newton's method eliminates spectral bias for all large but finite widths remains incomplete.
minor comments (2)
  1. [preliminaries and definition of NNTK] Notation for the NNTK and its relation to the standard NTK should be introduced with a clear side-by-side comparison, including how the regularization enters the kernel definition.
  2. [convergence statements] Several steps in the derivation of the limit dynamics invoke “sufficiently large m” without quantitative thresholds; adding explicit dependence on m in the error terms would improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify key technical gaps in closing the arguments for positive definiteness, limit passage, and uniform rates. We address each point below and will incorporate major revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [section deriving the NNTK and proving positive definiteness during training] The claim that the regularized Hessian remains positive definite with eigenvalues bounded below uniformly in m and throughout the entire training trajectory is load-bearing. The scaling for λ_m is justified at initialization; however, for high-frequency targets the Newton steps induce parameter deviations whose size is controlled only after invoking the convergence rate that presupposes the eigenvalue lower bound. A bootstrap or a priori estimate that bounds the trajectory deviation independently of the rate is required to close the argument.

    Authors: We acknowledge the circularity concern. The manuscript establishes positive definiteness at initialization via concentration inequalities and shows that parameter updates remain small, but the argument for high-frequency targets does rely on the rate to control deviations. In revision we will insert an explicit bootstrap: first derive an a priori bound on the entire training-trajectory deviation that depends only on the regularization scaling λ_m and the boundedness of the loss (independent of the exponential rate), using the fact that the initial regularized Hessian is uniformly positive definite for large m. This bound then justifies the uniform eigenvalue lower bound throughout training, closing the argument without circularity (see the schematic after these responses). revision: yes

  2. Referee: [analysis of the infinite-dimensional linear system and convergence to the limit equation] The convergence in probability of the finite-width Newton iterates to the infinite-width limit equation rests on error controls for the implicit linear solve whose dimension diverges with m. The manuscript flags this as a challenge but does not supply the explicit quantitative bounds on the residual or the invertibility error that would justify passing to the limit while preserving the uniform spectral gap of the NNTK.

    Authors: We agree that explicit quantitative controls are needed. The current text identifies the diverging-dimension challenge and invokes general concentration but stops short of detailing the residual and invertibility errors. In the revision we will add a dedicated lemma providing explicit bounds: the residual of the regularized linear system is O(1/√m) in probability uniformly over training steps, and the operator-norm distance to the NNTK inverse vanishes at the same rate, ensuring the uniform spectral gap is preserved in the limit (see the schematic after these responses). revision: yes

  3. Referee: [proof of exponential convergence uniform across frequencies] The uniformity of the exponential convergence rate across the frequency spectrum is asserted once the NNTK eigenvalues are bounded away from zero. The proof must still verify that the finite-width approximation error does not re-introduce frequency-dependent slowdowns before the infinite-width limit is taken; without an explicit rate that is uniform in frequency for the finite-m dynamics, the claim that Newton's method eliminates spectral bias for all large but finite widths remains incomplete.

    Authors: The manuscript proves uniformity in the infinite-width limit after establishing the NNTK spectral gap. We concede that finite-m approximation errors could in principle reintroduce frequency dependence. In revision we will augment the error analysis to show that the finite-width discrepancy terms (arising from both the kernel approximation and the implicit solve) are bounded uniformly in frequency by the same λ_m scaling, yielding an explicit exponential rate that holds for all sufficiently large finite m and is independent of the target frequency content. revision: yes
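
In schematic form (our notation; these transcribe the bounds promised above, not results already in the manuscript), the bootstrap of response 1 reads

    \sup_{t \le T} \big\| \theta^m_t - \theta^m_0 \big\| \;\le\; C\big(\lambda_m, L(\theta_0)\big) \;\longrightarrow\; 0
    \quad \Longrightarrow \quad
    \inf_{t \le T} \lambda_{\min}\!\big( \nabla^2_\theta L(\theta^m_t) + \lambda_m I \big)
    \;\ge\; \tfrac{1}{2}\, \lambda_{\min}\!\big( \nabla^2_\theta L(\theta^m_0) + \lambda_m I \big) \;>\; 0,

with the implication supplied by local Lipschitz continuity of the Hessian in θ rather than by the convergence rate, and the lemma of response 2 reads

    \sup_k \big\| r^m_k \big\| \;=\; O_{\mathbb{P}}\big(m^{-1/2}\big),
    \qquad
    \sup_k \big\| \mathcal{K}_m^{-1} - \big(K^{\mathrm{NNTK}}\big)^{-1} \big\|_{\mathrm{op}} \;=\; O_{\mathbb{P}}\big(m^{-1/2}\big),

where r^m_k is the residual of the regularized linear solve at step k and 𝒦_m is the finite-width kernel operator, so the uniform spectral gap survives the passage to the limit.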

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper derives the deterministic limit equation and NNTK from the regularized Newton dynamics in the overparameterized regime, then proves exponential convergence to zero loss uniformly in frequency for large widths. The regularization scaling formula is shown (not presupposed) to make the regularized Hessian positive definite throughout training, with explicit rates and handling of the implicit update and infinite-dimensional linear system. No quoted step reduces a prediction or central claim to a fitted input, self-definition, or self-citation chain by construction; the results are obtained via mathematical analysis of the model equations rather than by renaming or fitting to outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on the existence of the infinite-width limit, a vanishing regularization schedule that preserves positive definiteness, and standard probabilistic convergence tools.

free parameters (1)
  • regularization scaling rate
    A formula for the regularization parameter that is allowed to vanish with width while keeping the Hessian positive definite.
axioms (2)
  • domain assumption Finite-width Newton dynamics converge in probability to the deterministic NNTK limit equation
    Invoked to pass from finite to infinite width.
  • ad hoc to paper Regularized Hessian remains positive definite for all sufficiently large widths during training
    Required for well-posedness of the Newton updates and stated as proved for large enough width.
invented entities (1)
  • Newton neural tangent kernel (NNTK) no independent evidence
    purpose: Deterministic kernel that governs the infinite-width training dynamics of regularized Newton's method
    Newly introduced object analogous to the standard NTK but derived from the Newton update rule.

pith-pipeline@v0.9.0 · 5610 in / 1444 out tokens · 52314 ms · 2026-05-12T01:11:15.265890+00:00 · methodology

