pith. machine review for the scientific record.

arxiv: 2605.08352 · v1 · submitted 2026-05-08 · 💻 cs.LG · math.PR · stat.ML

Recognition: no theorem link

Convergence Analysis of Newton's Method for Neural Networks in the Overparameterized Limit

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 01:11 UTC · model grok-4.3

classification 💻 cs.LG · math.PR · stat.ML
keywords neural networks · overparameterized limit · Newton method · convergence analysis · neural tangent kernel · spectral bias · infinite width · regularization

The pith

Regularized Newton's method for neural networks converges exponentially to zero loss in the infinite-width limit uniformly across frequencies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes the training dynamics of neural networks trained with a regularized version of Newton's method as the number of hidden units tends to infinity. It establishes that these dynamics converge in probability to a deterministic limit governed by a Newton neural tangent kernel. In this limit the network reaches a global minimizer of the training loss at an exponential rate that does not degrade for high-frequency components of the target data. This stands in contrast to gradient descent, whose convergence slows on high-frequency targets because the eigenvalues of its kernel accumulate at zero.
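
To see the mechanism concretely, consider the linearized (lazy-training) picture: the residual of kernel eigenmode j contracts per step by a factor 1 − ησ_j under gradient descent and by λ/(σ_j + λ) under regularized Newton. A minimal numerical sketch of this contrast (the power-law spectrum, step size, and regularization value are illustrative assumptions, not taken from the paper):

    import numpy as np

    # Illustrative NTK-like spectrum whose eigenvalues accumulate at zero;
    # a higher mode index stands in for higher target frequency.
    sigma = 1.0 / np.arange(1, 11) ** 4
    eta, lam, k = 0.5, 1e-4, 50            # GD step size, Newton regularization, step count

    gd = (1 - eta * sigma) ** k            # per-mode residual after k gradient-descent steps
    newton = (lam / (sigma + lam)) ** k    # per-mode residual after k regularized Newton steps
    for j, (g, n) in enumerate(zip(gd, newton), start=1):
        print(f"mode {j:2d}:  GD {g:.2e}   Newton {n:.2e}")

High-frequency modes barely move under gradient descent, while the Newton contraction factor λ/(σ_j + λ) stays bounded away from one for every mode once λ is chosen at or below the smallest relevant eigenvalue, which is precisely the role of the paper's regularization scaling.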

Core claim

As the network width tends to infinity, the regularized Newton updates converge in probability to the solution of a deterministic evolution equation driven by the Newton neural tangent kernel. When the regularization parameter is scaled to vanish at a suitable rate with width, the kernel eigenvalues remain uniformly bounded away from zero, and the training error therefore decays exponentially to zero for any target function in the data space.
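
In schematic form (our notation and time discretization; the paper states the limit equation and explicit rates in terms of the NNTK):

    f_{k+1} \;=\; f_k \;-\; K^{\mathrm{NNTK}}\,(f_k - y),
    \qquad
    \|f_k - y\| \;\le\; (1 - c)^k\,\|f_0 - y\|,
    \qquad
    c \;=\; \lambda_{\min}\!\big(K^{\mathrm{NNTK}}\big) \;\in\; (0, 1],

so the claim reduces to keeping c bounded away from zero uniformly in the frequency content of y.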

What carries the argument

The Newton neural tangent kernel (NNTK): the linearized kernel that appears in the infinite-width limit of the regularized Newton updates, and whose eigenvalues stay bounded below once the regularization parameter is chosen to vanish appropriately with width.
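
A finite-dimensional stand-in makes this concrete. In the linearized (Gauss–Newton) picture, a simplifying assumption rather than the paper's full implicit update, the push-through identity J(JᵀJ + λI)⁻¹Jᵀ = K(K + λI)⁻¹ shows that the function-space step acts through K(K + λI)⁻¹, whose eigenvalues are σ/(σ + λ):

    import numpy as np

    rng = np.random.default_rng(1)
    M, lam = 64, 1e-8                      # data points and regularization (illustrative)
    # Synthesize an NTK-like Gram matrix with a polynomially decaying spectrum.
    Q, _ = np.linalg.qr(rng.normal(size=(M, M)))
    sigma = 1.0 / np.arange(1, M + 1) ** 4
    K = (Q * sigma) @ Q.T                  # K = Q diag(sigma) Q^T

    # Newton-kernel Gram via the push-through identity: K (K + lam I)^{-1}.
    newton_gram = K @ np.linalg.solve(K + lam * np.eye(M), np.eye(M))
    w = np.linalg.eigvalsh(newton_gram)    # eigenvalues sigma / (sigma + lam)
    print(f"NTK spectrum:           {sigma.min():.1e} .. {sigma.max():.1e}")
    print(f"Newton-kernel spectrum: {w.min():.3f} .. {w.max():.3f}")

With λ below the smallest eigenvalue, every Newton-kernel eigenvalue σ/(σ + λ) sits near one; this is the uniform lower bound that the paper engineers through its λ_m schedule.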

If this is right

  • Training error decays exponentially with a rate independent of the frequency content of the target data.
  • For all sufficiently large widths the regularized Hessian stays positive definite throughout the entire training trajectory.
  • Individual parameter updates under the Newton step converge to zero, so that the network remains a linearization around its random initialization; the sketch after this list monitors both this and the previous point on a toy model.
  • The method overcomes the spectral bias of gradient descent by keeping all kernel eigenvalues uniformly bounded away from zero.
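
A toy empirical check of the second and third bullets (everything here is an illustrative assumption: a shallow tanh network, a Gauss–Newton surrogate for the Hessian, arbitrary sizes and λ; the surrogate is positive semidefinite by construction, whereas the paper's full Hessian carries a residual-weighted second-order term that can be indefinite and that the λ_m scaling must control):

    import numpy as np

    rng = np.random.default_rng(0)
    N, M, lam = 512, 32, 1e-3              # hidden units, data points, regularization (assumed)
    X = np.linspace(-1.0, 1.0, M)
    y = np.sin(8 * np.pi * X)              # high-frequency target
    W, b = rng.normal(size=N), rng.normal(size=N)
    a = rng.normal(size=N) / np.sqrt(N)    # NTK-style output scaling
    theta0 = np.concatenate([a, W, b])

    def model_and_jacobian(a, W, b):
        # f(x) = sum_i a_i tanh(W_i x + b_i); Jacobian columns are df/da, df/dW, df/db.
        Z = np.outer(X, W) + b
        T = np.tanh(Z)
        S = (1.0 - T ** 2) * a             # a_i * tanh'(z_i), shared by the dW and db blocks
        return T @ a, np.hstack([T, S * X[:, None], S])

    for step in range(15):
        f, J = model_and_jacobian(a, W, b)
        r = f - y
        g = J.T @ r / M                    # gradient of (1/2M) ||f - y||^2
        H_reg = J.T @ J / M + lam * np.eye(3 * N)   # regularized Gauss-Newton Hessian
        min_eig = np.linalg.eigvalsh(H_reg)[0]      # bullet 2: positive along the trajectory?
        delta = np.linalg.solve(H_reg, g)           # regularized Newton step
        a, W, b = a - delta[:N], W - delta[N:2 * N], b - delta[2 * N:]
        drift = np.linalg.norm(np.concatenate([a, W, b]) - theta0)  # bullet 3: lazy regime?
        print(f"step {step:2d}  loss {0.5 * np.mean(r ** 2):.3e}  "
              f"min eig {min_eig:.3e}  drift {drift:.3e}")

On this toy the minimum eigenvalue never falls below λ, and the parameter drift stays small while the loss collapses, consistent with the linearization picture the bullets describe.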

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same regularization scaling may allow other second-order optimizers to inherit uniform convergence rates in the infinite-width regime.
  • Practical implementations could pre-compute or approximate the required regularization schedule from width alone without per-iteration tuning.
  • The uniform eigenvalue bound suggests that second-order methods might reduce the need for frequency-specific architecture choices when modeling complex data.

Load-bearing premise

The regularization parameter must be chosen according to a scaling formula that vanishes at a suitable rate as width grows, so that the regularized Hessian remains positive definite for all sufficiently large networks during training.
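
In symbols, with a power-law placeholder for the schedule (the paper derives its own scaling formula, which this summary does not reproduce):

    \lambda_m \;=\; C\, m^{-\alpha} \;\longrightarrow\; 0
    \qquad \text{while} \qquad
    \inf_{t \le T} \lambda_{\min}\!\Big( \nabla^2_\theta L(\theta^m_t) + \lambda_m I \Big) \;>\; 0
    \quad \text{for all sufficiently large } m.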

What would settle it

For a sequence of increasing widths, the observed convergence rate for a high-frequency target function fails to remain exponential, or the smallest eigenvalue of the regularized Hessian drops below zero at some point in training.
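
One way to formalize the test (our notation; κ indexes target frequency and m_ℓ the width sequence):

    \exists\, \kappa,\; m_\ell \to \infty :\quad
    \limsup_{k \to \infty} \tfrac{1}{k} \log \big\| f^{m_\ell}_k - y_\kappa \big\| \;=\; 0
    \qquad \text{or} \qquad
    \inf_{t}\, \lambda_{\min}\!\Big( \nabla^2_\theta L(\theta^{m_\ell}_t) + \lambda_{m_\ell} I \Big) \;<\; 0,

i.e., either the decay is sub-exponential for some frequency along a width sequence, or positive definiteness fails somewhere on the trajectory.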

Figures

Figures reproduced from arXiv: 2605.08352 by Justin Sirignano, Konstantinos Spiliopoulos, Konstantin Riedl.

Figure 1. Empirical validation of Remarks 5 and 6 for a shallow … [figure: figures/full_fig_p008_1.png]
Figure 2. Empirical validation of Lemma 9 and Theorem 7 for a shallow … [figure: figures/full_fig_p009_2.png]
Figure 3. The Jacobian J^N_θ ∈ ℝ^{M×N(d+2)} and its different slicings. [figure: figures/full_fig_p013_3.png]
Original abstract

A convergence analysis is developed for the regularized Newton method for training neural networks (NNs) in the overparameterized limit. As the number of hidden units tends to infinity, the NN training dynamics converge in probability to the solution of a deterministic limit equation involving a "Newton neural tangent kernel" (NNTK). Explicit rates characterizing this convergence are provided and, in the infinite-width limit, we prove that the NN converges exponentially fast to the target data (i.e., a global minimizer with zero loss). We show that this convergence is uniform across the frequency spectrum, addressing the spectral bias inherent in gradient descent. The eigenvalues of the NTK for gradient descent accumulate at zero, leading to slow convergence for target data with high-frequency components. In contrast, the NNTK has uniformly lower bounded eigenvalues if the regularization parameter is selected appropriately, allowing Newton's method to converge more quickly for data with high-frequency components. Mathematical challenges that need to be addressed in our analysis include the implicit parameter update of the Newton method with a potentially indefinite Hessian matrix and the fact that the dimension of this linear system of equations tends to infinity as the NN width grows. This complicates deriving the training dynamics in the overparameterized limit as well as proving the convergence of the finite-width dynamics thereto. The analysis identifies a scaling formula for selecting the regularization parameter, which we show can vanish at a suitable rate as the number of hidden units becomes larger. We prove that, for sufficiently large numbers of hidden units, the regularized Hessian remains positive definite during training and the Newton updates for individual NN parameters converge to zero, showing that the model behaves as a linearization around the initialization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper develops a convergence analysis for the regularized Newton method applied to overparameterized neural networks. As width m tends to infinity, the finite-width training dynamics are shown to converge in probability to a deterministic limit equation governed by a Newton neural tangent kernel (NNTK). In this limit the network converges exponentially fast to a global minimizer of zero loss, with the rate uniform across the frequency spectrum. This is contrasted with the spectral bias of gradient descent, whose NTK eigenvalues accumulate at zero. A scaling formula for the regularization parameter λ_m → 0 is identified that is claimed to keep the regularized Hessian positive definite for all sufficiently large m throughout training, allowing the NNTK spectrum to be bounded away from zero.

Significance. If the central claims hold, the work would constitute a substantial theoretical contribution to the analysis of second-order optimization for neural networks. It supplies explicit convergence rates, derives a new NNTK limit object from the implicit Newton update, and offers a mechanism for overcoming spectral bias via uniform eigenvalue control. The technical handling of infinite-dimensional linear systems and the vanishing regularization scaling are non-trivial and, if rigorously closed, would be of broad interest in the overparameterized regime.

major comments (3)
  1. [section deriving the NNTK and proving positive definiteness during training] The claim that the regularized Hessian remains positive definite with eigenvalues bounded below uniformly in m and throughout the entire training trajectory (abstract and the section deriving the NNTK limit equation) is load-bearing for both the well-posedness of the implicit update and the exponential convergence rate. The provided scaling formula for λ_m appears to be justified by concentration at initialization; however, for high-frequency targets the Newton steps induce parameter deviations whose size is controlled only after invoking the very convergence rate that presupposes the eigenvalue lower bound. A bootstrap or a priori estimate that bounds the trajectory deviation independently of the rate is required to close the argument.
  2. [analysis of the infinite-dimensional linear system and convergence to the limit equation] The convergence in probability of the finite-width Newton iterates to the infinite-width limit equation (the central technical result) rests on error controls for the implicit linear solve whose dimension diverges with m. The manuscript flags this as a mathematical challenge but does not supply the explicit quantitative bounds on the residual or the invertibility error that would justify passing to the limit while preserving the uniform spectral gap of the NNTK.
  3. [proof of exponential convergence uniform across frequencies] The uniformity of the exponential convergence rate across the frequency spectrum is asserted once the NNTK eigenvalues are bounded away from zero. The proof must still verify that the finite-width approximation error does not re-introduce frequency-dependent slowdowns before the infinite-width limit is taken; without an explicit rate that is uniform in frequency for the finite-m dynamics, the claim that Newton's method eliminates spectral bias for all large but finite widths remains incomplete.
minor comments (2)
  1. [preliminaries and definition of NNTK] Notation for the NNTK and its relation to the standard NTK should be introduced with a clear side-by-side comparison, including how the regularization enters the kernel definition.
  2. [convergence statements] Several steps in the derivation of the limit dynamics invoke “sufficiently large m” without quantitative thresholds; adding explicit dependence on m in the error terms would improve readability.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The comments identify key technical gaps in closing the arguments for positive definiteness, limit passage, and uniform rates. We address each point below and will incorporate major revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [section deriving the NNTK and proving positive definiteness during training] The claim that the regularized Hessian remains positive definite with eigenvalues bounded below uniformly in m and throughout the entire training trajectory is load-bearing. The scaling for λ_m is justified at initialization; however, for high-frequency targets the Newton steps induce parameter deviations whose size is controlled only after invoking the convergence rate that presupposes the eigenvalue lower bound. A bootstrap or a priori estimate that bounds the trajectory deviation independently of the rate is required to close the argument.

    Authors: We acknowledge the circularity concern. The manuscript establishes positive definiteness at initialization via concentration inequalities and shows that parameter updates remain small, but the argument for high-frequency targets does rely on the rate to control deviations. In revision we will insert an explicit bootstrap: first derive an a priori bound on the entire training-trajectory deviation that depends only on the regularization scaling λ_m and the boundedness of the loss (independent of the exponential rate), using the fact that the initial regularized Hessian is uniformly positive definite for large m. This bound then justifies the uniform eigenvalue lower bound throughout training, closing the argument without circularity (see the schematic after these responses). revision: yes

  2. Referee: [analysis of the infinite-dimensional linear system and convergence to the limit equation] The convergence in probability of the finite-width Newton iterates to the infinite-width limit equation rests on error controls for the implicit linear solve whose dimension diverges with m. The manuscript flags this as a challenge but does not supply the explicit quantitative bounds on the residual or the invertibility error that would justify passing to the limit while preserving the uniform spectral gap of the NNTK.

    Authors: We agree that explicit quantitative controls are needed. The current text identifies the diverging-dimension challenge and invokes general concentration but stops short of detailing the residual and invertibility errors. In the revision we will add a dedicated lemma providing explicit bounds: the residual of the regularized linear system is O(1/√m) in probability uniformly over training steps, and the operator-norm distance to the NNTK inverse vanishes at the same rate, ensuring the uniform spectral gap is preserved in the limit (see the schematic after these responses). revision: yes

  3. Referee: [proof of exponential convergence uniform across frequencies] The uniformity of the exponential convergence rate across the frequency spectrum is asserted once the NNTK eigenvalues are bounded away from zero. The proof must still verify that the finite-width approximation error does not re-introduce frequency-dependent slowdowns before the infinite-width limit is taken; without an explicit rate that is uniform in frequency for the finite-m dynamics, the claim that Newton's method eliminates spectral bias for all large but finite widths remains incomplete.

    Authors: The manuscript proves uniformity in the infinite-width limit after establishing the NNTK spectral gap. We concede that finite-m approximation errors could in principle reintroduce frequency dependence. In revision we will augment the error analysis to show that the finite-width discrepancy terms (arising from both the kernel approximation and the implicit solve) are bounded uniformly in frequency by the same λ_m scaling, yielding an explicit exponential rate that holds for all sufficiently large finite m and is independent of the target frequency content. revision: yes
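
In schematic form (our notation; these transcribe the bounds promised above, not results already in the manuscript), the bootstrap of response 1 reads

    \sup_{t \le T} \big\| \theta^m_t - \theta^m_0 \big\| \;\le\; C\big(\lambda_m, L(\theta_0)\big) \;\longrightarrow\; 0
    \quad \Longrightarrow \quad
    \inf_{t \le T} \lambda_{\min}\!\big( \nabla^2_\theta L(\theta^m_t) + \lambda_m I \big)
    \;\ge\; \tfrac{1}{2}\, \lambda_{\min}\!\big( \nabla^2_\theta L(\theta^m_0) + \lambda_m I \big) \;>\; 0,

with the implication supplied by local Lipschitz continuity of the Hessian in θ rather than by the convergence rate, and the lemma of response 2 reads

    \sup_k \big\| r^m_k \big\| \;=\; O_{\mathbb{P}}\big(m^{-1/2}\big),
    \qquad
    \sup_k \big\| \mathcal{K}_m^{-1} - \big(K^{\mathrm{NNTK}}\big)^{-1} \big\|_{\mathrm{op}} \;=\; O_{\mathbb{P}}\big(m^{-1/2}\big),

where r^m_k is the residual of the regularized linear solve at step k and 𝒦_m is the finite-width kernel operator, so the uniform spectral gap survives the passage to the limit.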

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper derives the deterministic limit equation and NNTK from the regularized Newton dynamics in the overparameterized regime, then proves exponential convergence to zero loss uniformly in frequency for large widths. The regularization scaling formula is shown (not presupposed) to make the regularized Hessian positive definite throughout training, with explicit rates and handling of the implicit update and infinite-dimensional linear system. No quoted step reduces a prediction or central claim to a fitted input, self-definition, or self-citation chain by construction; the results are obtained via mathematical analysis of the model equations rather than by renaming or fitting to outcomes.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claims rest on the existence of the infinite-width limit, a vanishing regularization schedule that preserves positive definiteness, and standard probabilistic convergence tools.

free parameters (1)
  • regularization scaling rate
    A formula for the regularization parameter that is allowed to vanish with width while keeping the Hessian positive definite.
axioms (2)
  • domain assumption Finite-width Newton dynamics converge in probability to the deterministic NNTK limit equation
    Invoked to pass from finite to infinite width.
  • ad hoc to paper Regularized Hessian remains positive definite for all sufficiently large widths during training
    Required for well-posedness of the Newton updates and stated as proved for large enough width.
invented entities (1)
  • Newton neural tangent kernel (NNTK) no independent evidence
    purpose: Deterministic kernel that governs the infinite-width training dynamics of regularized Newton's method
    Newly introduced object analogous to the standard NTK but derived from the Newton update rule.

pith-pipeline@v0.9.0 · 5610 in / 1444 out tokens · 52314 ms · 2026-05-12T01:11:15.265890+00:00 · methodology

