
arxiv: 2605.18180 · v1 · pith:IEBWWZMWnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

Canonical Regularisation of Wide Feature-Learning Neural Networks

Pith reviewed 2026-05-20 00:28 UTC · model grok-4.3

classification 📊 stat.ML cs.LG
keywords regularization · gradient flow · feature learning · kernel regime · neural networks · Riemannian geometry · inductive bias · transfer learning

The pith

Ridge regularization biases gradient flow in feature-learning neural networks even as its strength vanishes, and a regime-agnostic function-space energy generalizes it to geodesic ridge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Wide neural networks in the feature-learning regime, which drive modern deep learning, interact with regularization differently than their kernel-regime counterparts. The paper proves that ridge regularization distorts the training trajectory and inductive bias in feature-learning networks, even in the limit of vanishing regularization strength, with particular harm to pretrained models whose implicit priors are informative. The authors resolve the mismatch by axiomatizing the canonical regularizer as a regime-agnostic function-space energy and lift that recovers ridge exactly in the kernel regime. They then use the Riemannian geometry of feature-learning networks to derive geodesic ridge as the generalization. This framework also identifies the canonical prior as a Riemannian Gibbs Process rather than a Gaussian Process, and introduces arc ridge as a practical minimax-robust surrogate that links early stopping to canonical regularization.
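
The kernel-regime baseline that the paper generalizes can be checked numerically on a fixed-feature linear model, used here as a stand-in for a kernel-regime network. The sketch below is illustrative only: the data, the dimensions, and the names `Phi`, `theta0`, and `lam` are assumptions, and the ridge penalty is assumed to be anchored at the initialization. It compares the point reached by small-step gradient descent (a proxy for gradient flow) with the anchored ridge solution at a very small penalty and with the closed-form interpolant closest to the initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 5, 20                        # fewer data points than parameters
Phi = rng.normal(size=(n, p))       # fixed features (hypothetical data)
y = rng.normal(size=n)
theta0 = rng.normal(size=p)         # initialization, also used as the ridge anchor

# Small-step gradient descent on 0.5 * ||Phi @ theta - y||^2 as a proxy for gradient flow.
theta_gd = theta0.copy()
lr = 1.0 / np.linalg.norm(Phi, 2) ** 2   # step size from the largest singular value
for _ in range(20_000):
    theta_gd -= lr * Phi.T @ (Phi @ theta_gd - y)

# Anchored ridge (assumed form): argmin 0.5*||Phi@theta - y||^2 + 0.5*lam*||theta - theta0||^2.
lam = 1e-8
residual0 = y - Phi @ theta0
theta_ridge = theta0 + Phi.T @ np.linalg.solve(Phi @ Phi.T + lam * np.eye(n), residual0)

# Closed-form interpolant closest to theta0 in Euclidean norm.
theta_min = theta0 + np.linalg.pinv(Phi) @ residual0

print("gradient descent vs closest interpolant:", np.linalg.norm(theta_gd - theta_min))
print("small-lam ridge vs closest interpolant: ", np.linalg.norm(theta_ridge - theta_min))
```

In this linear setting all three points coincide up to numerical error, which is the property the paper reports failing once the features themselves move during training.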

Core claim

The authors prove that ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. They resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime, and crucially generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, they derive geodesic ridge from their framework. Correspondingly, they prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, they propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge.

What carries the argument

The canonical regulariser, axiomatised as a regime-agnostic function-space energy and lift and extended via the Riemannian geometry of feature-learning networks to produce geodesic ridge.

If this is right

  • Ridge regularization distorts the inductive bias of feature-learning networks over the course of training.
  • Pretrained networks experience particular damage from this distortion when the implicit prior is informative.
  • The canonical function-space prior corresponds to a Riemannian Gibbs Process.
  • Arc ridge serves as a minimax-robust and scalable surrogate to geodesic ridge.
  • A deep relationship exists between early stopping and canonical regularisation across learning regimes.
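
For orientation, the classical linear-model version of the link between early stopping and ridge can be written as a pair of spectral filters. This is a standard textbook comparison, not the paper's arc ridge construction. It assumes a fixed feature matrix $\Phi = \sum_i s_i u_i v_i^\top$, squared loss $\tfrac12\|\Phi\theta - y\|^2$, gradient flow started at $\theta(0) = 0$, and ridge penalty $\tfrac{\lambda}{2}\|\theta\|^2$; the symbols are introduced here for illustration.

```latex
\begin{align}
  \theta_{\mathrm{GF}}(t) &= \sum_i \frac{1 - e^{-s_i^2 t}}{s_i}\,(u_i^\top y)\, v_i,
  & g_{\mathrm{GF}}(s^2) &= 1 - e^{-s^2 t}, \\
  \theta_{\mathrm{ridge}}(\lambda) &= \sum_i \frac{s_i}{s_i^2 + \lambda}\,(u_i^\top y)\, v_i,
  & g_{\mathrm{ridge}}(s^2) &= \frac{s^2}{s^2 + \lambda}.
\end{align}
```

Both filters tend to 1 for $s^2 \gg \max(\lambda, 1/t)$ and to 0 for $s^2 \ll \min(\lambda, 1/t)$, so stopping at time $t$ behaves qualitatively like ridge with $\lambda \sim 1/t$, and the two agree exactly in the limits $t \to \infty$ and $\lambda \to 0$.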

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Regularization choices may need to be regime-specific in deep learning to avoid unintended distortions to inductive bias.
  • The proposed link between arc ridge and early stopping implies that practical training heuristics could approximate the effect of the canonical regularizer without explicit geometric computation.
  • The Riemannian-geometry approach could be tested on other optimization trajectories or network architectures to see whether similar generalizations of classical regularization emerge.

Load-bearing premise

A single regime-agnostic function-space energy functional exists whose minimizer under gradient flow recovers ridge in the kernel regime and yields a geometrically meaningful generalization in the feature-learning regime.

What would settle it

A controlled simulation of gradient flow on a simple wide network in the feature-learning regime, comparing the limiting solution reached with infinitesimal ridge regularization against the unregularized case to check for predicted distortion in learned features or function values.
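
A minimal version of such a check can be sketched on a two-parameter toy model f(a, b) = a·b fitted to a single target, where the "feature" seen by each parameter is the other parameter and therefore moves during training. Everything here is an assumption made for illustration rather than the paper's experiment: the model, the initialization `THETA0`, the target, and the reading of "vanishing ridge" as the limit of the anchored-ridge minimizer, which for vanishing penalty is the interpolating point closest to the initialization.

```python
import numpy as np

Y_TARGET = 1.0
THETA0 = np.array([1.5, 0.2])   # hypothetical, deliberately unbalanced initialization

def gradient_flow_limit(theta0, y, lr=1e-3, steps=20_000):
    """Small-step gradient descent on 0.5 * (a*b - y)**2, standing in for gradient flow."""
    a, b = theta0
    for _ in range(steps):
        r = a * b - y
        a, b = a - lr * r * b, b - lr * r * a   # simultaneous update of both parameters
    return np.array([a, b])

def vanishing_ridge_limit(theta0, y):
    """Interpolating point (a*b = y, branch a > 0) closest to theta0, found by grid search."""
    a = np.linspace(0.5, 3.0, 250_001)
    b = y / a
    dist2 = (a - theta0[0]) ** 2 + (b - theta0[1]) ** 2
    i = int(np.argmin(dist2))
    return np.array([a[i], b[i]])

theta_gf = gradient_flow_limit(THETA0, Y_TARGET)
theta_ridge = vanishing_ridge_limit(THETA0, Y_TARGET)

print("gradient-flow limit:   ", theta_gf)
print("vanishing-ridge limit: ", theta_ridge)
print("parameter gap:         ", np.linalg.norm(theta_gf - theta_ridge))
# Gradient flow on this model conserves a^2 - b^2, which pins down its limit point.
print("a^2 - b^2 at init / end:", THETA0[0]**2 - THETA0[1]**2, theta_gf[0]**2 - theta_gf[1]**2)
```

Both points fit the target exactly, yet they are different parameter vectors: gradient flow is constrained by the conserved quantity a² − b², while the ridge limit is determined by Euclidean distance to the initialization. In a model with inputs, distinct interpolating parameters would generally give different predictions away from the training data.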

Figures

Figures reproduced from arXiv: 2605.18180 by George Whittle, Juliusz Ziomek, Maike A. Osborne, Natalia Ares, Pranav Vaidhyanathan.

Figure 1. Left: feature-learning regime geometry. The flow manifold M_flow(θ0) (blue surface) is a curved n-dimensional submanifold near θ⋆; Θ⋆ (solid black curve) is the interpolating set, that is, global minima. The gradient flow trajectory (solid blue curve) lies on M_flow(θ0) and converges to θ⋆. The geodesic distance d_flow(θ0, θ⋆) on M_flow(θ0) (dashed blue curve) underpins the canonical regulariser. The…
Figure 2. Gradient flow trajectories for the minimal overparametrised non-linear model
Figure 3. Test set MSE on UTKFace (left) and Yelp Review (right) for standard, anchored, and arc…
Figure 4. Empirical verification of Assumption 2.1 along gradient-flow trajectories of a 2-hidden…
Original abstract

Wide neural networks in the feature-learning regime drive modern deep learning, and yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well-studied in kernel regime networks -- of all the infinite global minima, gradient flow selects exactly the vanishing ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with a particular damage done to pretrained networks where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime, and crucially generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that ridge regularization biases gradient flow in wide neural networks in the feature-learning regime, even in the infinitesimal limit of vanishing regularization, in contrast to the kernel regime where it selects the vanishing ridge solution. The authors axiomatize a regime-agnostic function-space energy functional that uniquely recovers ridge in the kernel regime and, via the Riemannian geometry of feature-learning networks, generalizes to geodesic ridge; they prove the corresponding canonical prior is a Riemannian Gibbs Process. As a practical surrogate they propose arc ridge, which relates to early stopping, and demonstrate consequences empirically on image processing and NLP transfer-learning tasks.

Significance. If the central claims hold after verification of the derivations, the work would be significant for clarifying implicit regularization differences between kernel and feature-learning regimes in wide networks, generalizing the NN-GP correspondence to feature-learning settings, and providing a geometrically motivated regularization approach with practical implications for pretrained models. The introduction of a Riemannian Gibbs Process prior and the arc ridge surrogate represent potentially useful conceptual and algorithmic contributions if rigorously established.

major comments (2)
  1. [Abstract] The assertions that ridge biases gradient flow even at infinitesimal strength in the feature-learning regime and that geodesic ridge is derived from the Riemannian geometry are presented without the full derivations, error analysis, or explicit assumptions on width limits and gradient flow. This leaves the central bias result and the generalization dependent on unverified steps.
  2. [Axiomatisation and derivation of geodesic ridge] Section on axiomatisation and lift to geodesic ridge: The premise that a single regime-agnostic function-space energy E exists such that its gradient flow recovers the vanishing-ridge solution in the kernel regime and the same E lifted via the Riemannian structure yields geodesic ridge is load-bearing for uniqueness and generalization. It is not shown that the axioms exclude other functionals or that the metric on function space is compatible with the actual parameter-space dynamics when features evolve (i.e., when the tangent space changes with the weights).
minor comments (1)
  1. The empirical demonstrations are mentioned but lack sufficient detail on experimental setup, hyperparameters, and controls to allow independent verification of the claimed consequences for pretrained networks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and insightful comments, which have helped us strengthen the presentation of our results. We address each major comment below. The full derivations, assumptions, and proofs are contained in the main text and appendices; we have revised the manuscript to make key elements more explicit without altering the core claims.

Point-by-point responses
  1. Referee: [Abstract] The assertions that ridge biases gradient flow even at infinitesimal strength in the feature-learning regime and that geodesic ridge is derived from the Riemannian geometry are presented without the full derivations, error analysis, or explicit assumptions on width limits and gradient flow. This leaves the central bias result and the generalization dependent on unverified steps.

    Authors: We agree that the abstract, being concise, does not include all technical details. The bias result for infinitesimal ridge in the feature-learning regime is derived in Section 3 under the infinite-width limit with gradient flow (Assumptions 1 and 2), and the Riemannian derivation of geodesic ridge appears in Section 4 with the associated error bounds in Appendix C. We have revised the abstract to explicitly state the infinite-width and infinitesimal-regularization assumptions. The error analysis for the approximation of geodesic ridge by arc ridge is now highlighted in the main text as well.

    Revision: yes

  2. Referee: [Axiomatisation and derivation of geodesic ridge] Section on axiomatisation and lift to geodesic ridge: The premise that a single regime-agnostic function-space energy E exists such that its gradient flow recovers the vanishing-ridge solution in the kernel regime and the same E lifted via the Riemannian structure yields geodesic ridge is load-bearing for uniqueness and generalization. It is not shown that the axioms exclude other functionals or that the metric on function space is compatible with the actual parameter-space dynamics when features evolve (i.e., when the tangent space changes with the weights).

    Authors: The axioms in Section 2 are chosen to be the minimal set that (i) recovers the known vanishing-ridge solution under kernel-regime gradient flow (Theorem 1) and (ii) is invariant to reparameterization. Uniqueness under these axioms is established in Appendix B by showing that any other functional satisfying the same properties must coincide with E. For metric compatibility, the Riemannian metric is defined via the pullback of the parameter-space inner product at each point; in the infinite-width limit the tangent space evolves continuously but the induced function-space geometry remains well-defined because the feature maps converge to a deterministic limit (Proposition 3). We have added a clarifying paragraph in Section 4.2 addressing the time-varying tangent space explicitly.

    Revision: partial
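
The referee's concern about a tangent space that changes with the weights can be illustrated on a toy model. The sketch below assumes a two-parameter model f(x; a, w) = a·tanh(w·x) and uses the empirical tangent kernel K = J Jᵀ, the Gram matrix on function values induced by the Euclidean inner product on parameters, as a simple proxy for the induced geometry. The model, the inputs, and the two parameter settings are hypothetical, and this construction may not match the paper's metric exactly.

```python
import numpy as np

def jacobian(theta, x):
    """Jacobian of f(x; a, w) = a * tanh(w * x) with respect to (a, w); shape (n, 2)."""
    a, w = theta
    h = np.tanh(w * x)
    return np.stack([h, a * x * (1.0 - h ** 2)], axis=1)

def tangent_kernel(theta, x):
    """Empirical tangent kernel J J^T on the inputs x, evaluated at parameters theta."""
    J = jacobian(theta, x)
    return J @ J.T

x = np.array([-1.0, 0.5, 2.0])          # hypothetical inputs
theta_init = np.array([1.0, 0.3])       # hypothetical weights at initialization
theta_later = np.array([1.4, 1.1])      # hypothetical weights after some training

K_init = tangent_kernel(theta_init, x)
K_later = tangent_kernel(theta_later, x)

print("kernel at init:\n", K_init)
print("kernel later:\n", K_later)
print("relative change:", np.linalg.norm(K_later - K_init) / np.linalg.norm(K_init))
```

In a linearized (kernel-regime) model the kernel would be frozen at its initial value, so the relative change would be zero; a nonzero change is the toy analogue of features evolving during training.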

Circularity Check

0 steps flagged

Axiomatization and Riemannian lift constitute an independent first-principles framework

full rationale

The paper introduces an axiomatization of a regime-agnostic function-space energy whose gradient flow recovers the known vanishing-ridge solution in the kernel regime and whose lift via the Riemannian geometry of feature-learning networks yields geodesic ridge. This construction is presented as a mathematical definition and derivation rather than a fit to data or a reduction to prior self-citations. No equations in the abstract or described derivation chain equate the output geodesic ridge or Riemannian Gibbs Process directly to the input axioms by construction; the axioms are chosen to match the kernel-regime fact and then extended. The reported bias of ridge in the feature-learning regime is claimed to follow from the geometry after the framework is in place, but the framework itself does not presuppose the bias result. The practical arc-ridge surrogate is introduced separately as an approximation. The derivation is therefore self-contained against external benchmarks (kernel-regime ridge and Riemannian geometry) and receives a score of 0.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The central claims rest on mathematical assumptions about gradient flow selecting minimizers of a function-space energy and on the applicability of Riemannian geometry to the network parameter space; new entities are introduced to generalize existing kernel concepts without independent empirical handles.

axioms (2)
  • Domain assumption: Gradient flow on the network parameters induces a well-defined regularizer in the space of representable functions.
    Invoked to compare kernel and feature-learning regimes and to define the canonical regulariser.
  • Domain assumption: The parameter space of wide networks admits a Riemannian metric under which geodesics can be defined and used to generalize ridge.
    Used when deriving geodesic ridge from the axiomatized energy.
invented entities (2)
  • Riemannian Gibbs Process (no independent evidence)
    Purpose: Canonical function-space prior that generalizes the Gaussian Process to the feature-learning regime.
    Proved to be the prior implied by the canonical regulariser; no external falsifiable prediction supplied.
  • Geodesic ridge (no independent evidence)
    Purpose: Generalization of ridge regularization obtained by lifting the canonical energy via Riemannian geometry.
    Derived as the unique minimizer under the new framework; no independent verification outside the derivation.

pith-pipeline@v0.9.0 · 5822 in / 1677 out tokens · 54487 ms · 2026-05-20T00:28:08.904571+00:00 · methodology


