Canonical Regularisation of Wide Feature-Learning Neural Networks
Pith reviewed 2026-05-20 00:28 UTC · model grok-4.3
The pith
Ridge regularization biases gradient flow in feature-learning neural networks even as its strength vanishes, and a regime-agnostic function-space energy generalizes it to geodesic ridge.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors prove that ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. They resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime and, crucially, generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, they derive geodesic ridge from their framework. Correspondingly, they prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, they propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge.
What carries the argument
The canonical regulariser, axiomatised as a regime-agnostic function-space energy and lift and extended via the Riemannian geometry of feature-learning networks to produce geodesic ridge.
If this is right
- Ridge regularization distorts the inductive bias of feature-learning networks over the course of training.
- Pretrained networks experience particular damage from this distortion when the implicit prior is informative.
- The canonical function-space prior corresponds to a Riemannian Gibbs Process.
- Arc ridge serves as a minimax-robust and scalable surrogate to geodesic ridge.
- A deep relationship exists between early stopping and canonical regularisation across learning regimes.
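The early-stopping link in the last item has a well-known analogue in plain linear least squares, which the sketch below illustrates. It is not the paper's arc ridge construction. It assumes gradient flow from a zero start on 0.5 * ||X theta - y||^2 and ridge with strength lam = 1 / t, and compares how much of each singular direction the two methods fit. The data, the stopping time `t`, and all function names are hypothetical.

```python
import numpy as np

def gradient_flow_filter(x):
    """Fraction of a singular direction fitted by gradient flow, with x = t * s**2."""
    return 1.0 - np.exp(-x)

def ridge_filter(x):
    """Fraction fitted by ridge with lam = 1 / t, again with x = t * s**2."""
    return x / (1.0 + x)

# Illustrative data (hypothetical, not taken from the paper).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 30))
y = rng.normal(size=6)
U, s, _ = np.linalg.svd(X, full_matrices=False)

t = 0.05                      # stopping time; the matched ridge strength is 1 / t
x = t * s**2
fit_early_stop = U @ (gradient_flow_filter(x) * (U.T @ y))
fit_ridge = U @ (ridge_filter(x) * (U.T @ y))

# Compare the two spectral filters over a wide range of t * s**2.
grid = np.logspace(-4, 4, 2001)
filter_gap = gradient_flow_filter(grid) - ridge_filter(grid)
print("largest filter gap on the grid:", filter_gap.max())
print("relative gap in fitted values:",
      np.linalg.norm(fit_early_stop - fit_ridge) / np.linalg.norm(y))
```

In this linear setting the two filters stay close for every singular value, which is why stopping time behaves like an inverse ridge strength. Whether and how that carries over to the feature-learning regime is the paper's claim, and this sketch does not test it.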
Where Pith is reading between the lines
- Regularization choices may need to be regime-specific in deep learning to avoid unintended distortions to inductive bias.
- The proposed link between arc ridge and early stopping implies that practical training heuristics could approximate the effect of the canonical regularizer without explicit geometric computation.
- The Riemannian-geometry approach could be tested on other optimization trajectories or network architectures to see whether similar generalizations of classical regularization emerge.
Load-bearing premise
A single regime-agnostic function-space energy functional exists whose minimizer under gradient flow recovers ridge in the kernel regime and yields a geometrically meaningful generalization in the feature-learning regime.
What would settle it
A controlled simulation of gradient flow on a simple wide network in the feature-learning regime, comparing the limiting solution reached with infinitesimal ridge regularization against the unregularized case to check for predicted distortion in learned features or function values.
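A minimal sketch of such a simulation follows, under assumptions that are not taken from the paper: a two-layer tanh network with mean-field 1/m output scaling, full-batch gradient descent as a stand-in for gradient flow, a toy 1-D regression task, and a ridge penalty of lam/(2m) on all weights. Every name and hyperparameter is hypothetical, and a finite-width, finite-time run like this can only suggest a trend, not settle the claim.

```python
import numpy as np

def predict(W, a, X):
    """Two-layer tanh network with mean-field (1/m) output scaling."""
    return np.tanh(X @ W.T) @ a / a.size

def train(W, a, X, y, lam, steps=2000, lr=0.05):
    """Gradient descent on 0.5 * mean(resid**2) + lam / (2 m) * (|W|^2 + |a|^2).

    Gradients are multiplied by the width m (a mean-field convention assumed
    here) so that individual neurons move by O(1) during training.
    """
    W, a = W.copy(), a.copy()
    m, n = a.size, len(y)
    for _ in range(steps):
        H = np.tanh(X @ W.T)                       # (n, m) hidden activations
        resid = H @ a / m - y                      # (n,) prediction errors
        grad_a = H.T @ resid / n                   # m * d(data loss)/da
        grad_W = ((resid[:, None] * (1.0 - H**2)) * a).T @ X / n  # m * d(data loss)/dW
        a -= lr * (grad_a + lam * a)
        W -= lr * (grad_W + lam * W)
    return W, a

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 8)
X_train = np.stack([x_train, np.ones_like(x_train)], axis=1)   # input plus bias feature
y_train = np.sin(3.0 * x_train)
x_test = np.linspace(-1.5, 1.5, 50)
X_test = np.stack([x_test, np.ones_like(x_test)], axis=1)

m = 256
W0, a0 = rng.normal(size=(m, 2)), rng.normal(size=m)

# Reference run without ridge, from the same initialisation.
W_ref, a_ref = train(W0, a0, X_train, y_train, lam=0.0)
f_ref = predict(W_ref, a_ref, X_test)

# Shrink the ridge strength and track the distance to the unregularised predictor.
gaps = {}
for lam in (1e-2, 1e-3, 1e-4):
    W_lam, a_lam = train(W0, a0, X_train, y_train, lam=lam)
    gaps[lam] = float(np.linalg.norm(predict(W_lam, a_lam, X_test) - f_ref))
    print(f"lam={lam:g}  test-set gap to unregularised run: {gaps[lam]:.4e}")
```

A faithful test would also extend the training horizon as `lam` shrinks, since the vanishing-regularisation limit concerns the regularised minimiser rather than a fixed number of steps, and it would repeat the comparison across widths and seeds.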
Original abstract
Wide neural networks in the feature-learning regime drive modern deep learning, and yet they remain far less studied than their kernel-regime counterparts. We consider a critical yet under-explored difference between these two regimes: the regulariser and prior implied by gradient flow training. This canonical regularisation property is well-studied in kernel regime networks -- of all the infinite global minima, gradient flow selects exactly the vanishing ridge solution -- and underpins the celebrated NN-GP correspondence, precisely allowing the modelling of noise during training. However, we prove ridge regularisation biases gradient flow in feature-learning regime networks, even in the infinitesimal limit of vanishing regularisation. Over training, ridge distorts the inductive bias of the network, with a particular damage done to pretrained networks where the implicit prior is informative. We resolve this by axiomatising the canonical regulariser as a regime-agnostic function-space energy and lift, which uniquely identifies ridge in the kernel regime, and crucially generalises to the feature-learning regime. By studying the Riemannian geometry of feature-learning networks, we derive geodesic ridge from our framework, generalising ridge to the feature-learning regime. Correspondingly, we prove the canonical function-space prior is a Riemannian Gibbs Process, generalising the more familiar Gaussian Process. As a practical contribution, we propose arc ridge as a minimax-robust, scalable surrogate to geodesic ridge, revealing a deep relationship between early stopping and canonical regularisation across learning regimes. Finally, we demonstrate the consequences of our theory empirically on both image processing and NLP transfer-learning problems.
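The kernel-regime fact the abstract starts from can be checked on an overparameterised linear model, which is used here as a simplified stand-in for a kernel-regime network. The sketch below assumes squared loss and small-step gradient descent as a proxy for gradient flow; the data and function names are hypothetical. It shows that gradient flow from a zero start lands on the vanishing-ridge (minimum-norm) interpolant, and that from a non-zero start it lands on the vanishing limit of ridge centred at the start point rather than at zero.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 20                          # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def gradient_descent(theta0, steps=5000):
    """Small-step gradient descent on 0.5 * ||X theta - y||^2 (proxy for gradient flow)."""
    lr = 1.0 / np.linalg.norm(X, 2) ** 2      # 1 / largest squared singular value
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * X.T @ (X @ theta - y)
    return theta

def ridge(lam, center):
    """argmin 0.5 * ||X theta - y||^2 + 0.5 * lam * ||theta - center||^2 (dual form)."""
    return center + X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), y - X @ center)

zero = np.zeros(d)
theta_gf_zero = gradient_descent(zero)
theta_min_norm = np.linalg.pinv(X) @ y

theta_init = rng.normal(size=d)       # stands in for an informative starting point
theta_gf_init = gradient_descent(theta_init)

print("zero start matches vanishing ridge:",
      np.allclose(theta_gf_zero, ridge(1e-8, zero), atol=1e-6))
print("zero start matches min-norm interpolant:",
      np.allclose(theta_gf_zero, theta_min_norm, atol=1e-6))
print("non-zero start matches ridge centred at the start:",
      np.allclose(theta_gf_init, ridge(1e-8, theta_init), atol=1e-6))
print("non-zero start matches ridge centred at zero:",
      np.allclose(theta_gf_init, ridge(1e-8, zero), atol=1e-6))
```

The last comparison fails because the component of the starting point that lies in the null space of `X` is never touched by the gradient. This is only the linear picture; the paper's claim concerns what replaces it once features move during training.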
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that ridge regularization biases gradient flow in wide neural networks in the feature-learning regime, even in the infinitesimal limit of vanishing regularization, in contrast to the kernel regime where it selects the vanishing ridge solution. The authors axiomatize a regime-agnostic function-space energy functional that uniquely recovers ridge in the kernel regime and, via the Riemannian geometry of feature-learning networks, generalizes to geodesic ridge; they prove the corresponding canonical prior is a Riemannian Gibbs Process. As a practical surrogate they propose arc ridge, which relates to early stopping, and demonstrate consequences empirically on image processing and NLP transfer-learning tasks.
Significance. If the central claims hold after verification of the derivations, the work would be significant for clarifying implicit regularization differences between kernel and feature-learning regimes in wide networks, generalizing the NN-GP correspondence to feature-learning settings, and providing a geometrically motivated regularization approach with practical implications for pretrained models. The introduction of a Riemannian Gibbs Process prior and the arc ridge surrogate represent potentially useful conceptual and algorithmic contributions if rigorously established.
major comments (2)
- [Abstract] Abstract: The assertions that ridge biases gradient flow even at infinitesimal strength in the feature-learning regime and that geodesic ridge is derived from the Riemannian geometry are presented without the full derivations, error analysis, or explicit assumptions on width limits and gradient flow. This leaves the central bias result and the generalization dependent on unverified steps.
- [Axiomatisation and derivation of geodesic ridge] Section on axiomatisation and lift to geodesic ridge: The premise that a single regime-agnostic function-space energy E exists such that its gradient flow recovers the vanishing-ridge solution in the kernel regime and the same E lifted via the Riemannian structure yields geodesic ridge is load-bearing for uniqueness and generalization. It is not shown that the axioms exclude other functionals or that the metric on function space is compatible with the actual parameter-space dynamics when features evolve (i.e., when the tangent space changes with the weights).
minor comments (1)
- The empirical demonstrations are mentioned but lack sufficient detail on experimental setup, hyperparameters, and controls to allow independent verification of the claimed consequences for pretrained networks.
Simulated Author's Rebuttal
We thank the referee for their careful reading and insightful comments, which have helped us strengthen the presentation of our results. We address each major comment below. The full derivations, assumptions, and proofs are contained in the main text and appendices; we have revised the manuscript to make key elements more explicit without altering the core claims.
- Referee: [Abstract] Abstract: The assertions that ridge biases gradient flow even at infinitesimal strength in the feature-learning regime and that geodesic ridge is derived from the Riemannian geometry are presented without the full derivations, error analysis, or explicit assumptions on width limits and gradient flow. This leaves the central bias result and the generalization dependent on unverified steps.
  Authors: We agree that the abstract, being concise, does not include all technical details. The bias result for infinitesimal ridge in the feature-learning regime is derived in Section 3 under the infinite-width limit with gradient flow (Assumptions 1 and 2), and the Riemannian derivation of geodesic ridge appears in Section 4 with the associated error bounds in Appendix C. We have revised the abstract to explicitly state the infinite-width and infinitesimal-regularization assumptions. The error analysis for the approximation of geodesic ridge by arc ridge is now highlighted in the main text as well.
  Revision: yes
- Referee: [Axiomatisation and derivation of geodesic ridge] Section on axiomatisation and lift to geodesic ridge: The premise that a single regime-agnostic function-space energy E exists such that its gradient flow recovers the vanishing-ridge solution in the kernel regime and the same E lifted via the Riemannian structure yields geodesic ridge is load-bearing for uniqueness and generalization. It is not shown that the axioms exclude other functionals or that the metric on function space is compatible with the actual parameter-space dynamics when features evolve (i.e., when the tangent space changes with the weights).
  Authors: The axioms in Section 2 are chosen to be the minimal set that (i) recovers the known vanishing-ridge solution under kernel-regime gradient flow (Theorem 1) and (ii) is invariant to reparameterization. Uniqueness under these axioms is established in Appendix B by showing that any other functional satisfying the same properties must coincide with E. For metric compatibility, the Riemannian metric is defined via the pullback of the parameter-space inner product at each point; in the infinite-width limit the tangent space evolves continuously but the induced function-space geometry remains well-defined because the feature maps converge to a deterministic limit (Proposition 3). We have added a clarifying paragraph in Section 4.2 addressing the time-varying tangent space explicitly.
  Revision: partial
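The exchange above turns on a tangent space that changes with the weights. The paper's metric is described here only in words, so the sketch below uses a standard, closely related construction as an assumption: for a small two-layer tanh network it builds the Jacobian J of the outputs with respect to the parameters, the parameter-space Gram matrix G = JᵀJ, and the function-space Gram matrix K = JJᵀ (the empirical tangent kernel), then shows that K moves when the parameters move. The network, sizes, and names are hypothetical and may differ from the paper's definitions.

```python
import numpy as np

def predict(theta, X, m):
    """Two-layer tanh network; theta packs W (m x d) followed by a (m,)."""
    d = X.shape[1]
    W, a = theta[: m * d].reshape(m, d), theta[m * d:]
    return np.tanh(X @ W.T) @ a / m

def jacobian(theta, X, m):
    """Analytic Jacobian of the n outputs with respect to all parameters, shape (n, p)."""
    d = X.shape[1]
    W, a = theta[: m * d].reshape(m, d), theta[m * d:]
    H = np.tanh(X @ W.T)                                     # (n, m)
    dW = ((1.0 - H**2) * a)[:, :, None] * X[:, None, :] / m  # (n, m, d)
    da = H / m                                               # (n, m)
    return np.concatenate([dW.reshape(len(X), -1), da], axis=1)

rng = np.random.default_rng(0)
n, d, m = 4, 3, 8
X = rng.normal(size=(n, d))
theta = rng.normal(size=m * d + m)

J = jacobian(theta, X, m)
G = J.T @ J        # Gram matrix on parameter space (p x p), rank at most n
K = J @ J.T        # Gram matrix on function values (n x n), the empirical tangent kernel

# G and K share their non-zero eigenvalues.
top_G = np.linalg.eigvalsh(G)[-n:]
eig_K = np.linalg.eigvalsh(K)
print("shared spectrum:", np.allclose(top_G, eig_K, atol=1e-10))

# Moving the parameters moves the Jacobian, and with it the induced geometry.
theta_moved = theta + 0.5 * rng.normal(size=theta.size)
J_moved = jacobian(theta_moved, X, m)
K_moved = J_moved @ J_moved.T
print("relative change in K:", np.linalg.norm(K_moved - K) / np.linalg.norm(K))
```

In a linear model the Jacobian is constant and the relative change is zero, which is the sense in which the kernel regime has a fixed geometry. The referee's concern is about what happens when, as here, it is not fixed.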
Circularity Check
Axiomatization and Riemannian lift constitute an independent first-principles framework
The paper introduces an axiomatization of a regime-agnostic function-space energy whose gradient flow recovers the known vanishing-ridge solution in the kernel regime and whose lift via the Riemannian geometry of feature-learning networks yields geodesic ridge. This construction is presented as a mathematical definition and derivation rather than a fit to data or a reduction to prior self-citations. No equations in the abstract or described derivation chain equate the output geodesic ridge or Riemannian Gibbs Process directly to the input axioms by construction; the axioms are chosen to match the kernel-regime fact and then extended. The reported bias of ridge in the feature-learning regime is claimed to follow from the geometry after the framework is in place, but the framework itself does not presuppose the bias result. The practical arc-ridge surrogate is introduced separately as an approximation. The derivation is therefore self-contained against external benchmarks (kernel-regime ridge and Riemannian geometry) and receives a score of 0.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Gradient flow on the network parameters induces a well-defined regularizer in the space of representable functions.
- Domain assumption: The parameter space of wide networks admits a Riemannian metric under which geodesics can be defined and used to generalize ridge.
invented entities (2)
- Riemannian Gibbs Process: no independent evidence
- Geodesic ridge: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: We axiomatise the canonical regulariser as a regime-agnostic function-space energy and lift... uniquely identifies ridge in the kernel regime... derive geodesic ridge... Riemannian Gibbs Process
- IndisputableMonolith/Foundation/BranchSelection.lean · branch_selection · tag: unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: Theorem C.4 (Output-space energy uniqueness)... G⁻¹-isometry invariance and orthogonal additivity
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.