Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods
Pith reviewed 2026-06-27 23:02 UTC · model grok-4.3
The pith
Overparameterized DNNs trained by gradient descent or SGD achieve the first minimax-optimal excess risk rates matching kernels when width grows polynomially with sample size.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By establishing that gradient-based methods on over-parameterized DNNs with smooth activations fully inherit the learning dynamics of their kernel counterparts, the authors derive the first known minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with the sample size, showing that such DNNs achieve generalization performance comparable to kernel-based methods.
What carries the argument
The equivalence between the learning dynamics of DNNs with smooth activations trained via gradient methods and those of kernel methods, which transfers optimality from kernels to DNNs.
If this is right
- DNNs with polynomially growing width attain the same excess-risk upper bounds as optimal kernel methods.
- Both gradient descent and stochastic gradient descent reach these rates.
- The results hold for regression tasks under standard smoothness assumptions on the activation.
- The polynomial width scaling is sufficient to close the gap between neural networks and kernels.
Where Pith is reading between the lines
- If the dynamics equivalence extends to deeper or non-smooth activations, the same rates could apply to modern architectures beyond the shallow case.
- Empirical checks on whether observed excess risk saturates the derived bounds at polynomial widths would test the practical relevance of the rates.
- The connection suggests that other gradient-based algorithms might inherit kernel optimality once width is large enough.
Load-bearing premise
The learning dynamics of a DNN with smooth activation functions trained via gradient-based methods are equivalent to those of kernel methods.
What would settle it
An explicit construction of a smooth activation, a regression problem, and a polynomially wide network where GD or SGD produces excess risk strictly larger than the known kernel minimax rate would falsify the claim.
read the original abstract
Understanding the generalization performance of over-parameterized neural networks has become a central topic in deep learning theory. While recent advances, particularly works under the Neural Tangent Kernel (NTK) regime, have shed light on the behavior of shallow architectures, the statistical generalization properties of deep neural networks (DNNs), especially in regression tasks, remain far less understood. In this paper, we make significant progress toward closing this gap by providing a comprehensive generalization analysis of DNNs trained using gradient-based methods. First, we establish, for the first time, a crucial connection between the learning dynamics of a DNN with smooth activation functions trained via gradient-based methods and those of kernel methods, showing that gradient-based methods on over-parameterized DNNs can fully inherit the favorable learning dynamics of their kernel counterparts. Building on this connection and the well-established optimality of kernel methods, we derive the first known minimax-optimal rates for the excess population risk of both gradient descent (GD) and stochastic gradient descent (SGD), under the assumption that network width scales polynomially with the sample size. Our results demonstrate that, with sufficient width, DNNs trained by GD or SGD can achieve generalization performance comparable to kernel-based methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper establishes a connection between the learning dynamics of over-parameterized DNNs with smooth activations trained by gradient-based methods and those of kernel methods. It then uses this connection, together with known optimality results for kernels, to derive the first minimax-optimal rates for the excess population risk of both GD and SGD, under the assumption that network width scales polynomially with sample size n.
Significance. If the quantitative error bounds in the dynamics connection hold, the result would be significant: it would supply the first explicit minimax rates for gradient-trained DNNs and demonstrate that polynomial-width networks can match the generalization performance of kernel methods. The work directly addresses a central open question in the NTK and mean-field regimes.
major comments (1)
- [connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.
Simulated Author's Rebuttal
Thank you for the careful review and for identifying the need for explicit control on the approximation error. We address the single major comment below.
read point-by-point responses
-
Referee: [connection result between DNN dynamics and kernel methods (foundational step)] The transfer of minimax optimality from kernels to DNNs rests on the claim that the DNN trajectory under GD/SGD is sufficiently close to the kernel dynamics. The manuscript must supply an explicit quantitative bound showing that the approximation error is o(n^{-r}) (where n^{-r} is the target excess-risk rate) under the stated polynomial-width assumption; the abstract and the description of the foundational step locate the connection as the load-bearing step, but without this o(rate) control the optimality claim does not follow.
Authors: We agree that an explicit o(n^{-r}) guarantee is required for the optimality transfer to be fully rigorous. Our existing trajectory approximation results already yield polynomial decay in 1/n (under the stated width scaling), which is faster than the target kernel rates r in the regimes we consider. In the revision we will insert a short corollary (or remark) that directly combines the approximation theorem with the specific minimax rates from the kernel literature, confirming the error is o(n^{-r}). This addition clarifies the argument without altering any proofs or assumptions. revision: yes
Circularity Check
No circularity; derivation builds on a claimed novel connection plus external kernel results.
full rationale
The paper states it first establishes (within the manuscript) a connection between DNN gradient dynamics for smooth activations and kernel methods, then transfers minimax optimality from the 'well-established' kernel literature. No load-bearing step reduces by definition, by fitting, or by self-citation chain to the target rates. The connection is presented as newly derived rather than presupposed, and kernel optimality is invoked as independent prior knowledge. No equations or sections exhibit self-definitional, fitted-input, or ansatz-smuggling patterns. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption network width scales polynomially with the sample size
Reference graph
Works this paper leans on
-
[1]
A convergence theory for deep learning via over-parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over-parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019
2019
-
[2]
Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks
Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019
2019
-
[3]
Spectrally-normalized margin bounds for neural networks
Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. Advances in Neural Information Processing Systems, 30, 2017
2017
-
[4]
Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024
Alina Braun, Michael Kohler, Sophie Langer, and Harro Walk. Convergence rates for shallow neural networks learned by gradient descent.Bernoulli, 30(1):475–502, 2024
2024
-
[5]
Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024
Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024
arXiv 2024
-
[6]
Generalization bounds of stochastic gradient descent for wide and deep neural networks
Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019
2019
-
[7]
Generalization error bounds of gradient descent for learning over-parameterized deep relu networks
Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349–3356, 2020
2020
-
[8]
Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007
2007
-
[9]
Learning with sgd and random features
Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018
2018
-
[10]
How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021
Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representation, 2021
2021
-
[11]
Cambridge University Press, 2007
Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge University Press, 2007
2007
-
[12]
Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016
Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016
2016
-
[13]
Selina Drews and Michael Kohler. Analysis of the expected l 2 error of an over-parametrized deep neural network estimate learned by gradient descent without regularization.arXiv preprint arXiv:2311.14609, 2023
arXiv 2023
-
[14]
Selina Drews and Michael Kohler. On the universal consistency of an over-parametrized deep neural network estimate learned by gradient descent.Annals of the Institute of Statistical Mathematics, 76(3):361–391, 2024. 10
2024
-
[15]
Gradient descent finds global minima of deep neural networks
Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019
2019
-
[16]
Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023
Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023
2023
-
[17]
Size-independent sample complexity of neural networks
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018
2018
-
[18]
Springer Science & Business Media, 2006
L´aszl´o Gy¨orfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006
2006
-
[19]
Deep residual learning for image recognition
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016
2016
-
[20]
Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018
Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018
2018
-
[21]
Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks
Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small test error with shallow relu networks. InInternational Conference on Learning Representations, 2020
2020
-
[22]
Michael Kohler. On the rate of convergence of an over-parametrized deep neural network regression estimate learned by gradient descent.arXiv preprint arXiv:2504.03405, 2025
arXiv 2025
-
[23]
Ilja Kuzborskij and Csaba Szepesv´ari. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022
arXiv 2022
-
[24]
Stability and generalization analysis of gradient methods for shallow neural networks
Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022
2022
-
[25]
Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026
Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026
2026
-
[26]
Optimal rates for generalization of gradient descent for deep relu classification
Yuanfan Li, Yunwen Lei, Zheng-Chu Guo, and Yiming Ying. Optimal rates for generalization of gradient descent for deep relu classification. InAdvances in Neural Information Processing Systems, volume 38, pages 28623–28663, 2025
2025
-
[27]
Learning overparameterized neural networks via stochastic gradient descent on structured data
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018
2018
-
[28]
Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017
Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017
2017
-
[29]
On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020
Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant.Advances in Neural Information Processing Systems, 33:15954–15964, 2020
2020
-
[30]
Norm-based capacity control in neural networks
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015
2015
-
[31]
Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023
Mike Nguyen and Nicole M¨ucke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023
arXiv 2023
-
[32]
How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024
Mike Nguyen and Nicole M¨ucke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024
2024
-
[33]
Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two-layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019
arXiv 1905
-
[34]
Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime
Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021. 11
2021
-
[35]
Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022
Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022
2022
-
[36]
Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008
Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with random- ization in learning.Advances in Neural Information Processing Systems, 21, 2008
2008
-
[37]
Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions.arXiv preprint arXiv:1710.05941, 2017
Pith/arXiv arXiv 2017
-
[38]
Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel
Dominic Richards and Ilja Kuzborskij. Stability & generalisation of gradient descent for shallow neural networks without the neural tangent kernel. InAdvances in Neural Information Processing Systems, volume 34. PMLR, 2021
2021
-
[39]
Springer Science & Business Media, 2008
Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008
2008
-
[40]
Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024
Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024
2024
-
[41]
Sharper guarantees for learning neural network classifiers with gradient methods, 2025
Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods, 2025
2025
-
[42]
Cambridge university press, 2018
Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018
2018
-
[43]
Cambridge university press, 2019
Martin J Wainwright.High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge university press, 2019
2019
-
[44]
Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025
Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025
2025
-
[45]
Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026
Pith/arXiv arXiv 2026
-
[46]
Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026
Pith/arXiv arXiv 2026
-
[47]
Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94):1–83, 2024
2024
-
[48]
Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005
Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17(9):2077–2098, 2005
2077
-
[49]
Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over-parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018. 12 Appendix for “Generalization in Deep Neural Networks: Minimax Rates for Gradient Methods” A Problem Reformulation We first introduce some useful notations and express...
Pith/arXiv arXiv 2018
-
[50]
Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx
Further, their Assumption 3 is guaranteed by Assumption 2, their equation (3) holds true with κ2 =∥K∥ ∞ due to ⟨Kx, Kx′⟩HK =K(x,x ′)≤ ∥K∥ ∞. Their equation (47) is guaranteed by Lemma 18 with κ2 =∥K∥ ∞, Γ =n , δ=δ 2, ζi =K xi, Q= R X Kx ⊗K xdρx. In addition, by taking the step-size ηk =η for all k∈[T] , we know Sρνk+1 − Sρµk+1 in [28] is equivalent to our...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.