Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks
Pith reviewed 2026-06-27 23:04 UTC · model grok-4.3
The pith
Gradient descent on deep ReLU networks achieves minimax-optimal generalization rates under polynomial width scaling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.
What carries the argument
Polynomial width scaling with depth and sample size, which allows the derivation of tight generalization bounds matching kernel methods for deep ReLU networks.
Load-bearing premise
The network width scales polynomially with respect to the network depth and training sample size.
What would settle it
Finding a deep ReLU network trained by GD or SGD where the width does not scale polynomially but still achieves the claimed minimax rates, or conversely where polynomial scaling fails to yield optimal rates, would test the claim.
read the original abstract
Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks in the NTK regime. This is achieved under the assumption that network width scales polynomially with network depth and training sample size, allowing the methods to match the generalization performance of kernel methods.
Significance. If the derivations hold, the result would be significant as it extends existing NTK generalization theory from shallow to deep networks, providing a theoretical basis for why sufficiently wide deep ReLU networks can achieve optimal rates. The work highlights the role of width scaling in controlling approximation and optimization errors.
major comments (2)
- [Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.
- [Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.
minor comments (1)
- The abstract should state the precise form of the excess risk rate (e.g., dependence on n, depth, width) to make the optimality claim more concrete.
Simulated Author's Rebuttal
We thank the referee for their careful reading of the manuscript and for raising these points about the abstract. We address each major comment below.
read point-by-point responses
-
Referee: [Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.
Authors: The abstract already states the results under the explicit assumption of polynomial width scaling with depth and sample size. This scaling is necessary to ensure the NTK approximation holds with controlled error and that optimization dynamics achieve the desired rates; the paper does not claim optimality without it. The main text (Section 2 and Theorems 3.1, 4.1) derives the precise polynomial exponents needed to balance approximation, optimization, and generalization errors, showing they match kernel minimax rates under this condition. revision: no
-
Referee: [Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.
Authors: The abstract serves as a high-level summary. The full manuscript supplies the required derivations: error decompositions, explicit bounds on each term, and proof sketches appear in Sections 3 and 4, with complete proofs in the appendix. Minimax optimality is verified by matching the derived upper bounds to known kernel lower bounds once the width condition holds. revision: no
Circularity Check
No significant circularity; rates derived from NTK analysis under explicit width assumption
full rationale
The paper presents minimax-optimal excess risk rates for GD/SGD on deep ReLU networks as derived results within the NTK regime, conditioned on the polynomial width scaling assumption (abstract). This scaling is stated as a prerequisite for the analysis to hold and is not obtained by fitting or self-definition from the target rates. No quoted steps reduce a claimed prediction to an input by construction, nor do self-citations form a load-bearing chain that substitutes for independent derivation. The result is positioned as extending prior NTK work to deep nets with sufficient width, remaining self-contained against external kernel-method benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Neural tangent kernel regime holds for the deep ReLU networks under study
- standard math Standard minimax lower bounds for kernel methods apply as comparison baseline
Reference graph
Works this paper leans on
-
[1]
A convergence theory for deep learning via over- parameterization
Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019
2019
-
[2]
Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks
Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019
2019
-
[3]
Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017
Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017
2017
-
[4]
Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014
Pith/arXiv arXiv 2014
-
[5]
Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017
Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017
2017
-
[6]
Deep equals shallow for relu networks in kernel regimes
Alberto Bietti and Francis Bach. Deep equals shallow for relu networks in kernel regimes. InInternational Conference on Learning Representations, 2021
2021
-
[7]
On the inductive bias of neural tangent kernels
Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. InAdvances in Neural Information Processing Systems, volume 32, 2019. 36
2019
-
[8]
Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989
Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989
1989
-
[9]
Sgd learns over-parameterized net- works that provably generalize on linearly separable data
Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized net- works that provably generalize on linearly separable data. InInternational Conference on Learning Representa- tions, 2018
2018
-
[10]
Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024
Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024
arXiv 2024
-
[11]
Generalization bounds of stochastic gradient descent for wide and deep neural networks
Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019
2019
-
[12]
Generalization error bounds of gradient descent for learning over-parameterized deep relu networks
Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349– 3356, 2020
2020
-
[13]
Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007
Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007
2007
-
[14]
Learning with sgd and random features
Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018
2018
-
[15]
How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021
Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021
2021
-
[16]
On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002
Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002
2002
-
[17]
Cambridge Uni- versity Press, 2007
Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge Uni- versity Press, 2007
2007
-
[18]
Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016
Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016
2016
-
[19]
Gradient descent finds global minima of deep neural networks
Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019
2019
-
[20]
Gradient descent provably optimizes over- parameterized neural networks
Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over- parameterized neural networks. InInternational Conference on Learning Representations, 2018
2018
-
[21]
Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023
Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023
2023
-
[22]
Size-independent sample complexity of neural networks
Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018
2018
-
[23]
Springer Science & Business Media, 2006
László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006
2006
-
[24]
Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012
Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012
2012
-
[25]
Regularization matters: A nonparametric perspective on overparametrized neural network
Tianyang Hu, Wenjia Wang, Cong Lin, and Guang Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. InInternational Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021
2021
-
[26]
Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018
Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018
2018
-
[27]
Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks
Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks. InInternational Conference on Learning Representations, 2020
2020
-
[28]
Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017
2017
-
[29]
Ilja Kuzborskij and Csaba Szepesvári. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022
arXiv 2022
-
[30]
Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023
Jianfa Lai, Manyun Xu, Rui Chen, and Qian Lin. Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023
arXiv 2023
-
[31]
Stability and generalization analysis of gradient methods for shallow neural networks
Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022
2022
-
[32]
Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026
Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026
2026
-
[33]
Learning overparameterized neural networks via stochastic gradient descent on structured data
Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018
2018
-
[34]
Optimal learning for multi-pass stochastic gradient methods
Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. InAdvances in Neural Information Processing Systems, pages 4556–4564, 2016
2016
-
[35]
Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017
Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017
2017
-
[36]
Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017
Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017
2017
-
[37]
Norm-based capacity control in neural networks
Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015
2015
-
[38]
Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023
Mike Nguyen and Nicole Mücke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023
arXiv 2023
-
[39]
How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024
Mike Nguyen and Nicole Mücke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024
2024
-
[40]
Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime
Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021
2021
-
[41]
Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019
arXiv 1905
-
[42]
Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022
Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022
2022
-
[43]
A spectral analysis of dot-product kernels
Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. InInternational Conference on Artificial Intelligence and Statistics, pages 3394–3402. PMLR, 2021
2021
-
[44]
Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016
2016
-
[45]
Springer Science & Business Media, 2008
Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008
2008
-
[46]
On learning over-parameterized neural networks: A functional approximation per- spective
Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation per- spective. InAdvances in Neural Information Processing Systems, volume 32, 2019
2019
-
[47]
Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024
Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024
2024
-
[48]
Sharper guarantees for learning neural network classifiers with gradient methods
Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025. 38
2025
-
[49]
Cambridge university press, 2018
Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018
2018
-
[50]
Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025
Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025
2025
-
[51]
Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026
Pith/arXiv arXiv 2026
-
[52]
Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026
Pith/arXiv arXiv 2026
-
[53]
Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94): 1–83, 2024
2024
-
[54]
On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007
Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007
2007
-
[55]
Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008
Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008
2008
-
[56]
Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006
Yiming Ying and Ding-Xuan Zhou. Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006
2006
-
[57]
Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021
Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021
2021
-
[58]
Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005
Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005
2077
-
[59]
Stochastic gradient descent optimizes over- parameterized deep relu networks
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over- parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018
Pith/arXiv arXiv 2018
-
[60]
Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020
Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 39
2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.