Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

Ding-Xuan Zhou; Junyu Zhou; Puyu Wang; Yiming Ying; Yunwen Lei

arxiv: 2606.06764 · v1 · pith:VYYDH6T3new · submitted 2026-06-04 · 📊 stat.ML · cs.AI· cs.LG

Optimal Rates for Generalization of Gradient Descent Methods with Deep Neural Networks

Junyu Zhou , Puyu Wang , Yunwen Lei , Yiming Ying , Ding-Xuan Zhou This is my paper

Pith reviewed 2026-06-27 23:04 UTC · model grok-4.3

classification 📊 stat.ML cs.AIcs.LG

keywords deep neural networksReLU activationgradient descentstochastic gradient descentgeneralization analysisminimax ratesexcess riskoverparameterized networks

0 comments

The pith

Gradient descent on deep ReLU networks achieves minimax-optimal generalization rates under polynomial width scaling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes minimax-optimal rates for the excess population risk of gradient descent and stochastic gradient descent applied to deep ReLU networks. The key condition is that the network width grows polynomially in both the depth and the number of training samples. Under this condition the rates match the best known rates for kernel methods on regression tasks. Prior analyses were mostly restricted to shallow networks, so this work fills the gap for deeper architectures. The results show that overparameterized deep networks trained by gradient methods can generalize optimally without explicit kernel computations.

Core claim

We establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

What carries the argument

Polynomial width scaling with depth and sample size, which allows the derivation of tight generalization bounds matching kernel methods for deep ReLU networks.

Load-bearing premise

The network width scales polynomially with respect to the network depth and training sample size.

What would settle it

Finding a deep ReLU network trained by GD or SGD where the width does not scale polynomially but still achieves the claimed minimax rates, or conversely where polynomial scaling fails to yield optimal rates, would test the claim.

read the original abstract

Recent progress has been made in understanding the statistical generalization performance of gradient descent methods for overparameterized neural networks within the neural tangent kernel (NTK) regime. However, most of the existing work on regression problems is limited to shallow network architectures, leaving a notable gap in the theory of deep neural networks. This paper addresses this gap by presenting a comprehensive generalization analysis for deep ReLU networks trained using gradient descent (GD) and stochastic gradient descent (SGD). Specifically, we establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks, under the assumption that the network width scales polynomially with respect to the network depth and training sample size. Our results demonstrate that with sufficient width, gradient descent methods for deep ReLU networks can achieve optimal generalization rates on par with kernel methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper claims the first minimax-optimal excess risk rates for GD and SGD on deep ReLU nets, but only when width scales polynomially with depth and sample size.

read the letter

The main result is the first claimed minimax-optimal rates for excess population risk under GD and SGD for deep ReLU networks. This holds when width grows polynomially in depth and n, which puts the networks in a regime where they match known kernel rates via NTK analysis.

What is new is the extension of prior shallow-network NTK bounds to deeper architectures. The abstract positions this as closing a documented gap, and the claim is stated directly rather than buried.

The paper is clear about the width assumption and limits the optimality statement to that case. It also covers both GD and SGD without overclaiming.

The load-bearing part is the polynomial width condition. Without it the optimality does not follow, and the analysis would need different tools. The abstract supplies no derivation steps or error bounds, so the full paper must deliver the NTK linearization and optimization arguments without hidden looseness. If those steps are tight, the result stands; otherwise the rates may not be sharp.

No circularity or unfalsifiable quantities appear in the stated claim.

This is for people working in statistical learning theory on NTK generalization. A reader tracking shallow-to-deep comparisons will get the targeted extension they are looking for.

It deserves peer review because it takes on a specific open question with a concrete, conditional claim rather than a vague improvement.

Referee Report

2 major / 1 minor

Summary. The paper claims to establish the first known minimax-optimal rates of excess population risk for both GD and SGD with deep ReLU networks in the NTK regime. This is achieved under the assumption that network width scales polynomially with network depth and training sample size, allowing the methods to match the generalization performance of kernel methods.

Significance. If the derivations hold, the result would be significant as it extends existing NTK generalization theory from shallow to deep networks, providing a theoretical basis for why sufficiently wide deep ReLU networks can achieve optimal rates. The work highlights the role of width scaling in controlling approximation and optimization errors.

major comments (2)

[Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.
[Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.

minor comments (1)

The abstract should state the precise form of the excess risk rate (e.g., dependence on n, depth, width) to make the optimality claim more concrete.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for raising these points about the abstract. We address each major comment below.

read point-by-point responses

Referee: [Abstract] Abstract, paragraph on main results: the minimax-optimal rates are explicitly conditioned on the network width scaling polynomially with depth and sample size. This assumption is load-bearing for the central claim, as it is required for the NTK approximation error and optimization dynamics to yield rates on par with kernel methods; without it the optimality does not follow from the stated analysis.

Authors: The abstract already states the results under the explicit assumption of polynomial width scaling with depth and sample size. This scaling is necessary to ensure the NTK approximation holds with controlled error and that optimization dynamics achieve the desired rates; the paper does not claim optimality without it. The main text (Section 2 and Theorems 3.1, 4.1) derives the precise polynomial exponents needed to balance approximation, optimization, and generalization errors, showing they match kernel minimax rates under this condition. revision: no
Referee: [Abstract] Abstract: the rates and optimality claim are stated without derivation steps, error bounds, or proof sketches. The full manuscript must supply these to allow verification that the rates are indeed minimax optimal and not post-hoc.

Authors: The abstract serves as a high-level summary. The full manuscript supplies the required derivations: error decompositions, explicit bounds on each term, and proof sketches appear in Sections 3 and 4, with complete proofs in the appendix. Minimax optimality is verified by matching the derived upper bounds to known kernel lower bounds once the width condition holds. revision: no

Circularity Check

0 steps flagged

No significant circularity; rates derived from NTK analysis under explicit width assumption

full rationale

The paper presents minimax-optimal excess risk rates for GD/SGD on deep ReLU networks as derived results within the NTK regime, conditioned on the polynomial width scaling assumption (abstract). This scaling is stated as a prerequisite for the analysis to hold and is not obtained by fitting or self-definition from the target rates. No quoted steps reduce a claimed prediction to an input by construction, nor do self-citations form a load-bearing chain that substitutes for independent derivation. The result is positioned as extending prior NTK work to deep nets with sufficient width, remaining self-contained against external kernel-method benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the NTK regime for overparameterized networks and the polynomial width scaling; no free parameters or new entities are introduced in the abstract.

axioms (2)

domain assumption Neural tangent kernel regime holds for the deep ReLU networks under study
Abstract opens by referencing recent progress in the NTK regime and builds the deep-network result on that foundation.
standard math Standard minimax lower bounds for kernel methods apply as comparison baseline
The optimality claim is relative to kernel-method rates already established in the literature.

pith-pipeline@v0.9.1-grok · 5684 in / 1340 out tokens · 19728 ms · 2026-06-27T23:04:21.173351+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 4 linked inside Pith

[1]

A convergence theory for deep learning via over- parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

2019
[2]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

2019
[3]

Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

2017
[4]

Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

Pith/arXiv arXiv 2014
[5]

Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

2017
[6]

Deep equals shallow for relu networks in kernel regimes

Alberto Bietti and Francis Bach. Deep equals shallow for relu networks in kernel regimes. InInternational Conference on Learning Representations, 2021

2021
[7]

On the inductive bias of neural tangent kernels

Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. InAdvances in Neural Information Processing Systems, volume 32, 2019. 36

2019
[8]

Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

1989
[9]

Sgd learns over-parameterized net- works that provably generalize on linearly separable data

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized net- works that provably generalize on linearly separable data. InInternational Conference on Learning Representa- tions, 2018

2018
[10]

Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

arXiv 2024
[11]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[12]

Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349– 3356, 2020

2020
[13]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007
[14]

Learning with sgd and random features

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

2018
[15]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

2021
[16]

On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

2002
[17]

Cambridge Uni- versity Press, 2007

Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge Uni- versity Press, 2007

2007
[18]

Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

2016
[19]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

2019
[20]

Gradient descent provably optimizes over- parameterized neural networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over- parameterized neural networks. InInternational Conference on Learning Representations, 2018

2018
[21]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

2023
[22]

Size-independent sample complexity of neural networks

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

2018
[23]

Springer Science & Business Media, 2006

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006
[24]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

2012
[25]

Regularization matters: A nonparametric perspective on overparametrized neural network

Tianyang Hu, Wenjia Wang, Cong Lin, and Guang Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. InInternational Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021

2021
[26]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018
[27]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

2020
[28]

Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

2017
[29]

Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

Ilja Kuzborskij and Csaba Szepesvári. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

arXiv 2022
[30]

Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

Jianfa Lai, Manyun Xu, Rui Chen, and Qian Lin. Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

arXiv 2023
[31]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

2022
[32]

Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

2026
[33]

Learning overparameterized neural networks via stochastic gradient descent on structured data

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018
[34]

Optimal learning for multi-pass stochastic gradient methods

Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. InAdvances in Neural Information Processing Systems, pages 4556–4564, 2016

2016
[35]

Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

2017
[36]

Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

2017
[37]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

2015
[38]

Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

Mike Nguyen and Nicole Mücke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

arXiv 2023
[39]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole Mücke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

2024
[40]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

2021
[41]

Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

arXiv 1905
[42]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022
[43]

A spectral analysis of dot-product kernels

Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. InInternational Conference on Artificial Intelligence and Statistics, pages 3394–3402. PMLR, 2021

2021
[44]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

2016
[45]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

2008
[46]

On learning over-parameterized neural networks: A functional approximation per- spective

Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation per- spective. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019
[47]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

2024
[48]

Sharper guarantees for learning neural network classifiers with gradient methods

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025. 38

2025
[49]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

2018
[50]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

2025
[51]

Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Pith/arXiv arXiv 2026
[52]

Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Pith/arXiv arXiv 2026
[53]

Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94): 1–83, 2024

2024
[54]

On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

2007
[55]

Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

2008
[56]

Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

Yiming Ying and Ding-Xuan Zhou. Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

2006
[57]

Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

2021
[58]

Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

2077
[59]

Stochastic gradient descent optimizes over- parameterized deep relu networks

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over- parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018

Pith/arXiv arXiv 2018
[60]

Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 39

2020

[1] [1]

A convergence theory for deep learning via over- parameterization

Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. InInternational Conference on Machine Learning, pages 242–252. PMLR, 2019

2019

[2] [2]

Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks

Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. InInternational Conference on Machine Learning, pages 322–332. PMLR, 2019

2019

[3] [3]

Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

Francis Bach. Breaking the curse of dimensionality with convex neural networks.Journal of Machine Learning Research, 18(19):1–53, 2017

2017

[4] [4]

Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate.arXiv preprint arXiv:1409.0473, 2014

Pith/arXiv arXiv 2014

[5] [5]

Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural net- works.Advances in Neural Information Processing Systems, 30, 2017

2017

[6] [6]

Deep equals shallow for relu networks in kernel regimes

Alberto Bietti and Francis Bach. Deep equals shallow for relu networks in kernel regimes. InInternational Conference on Learning Representations, 2021

2021

[7] [7]

On the inductive bias of neural tangent kernels

Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. InAdvances in Neural Information Processing Systems, volume 32, 2019. 36

2019

[8] [8]

Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

Anselm Blumer, Andrzej Ehrenfeucht, David Haussler, and Manfred K Warmuth. Learnability and the vapnik- chervonenkis dimension.Journal of the ACM (JACM), 36(4):929–965, 1989

1989

[9] [9]

Sgd learns over-parameterized net- works that provably generalize on linearly separable data

Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized net- works that provably generalize on linearly separable data. InInternational Conference on Learning Representa- tions, 2018

2018

[10] [10]

Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

Dinghao Cao, Zheng-Chu Guo, and Lei Shi. Stochastic gradient descent for two-layer neural networks.arXiv preprint arXiv:2407.07670, 2024

arXiv 2024

[11] [11]

Generalization bounds of stochastic gradient descent for wide and deep neural networks

Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[12] [12]

Generalization error bounds of gradient descent for learning over-parameterized deep relu networks

Yuan Cao and Quanquan Gu. Generalization error bounds of gradient descent for learning over-parameterized deep relu networks. InProceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 3349– 3356, 2020

2020

[13] [13]

Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm.Foundations of Computational Mathematics, 7(3):331–368, 2007

2007

[14] [14]

Learning with sgd and random features

Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with sgd and random features. InAdvances in Neural Information Processing Systems, pages 10213–10224, 2018

2018

[15] [15]

How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

Zixiang Chen, Yuan Cao, Difan Zou, and Quanquan Gu. How much over-parameterization is sufficient to learn deep relu networks? InInternational Conference on Learning Representations, 2021

2021

[16] [16]

On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

Felipe Cucker and Steve Smale. On the mathematical foundations of learning.Bulletin of the American mathe- matical society, 39(1):1–49, 2002

2002

[17] [17]

Cambridge Uni- versity Press, 2007

Felipe Cucker and Ding-Xuan Zhou.Learning Theory: an Approximation Theory Viewpoint. Cambridge Uni- versity Press, 2007

2007

[18] [18]

Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes.Annals of Statistics, 44(4):1363–1399, 2016

2016

[19] [19]

Gradient descent finds global minima of deep neural networks

Simon Du, Jason Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. InInternational Conference on Machine Learning, pages 1675–1685. PMLR, 2019

2019

[20] [20]

Gradient descent provably optimizes over- parameterized neural networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over- parameterized neural networks. InInternational Conference on Learning Representations, 2018

2018

[21] [21]

Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

Spencer Frei, Niladri S Chatterji, and Peter L Bartlett. Random feature amplification: Feature learning and generalization in neural networks.Journal of Machine Learning Research, 24(303):1–49, 2023

2023

[22] [22]

Size-independent sample complexity of neural networks

Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. InConference On Learning Theory, pages 297–299. PMLR, 2018

2018

[23] [23]

Springer Science & Business Media, 2006

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk.A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006

2006

[24] [24]

Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups.IEEE Signal processing magazine, 29(6):82–97, 2012

2012

[25] [25]

Regularization matters: A nonparametric perspective on overparametrized neural network

Tianyang Hu, Wenjia Wang, Cong Lin, and Guang Cheng. Regularization matters: A nonparametric perspective on overparametrized neural network. InInternational Conference on Artificial Intelligence and Statistics, pages 829–837. PMLR, 2021

2021

[26] [26]

Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018

[27] [27]

Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks

Ziwei Ji and Matus Telgarsky. Polylogarithmic width suffices for gradient descent to achieve arbitrarily small 37 test error with shallow relu networks. InInternational Conference on Learning Representations, 2020

2020

[28] [28]

Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks.Communications of the ACM, 60(6):84–90, 2017

2017

[29] [29]

Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

Ilja Kuzborskij and Csaba Szepesvári. Learning lipschitz functions by gd-trained shallow overparameterized relu neural networks.arXiv preprint arXiv:2212.13848, 2022

arXiv 2022

[30] [30]

Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

Jianfa Lai, Manyun Xu, Rui Chen, and Qian Lin. Generalization ability of wide neural networks onR.arXiv preprint arXiv:2302.05933, 2023

arXiv 2023

[31] [31]

Stability and generalization analysis of gradient methods for shallow neural networks

Yunwen Lei, Rong Jin, and Yiming Ying. Stability and generalization analysis of gradient methods for shallow neural networks. InAdvances in Neural Information Processing Systems, volume 35. PMLR, 2022

2022

[32] [32]

Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

Yunwen Lei, Puyu Wang, Yiming Ying, and Ding-Xuan Zhou. Optimization and generalization of gradient descent for shallow ReLU networks with minimal width.Journal of Machine Learning Research, 27(34):1–35, 2026

2026

[33] [33]

Learning overparameterized neural networks via stochastic gradient descent on structured data

Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. InAdvances in Neural Information Processing Systems, volume 31, 2018

2018

[34] [34]

Optimal learning for multi-pass stochastic gradient methods

Junhong Lin and Lorenzo Rosasco. Optimal learning for multi-pass stochastic gradient methods. InAdvances in Neural Information Processing Systems, pages 4556–4564, 2016

2016

[35] [35]

Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

Junhong Lin and Lorenzo Rosasco. Optimal rates for multi-pass stochastic gradient methods.Journal of Machine Learning Research, 18(1):3375–3421, 2017

2017

[36] [36]

Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

Shao-Bo Lin and Ding-Xuan Zhou. Distributed kernel-based gradient descent algorithms.Constructive Approx- imation, pages 1–28, 2017

2017

[37] [37]

Norm-based capacity control in neural networks

Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401. PMLR, 2015

2015

[38] [38]

Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

Mike Nguyen and Nicole Mücke. Random feature approximation for general spectral methods.arXiv preprint arXiv:2308.15434, 2023

arXiv 2023

[39] [39]

How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

Mike Nguyen and Nicole Mücke. How many neurons do we need? a refined analysis for shallow networks trained with gradient descent.Journal of Statistical Planning and Inference, page 106169, 2024

2024

[40] [40]

Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime

Atsushi Nitanda and Suzuki Taiji. Optimal rates for averaged stochastic gradient descent under neural tangent kernel regime. InInternational Conference on Learning Representations, 2021

2021

[41] [41]

Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

Atsushi Nitanda, Geoffrey Chinot, and Taiji Suzuki. Gradient descent can learn less over-parameterized two- layer neural networks on classification problems.arXiv preprint arXiv:1905.09870, 2019

arXiv 1905

[42] [42]

Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

Rahul Parhi and Robert D Nowak. Near-minimax optimal estimation with shallow relu neural networks.IEEE Transactions on Information Theory, 69(2):1125–1140, 2022

2022

[43] [43]

A spectral analysis of dot-product kernels

Meyer Scetbon and Zaid Harchaoui. A spectral analysis of dot-product kernels. InInternational Conference on Artificial Intelligence and Statistics, pages 3394–3402. PMLR, 2021

2021

[44] [44]

Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search.nature, 529(7587):484–489, 2016

2016

[45] [45]

Springer Science & Business Media, 2008

Ingo Steinwart and Andreas Christmann.Support Vector Machines. Springer Science & Business Media, 2008

2008

[46] [46]

On learning over-parameterized neural networks: A functional approximation per- spective

Lili Su and Pengkun Yang. On learning over-parameterized neural networks: A functional approximation per- spective. InAdvances in Neural Information Processing Systems, volume 32, 2019

2019

[47] [47]

Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

Hossein Taheri and Christos Thrampoulidis. Generalization and stability of interpolating neural networks with minimal width.Journal of Machine Learning Research, 25(156):1–41, 2024

2024

[48] [48]

Sharper guarantees for learning neural network classifiers with gradient methods

Hossein Taheri, Christos Thrampoulidis, and Arya Mazumdar. Sharper guarantees for learning neural network classifiers with gradient methods. InInternational Conference on Learning Representations, 2025. 38

2025

[49] [49]

Cambridge university press, 2018

Roman Vershynin.High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018

2018

[50] [50]

Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

Puyu Wang, Yunwen Lei, Di Wang, Yiming Ying, and Ding-Xuan Zhou. Generalization guarantees of gradient descent for shallow neural networks.Neural Computation, 37(2):344–402, 2025

2025

[51] [51]

Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Puyu Wang, Jan Schuchardt, Nikita Kalinin, Junyu Zhou, Sophie Fellenz, Christoph Lampert, and Marius Kloft. Population risk bounds for kolmogorov-arnold networks trained by dp-sgd with correlated noise.arXiv preprint arXiv:2605.12648, 2026

Pith/arXiv arXiv 2026

[52] [52]

Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Puyu Wang, Junyu Zhou, Philipp Liznerski, and Marius Kloft. Optimization, generalization and differential privacy bounds for gradient descent on kolmogorov-arnold networks.arXiv preprint arXiv:2601.22409, 2026

Pith/arXiv arXiv 2026

[53] [53]

Jiaming Xu and Hanjing Zhu. Overparametrized multi-layer neural networks: Uniform concentration of neural tangent kernel and convergence of stochastic gradient descent.Journal of Machine Learning Research, 25(94): 1–83, 2024

2024

[54] [54]

On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

Yuan Yao, Lorenzo Rosasco, and Andrea Caponnetto. On early stopping in gradient descent learning.Construc- tive Approximation, 26(2):289–315, 2007

2007

[55] [55]

Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

Yiming Ying and Massimiliano Pontil. Online gradient descent learning algorithms.Foundations of Computa- tional Mathematics, 8(5):561–596, 2008

2008

[56] [56]

Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

Yiming Ying and Ding-Xuan Zhou. Online regularized classification algorithms.IEEE Transactions on Infor- mation Theory, 52(11):4775–4788, 2006

2006

[57] [57]

Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization.Communications of the ACM, 64(3):107–115, 2021

2021

[58] [58]

Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

Tong Zhang. Learning bounds for kernel regression using effective data dimensionality.Neural computation, 17 (9):2077–2098, 2005

2077

[59] [59]

Stochastic gradient descent optimizes over- parameterized deep relu networks

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over- parameterized deep relu networks. arxiv e-prints, art.arXiv preprint arXiv:1811.08888, 2018

Pith/arXiv arXiv 2018

[60] [60]

Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020

Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Gradient descent optimizes over-parameterized deep relu networks.Machine learning, 109:467–492, 2020. 39

2020