Optimal Rates for Generalization of Gradient Descent for Deep ReLU Classification
Pith reviewed 2026-05-18 10:21 UTC · model grok-4.3
The pith
Under NTK separability with margin γ, gradient descent on deep ReLU networks attains an excess risk of Õ(L^6/(n γ²)).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Under the assumption that the data are NTK separable from the margin γ, gradient descent on deep ReLU networks achieves an excess risk of Õ(L^6 / (n γ²)), which aligns with the optimal SVM-type rate Õ(1 / (n γ²)) up to depth-dependent factors. The proof rests on a novel control of activation patterns near a reference model that produces a sharper Rademacher complexity bound for the networks trained by gradient descent.
What carries the argument
Novel control of activation patterns near a reference model that produces a sharper Rademacher complexity bound for deep ReLU networks trained by gradient descent.
If this is right
- Gradient descent on deep ReLU networks can match kernel-method optimal rates when the NTK separability condition holds.
- The generalization bound depends only polynomially on network depth L.
- Optimization and generalization errors can be traded off to remove exponential depth dependence for ReLU activations.
- The result recovers the classical SVM-type rate up to explicit factors in L.
Where Pith is reading between the lines
- Deeper ReLU networks become practically viable for generalization as long as depth grows slower than a suitable power of sample size.
- Similar activation-pattern controls might extend the same rates to other piecewise-linear activations.
- The bound suggests that NTK separability is a sufficient condition for near-optimal performance of GD on ReLU architectures in classification.
Load-bearing premise
The data must be separable under the neural tangent kernel with a positive margin γ.
What would settle it
Observe a dataset that is NTK-separable with margin γ yet yields excess risk for GD-trained deep ReLU nets worse than order L^6 over n γ squared.
read the original abstract
Recent advances have significantly improved our understanding of the generalization performance of gradient descent (GD) methods in deep neural networks. A natural and fundamental question is whether GD can achieve generalization rates comparable to the minimax optimal rates established in the kernel setting. Existing results either yield suboptimal rates of $O(1/\sqrt{n})$, or focus on networks with smooth activation functions, incurring exponential dependence on network depth $L$. In this work, we establish optimal generalization rates for GD with deep ReLU networks by carefully trading off optimization and generalization errors, achieving only polynomial dependence on depth. Specifically, under the assumption that the data are NTK separable from the margin $\gamma$, we prove an excess risk rate of $\widetilde{O}(L^6 / (n \gamma^2))$, which aligns with the optimal SVM-type rate $\widetilde{O}(1 / (n \gamma^2))$ up to depth-dependent factors. A key technical contribution is our novel control of activation patterns near a reference model, enabling a sharper Rademacher complexity bound for deep ReLU networks trained with gradient descent.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish optimal generalization rates for gradient descent applied to deep ReLU networks for classification. Under the assumption that the data are NTK separable with margin γ, they prove an excess risk of Õ(L^6 / (n γ²)), which is optimal up to factors polynomial in the depth L. The proof involves a novel bound on activation patterns near a reference model to sharpen the Rademacher complexity for the trajectory of GD iterates, combined with a careful trade-off between optimization and generalization errors.
Significance. This work is significant for the field of learning theory in deep neural networks. It provides a result with only polynomial depth dependence for ReLU activations, improving on previous results with suboptimal rates or exponential depth factors for smooth activations. The novel control of activation patterns is a technical strength that could have broader applicability. If the proof is complete, it helps explain the generalization behavior of GD on deep ReLU nets in the separable regime.
major comments (1)
- [Activation pattern control and main proof] The bound on the number of activation patterns in a neighborhood of the reference NTK model is central to controlling the Rademacher complexity (key technical contribution). However, for this to apply to the functions produced by GD, the optimization error must ensure that the iterates do not cross activation pattern boundaries, i.e., remain at a distance much smaller than γ / L^C for some C. The abstract mentions trading off the errors, but an explicit verification that the achieved optimization accuracy satisfies this distance requirement is needed to confirm the rate holds for the GD output rather than a hypothetical class.
minor comments (2)
- [Abstract] The abstract is clear, but consider adding a brief mention of the network architecture assumptions (e.g., width, initialization) for completeness.
- [Notation] Clarify the precise norm in which the neighborhood for activation patterns is defined.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation of our work and for the constructive comment on clarifying the optimization-generalization tradeoff. We address the point below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: The bound on the number of activation patterns in a neighborhood of the reference NTK model is central to controlling the Rademacher complexity (key technical contribution). However, for this to apply to the functions produced by GD, the optimization error must ensure that the iterates do not cross activation pattern boundaries, i.e., remain at a distance much smaller than γ / L^C for some C. The abstract mentions trading off the errors, but an explicit verification that the achieved optimization accuracy satisfies this distance requirement is needed to confirm the rate holds for the GD output rather than a hypothetical class.
Authors: We thank the referee for this observation. In the proof of the main result (Theorem 3.1 and the surrounding lemmas in Section 4), the optimization accuracy is explicitly chosen as ε_opt = Θ(γ² / L^6) to balance the terms while ensuring the GD trajectory remains inside a neighborhood of radius O(γ / L^C) around the reference NTK model for a sufficiently large C (specifically, the distance bound follows from the Lipschitz property of the network and the margin assumption). This choice guarantees that the activation-pattern control applies directly to the GD iterates rather than a hypothetical class. We will add a short dedicated paragraph immediately after the statement of the main theorem to make this verification fully explicit and self-contained. revision: yes
Circularity Check
No circularity: bound derived from explicit Rademacher + optimization tradeoff under stated separability assumption
full rationale
The paper establishes the excess risk rate via a Rademacher complexity argument that incorporates a novel bound on activation pattern stability in a neighborhood of the NTK reference model, combined with explicit control of optimization error for GD iterates. This construction does not reduce any quantity to a fitted parameter, self-defined target, or self-citation chain; the separability assumption is an external input, and the depth-dependent factors arise from the pattern-counting argument rather than from re-expressing the target rate. No load-bearing step equates the claimed prediction to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Data are NTK separable from the margin γ
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.