Recognition: 2 theorem links · Lean Theorem
Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks
Pith reviewed 2026-05-13 23:23 UTC · model grok-4.3
The pith
Initialization-dependent path-norm bounds provide non-vacuous generalization guarantees for overparameterized shallow neural networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization and are derived by introducing a new peeling technique that handles this initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor, and show through experiments that the bounds are non-vacuous for overparameterized networks.
What carries the argument
The path-norm of the distance from initialization, controlled via a new peeling technique that manages complexity under the initialization-dependent constraint.
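To make the central quantity concrete, here is a minimal sketch of the path-norm of the distance from initialization for a depth-2 network, following the form quoted under Definition 2 further down the page (κ(W,V) = Σ_j |v_j| ‖w_j − w_j^(0)‖₂). The network shape, the initialization scale, and the scalar output weights are illustrative assumptions, not the paper's code or exact conventions.

```python
import numpy as np

def path_norm_distance(v, W, W0):
    """Path-norm of the distance from initialization for a depth-2 net.

    Computes kappa(W, v) = sum_j |v_j| * ||w_j - w_j^(0)||_2, the form
    quoted under Definition 2 further down the page (norm conventions
    are assumed here, not taken from the paper's exact statement).

    v  : (m,)   output-layer weights
    W  : (m, d) trained hidden-layer weights (row j feeds hidden unit j)
    W0 : (m, d) hidden-layer weights at initialization
    """
    per_unit = np.linalg.norm(W - W0, axis=1)   # ||w_j - w_j^(0)||_2 per unit
    return float(np.sum(np.abs(v) * per_unit))

# Tiny example: an overparameterized hidden layer (width m much larger than
# the sample sizes one would pair with d = 10 inputs).
rng = np.random.default_rng(0)
d, m = 10, 5000
W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))   # assumed init scale
W = W0 + rng.normal(scale=1e-3, size=(m, d))            # small displacement
v = rng.normal(scale=1.0 / np.sqrt(m), size=m)

print("path-norm of distance from init:", path_norm_distance(v, W, W0))
print("path-norm of the weights themselves:",
      float(np.sum(np.abs(v) * np.linalg.norm(W, axis=1))))
```

The comparison in the two print statements is the contrast the approach leans on: when the displacement from initialization is small, the first quantity is far smaller than the second, even though the raw parameter count is huge.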
If this is right
- The bounds remain finite and informative even when the number of parameters greatly exceeds the number of training examples.
- They apply to a broad class of activation functions that are Lipschitz continuous.
- Empirical tests show that the bounds track observed generalization behavior rather than being vacuously loose.
- The lower bound indicates that the upper bounds cannot be improved by more than a constant factor.
Where Pith is reading between the lines
- Similar path-norm based analyses could be attempted for deeper architectures to explain their generalization.
- Training algorithms that minimize the path-norm distance might lead to better generalization in practice.
- The approach highlights the importance of initialization scale in controlling effective capacity.
Load-bearing premise
The path-norm distance from initialization stays sufficiently small relative to the weights themselves that the peeling technique keeps the complexity bound tight.
What would settle it
If training an overparameterized network produced a large path-norm distance from initialization while the network still achieved low test error, the bounds would fail to explain the generalization.
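A minimal numerical sketch of the kind of experiment that would probe this: train an overparameterized shallow ReLU network with plain gradient descent on synthetic data and track both the path-norm distance from initialization and the test error over training. The data model, width, learning rate, and step count below are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(1)
n, n_test, d, m = 100, 1000, 5, 1000        # m * d >> n: overparameterized
lr, steps = 0.05, 1000

# Synthetic linear teacher with noise (an illustrative assumption).
beta = rng.normal(size=d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(n_test, d))
y = X @ beta + 0.1 * rng.normal(size=n)
yt = Xt @ beta

# Depth-2 ReLU network with an assumed 1/sqrt(fan-in) initialization.
W0 = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
v0 = rng.normal(scale=1.0 / np.sqrt(m), size=m)
W, v = W0.copy(), v0.copy()

def forward(X, W, v):
    H = np.maximum(X @ W.T, 0.0)            # (n, m) hidden ReLU activations
    return H @ v, H

def kappa(v, W, W0):
    """Path-norm of the distance from initialization (Definition 2 form)."""
    return float(np.sum(np.abs(v) * np.linalg.norm(W - W0, axis=1)))

for t in range(steps + 1):
    pred, H = forward(X, W, v)
    err = pred - y                          # residuals of the squared loss
    if t % 200 == 0:
        test_mse = float(np.mean((forward(Xt, W, v)[0] - yt) ** 2))
        print(f"step {t:4d}  train_mse={np.mean(err**2):.4f}  "
              f"test_mse={test_mse:.4f}  kappa={kappa(v, W, W0):.4f}")
    grad_v = H.T @ err / n                  # gradient of (1/2n) * sum of err^2
    grad_W = ((err[:, None] * (H > 0)) * v).T @ X / n
    v -= lr * grad_v
    W -= lr * grad_W
```

If runs like this routinely produced a large path-norm distance together with low test error, the bounds would stop explaining the observed generalization; if the distance stays small while test error drops, the load-bearing premise above survives the check.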
read the original abstract
Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses measure the distance from initialization by the Frobenius norm, and often imply vacuous bounds in practice for overparameterized models. In this paper, we develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper develops initialization-dependent generalization bounds for shallow neural networks with general Lipschitz activations. The bounds are expressed in terms of the path-norm of the distance from initialization (rather than the Frobenius norm of the weights themselves) and are obtained via a new peeling argument that respects an initialization-dependent constraint. A matching lower bound (tight up to a constant factor) is also derived, and empirical comparisons are presented to argue that the resulting bounds are non-vacuous for overparameterized networks.
Significance. If the peeling argument indeed yields bounds whose only dependence on the activation Lipschitz constant is linear (or can be normalized away) and whose path-norm term remains small in practice, the result would meaningfully strengthen the initialization-distance approach to explaining benign overfitting. The provision of both an upper bound and a nearly matching lower bound, together with empirical verification on overparameterized regimes, would constitute a concrete advance over prior Frobenius-norm analyses that are typically vacuous.
major comments (2)
- [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.
- [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.
minor comments (2)
- [Section 2] Notation for the path-norm of the distance from initialization should be introduced with a self-contained definition before its first use in the main theorem statement.
- [Section 5] The empirical section should report the precise data regime (number of samples, width, activation, initialization variance) and the baseline bounds against which non-vacuousness is claimed.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help us strengthen the presentation of our results. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.
read point-by-point responses
- Referee: [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.
Authors: We thank the referee for highlighting this point. In the peeling argument of Section 3, the Rademacher complexity is bounded by a term that depends linearly on the Lipschitz constant L of the activation (specifically, a multiplicative factor of L appears after the peeling step, with no further dependence on width or initialization scale beyond the path-norm of the distance from initialization); a generic version of this contraction step is sketched after these responses. This linear factor can be normalized by rescaling the loss function, preserving the non-vacuous character of the bounds in the overparameterized regime. We will add an explicit statement of this algebraic dependence immediately following the peeling step and in the statement of the main upper bound theorem. Revision: yes.
- Referee: [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.
Authors: We agree that making the tightness claim fully explicit improves clarity. The lower bound in Theorem 4.1 matches the upper bound up to a universal constant factor (independent of width) in the regime of depth-2 networks with width W tending to infinity and standard Gaussian initialization with variance 1/W. We will revise the theorem statement and the accompanying discussion to include this explicit constant and regime description. Revision: yes.
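For orientation on the first exchange above, the standard mechanism by which an activation Lipschitz constant enters a Rademacher-complexity argument is a contraction step. A generic version is sketched below; the exact statement, side conditions, and constants in the paper's peeling argument may differ, so this is context rather than a reproduction of the paper's lemma.

```latex
% Generic Lipschitz contraction step (Talagrand-style), not the paper's exact lemma.
% If the activation \gamma is L-Lipschitz and \mathcal{H} is a class of real-valued
% functions, one standard version of the contraction inequality gives
\[
  \widehat{\mathfrak{R}}_S(\gamma \circ \mathcal{H})
  \;=\; \mathbb{E}_{\sigma}\Big[\sup_{h \in \mathcal{H}}
        \frac{1}{n}\sum_{i=1}^{n} \sigma_i\,\gamma\big(h(x_i)\big)\Big]
  \;\le\; L\,\widehat{\mathfrak{R}}_S(\mathcal{H}).
\]
% The referee's question is whether the paper's peeling argument keeps L as a single
% multiplicative factor of this kind, or compounds it with width- or scale-dependent
% terms; the rebuttal asserts the former.
```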
Circularity Check
No significant circularity; the derivation relies on a novel peeling technique and a direct path-norm definition.
full rationale
The paper derives initialization-dependent Rademacher complexity bounds for shallow networks by introducing a new peeling argument that controls complexity via the path-norm of the distance from initialization. This path-norm is defined directly from the network parameters and fixed initialization without reference to the target bound or any fitted quantity. No step reduces a claimed prediction to a self-citation chain, an ansatz imported from the authors' prior work, or a parameter fitted to the same data being bounded. The empirical section compares the resulting bounds to observed generalization but does not feed back into the derivation. The analysis is therefore self-contained against external mathematical benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: activation functions are Lipschitz continuous.
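Spelled out, the assumption is that there is a constant L > 0 such that the activation γ satisfies

```latex
\[
  |\gamma(s) - \gamma(t)| \;\le\; L\,|s - t| \qquad \text{for all } s, t \in \mathbb{R},
\]
```

which covers ReLU, leaky ReLU, tanh, and sigmoid, but excludes activations with unbounded slope.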
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: Theorem 1 (Complexity Bounds) ... $\mathcal{R}_S(\mathcal{G}) \le \dots\, G_\gamma \sup \kappa(W,V)\, \big( \tfrac{2+\sqrt{5}}{n} \big( \sum_i \|x_i\|^2 \big)^{1/2} + \dots \big)$, with peeling over $H_{k,j} = \big\{ \big( \gamma(x^\top w) - \gamma(x^\top w^{(0)}_j) \big) / \|w - w^{(0)}_j\| \;:\; r_k < \|w - w^{(0)}_j\| \le r_{k+1} \big\}$
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Unclear: relation between the paper passage and the cited Recognition theorem.
Linked passage: Definition 2 (Path-norm) $\kappa(W,V) = \sum_j |v_{kj}|\, \|w_j - w^{(0)}_j\|_2$
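For context on the H_{k,j} classes quoted above, the generic peeling device behind such shell decompositions splits the distance-from-initialization ball into shells of geometrically growing radius and controls each shell after normalizing by its radius. The schematic below is the textbook version of the trick, not the paper's exact argument, and the innermost ball ‖w − w^(0)‖ ≤ r_0 has to be handled separately.

```latex
% Schematic peeling over norm shells (generic device, not the paper's exact lemma).
% Shells around initialization, with geometrically growing radii r_{k+1} = 2 r_k:
%   S_k = \{\, w : r_k < \|w - w^{(0)}\|_2 \le r_{k+1} \,\}.
% For a nonnegative process X(w), bounding the maximum over shells by the sum gives
\[
  \mathbb{E}\Big[\sup_{r_0 < \|w - w^{(0)}\| \le R} X(w)\Big]
  \;\le\; \sum_{k=0}^{\lceil \log_2 (R/r_0) \rceil}
          r_{k+1}\;\mathbb{E}\Big[\sup_{w \in S_k} \frac{X(w)}{\|w - w^{(0)}\|}\Big],
\]
% which is why the normalized classes H_{k,j} above divide by \|w - w^{(0)}_j\|
% and constrain the distance to lie in the shell (r_k, r_{k+1}].
```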
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.