Pith · machine review for the scientific record

arxiv: 2604.00505 · v3 · submitted 2026-04-01 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords: generalization bounds · overparameterized networks · initialization-dependent · path norm · shallow neural networks · Lipschitz activations · benign overfitting

The pith

Initialization-dependent path-norm bounds provide non-vacuous generalization guarantees for overparameterized shallow neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to explain the good generalization of overparameterized neural networks by linking it to the small distance the model travels from its initial random weights during training. It replaces the usual Frobenius norm measure of this distance with a path-norm version that allows much tighter bounds even when the network is very wide. A new peeling technique is introduced to derive these bounds for networks using any Lipschitz continuous activation function, and a matching lower bound is given. The resulting estimates are shown to be non-vacuous on practical examples, offering a way to theoretically account for benign overfitting without vacuous predictions.
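To make the contrast concrete, here is a minimal numerical sketch (not the paper's code) of the two distance measures for a depth-2 network; the width, initialization scale, perturbation size, and the product-form path norm are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 100, 10_000  # input dimension, hidden width (heavily overparameterized)

    # Gaussian initialization; outer layer scaled by 1/sqrt(m) as in common setups.
    W0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    v0 = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)

    # Stand-in for training: every parameter moves a small amount from its start.
    W = W0 + rng.normal(0.0, 0.01 / np.sqrt(d), size=(m, d))
    v = v0 + rng.normal(0.0, 0.01 / np.sqrt(m), size=m)

    # Frobenius distance from initialization (the measure prior analyses used).
    frob_dist = np.sqrt(np.linalg.norm(W - W0, "fro") ** 2
                        + np.linalg.norm(v - v0) ** 2)

    # Path-norm-style distance: each neuron's movement weighted by its outgoing
    # weight (one plausible instantiation, assumed here for illustration).
    path_dist = np.sum(np.abs(v) * np.linalg.norm(W - W0, axis=1))

    print(f"Frobenius distance from init: {frob_dist:.4f}")
    print(f"path-norm distance from init: {path_dist:.4f}")

Figure 1 below reports the analogous comparison on trained MNIST networks, where the distance-based quantities stay well below the standard path norm and are insensitive to the model width.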

Core claim

We develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization and are derived via a new peeling technique that handles the difficulty introduced by the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor, and we show through experiments that the bounds are non-vacuous for overparameterized networks.

What carries the argument

The path-norm of the distance from initialization, controlled via a new peeling technique that manages complexity under the initialization-dependent constraint.
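For concreteness, here is one standard way to write the relevant objects for a depth-2 network; the distance-from-initialization variant shown is an assumed illustrative form, since the paper's exact definition may differ:

    \[
      f_{W,v}(x) \;=\; \sum_{j=1}^{m} v_j\,\sigma\big(w_j^{\top} x\big),
      \qquad
      \|f\|_{\mathrm{path}} \;=\; \sum_{j=1}^{m} |v_j|\,\|w_j\|,
    \]
    \[
      \mathrm{dist}_{\mathrm{path}}\big(\theta, \theta^{(0)}\big)
      \;=\; \sum_{j=1}^{m} |v_j|\,\big\|w_j - w_j^{(0)}\big\|
      \qquad \text{(assumed variant)}.
    \]

Swapping \(\|w_j\|\) for \(\|w_j - w_j^{(0)}\|\) is what lets the measure stay small when training barely moves each neuron, even at very large width \(m\).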

If this is right

  • The bounds remain finite and informative even when the number of parameters greatly exceeds the number of training examples.
  • They apply to a broad class of activation functions that are Lipschitz continuous.
  • Empirical tests confirm that the bounds track actual generalization behavior without becoming vacuous.
  • The lower bound indicates that the upper bounds cannot be improved by more than a constant factor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar path-norm based analyses could be attempted for deeper architectures to explain their generalization.
  • Training algorithms that minimize the path-norm distance might lead to better generalization in practice.
  • The approach highlights the importance of initialization scale in controlling effective capacity.

Load-bearing premise

The path-norm distance from initialization stays small enough, relative to the scale of the weights, that the peeling technique keeps the complexity bound tight.

What would settle it

If training an overparameterized network results in a large path-norm distance from initialization yet still achieves low test error, the bounds would fail to explain the generalization.
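A minimal sketch of such a test, under explicit assumptions (synthetic separable data, a depth-2 ReLU network, plain SGD, and the same illustrative path-norm variant as above; this does not reproduce the paper's protocol):

    import torch

    torch.manual_seed(0)
    n, d, m = 200, 20, 5000  # samples, input dim, width: m * (d + 1) >> n

    X = torch.randn(n, d)
    y = torch.sign(X[:, :1])          # linearly separable toy labels in {-1, +1}
    X_te = torch.randn(1000, d)
    y_te = torch.sign(X_te[:, :1])

    W = (torch.randn(m, d) / d ** 0.5).requires_grad_(True)
    v = (torch.randn(m, 1) / m ** 0.5).requires_grad_(True)
    W0 = W.detach().clone()

    opt = torch.optim.SGD([W, v], lr=0.1)
    for _ in range(500):
        opt.zero_grad()
        loss = torch.nn.functional.soft_margin_loss(torch.relu(X @ W.T) @ v, y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        # Illustrative path-norm distance: outgoing weight times neuron movement.
        path_dist = (v.abs().squeeze() * (W - W0).norm(dim=1)).sum().item()
        preds = torch.sign(torch.relu(X_te @ W.T) @ v)
        test_err = (preds != y_te).float().mean().item()

    print(f"path-norm distance from init: {path_dist:.3f}")
    print(f"test error: {test_err:.3f}")

If runs of this kind showed the path-norm distance growing large while test error stayed near zero, the load-bearing premise above would be undercut.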

Figures

Figures reproduced from arXiv: 2604.00505 by Yufeng Xie, Yunwen Lei.

Figure 1
Figure 1: Behavior of $\frac{\|V\|_F}{n}\big(\sum_{j=1}^{m}\sum_{i=1}^{n}\gamma^{2}(x_i^{\top} w_j^{(0)})\big)^{1/2}$, $\|V\|_F\,(G_\gamma b_x \|W^{(0)}\|_\sigma)/\sqrt{n}$, and the path norm with respect to the number of hidden units $m$ on the MNIST dataset. The gray area represents the value range under different random trials. view at source ↗
Figure 2
Figure 2: Dominant terms in the generalization bounds summarized in Table 1, on the MNIST dataset. view at source ↗
Figure 3
Figure 3: Comparison of existing generalization bounds with all the terms and constants across … view at source ↗
read the original abstract

Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses measure the distance from initialization by the Frobenius norm, and often imply vacuous bounds in practice for overparameterized models. In this paper, we develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops initialization-dependent generalization bounds for shallow neural networks with general Lipschitz activations. The bounds are expressed in terms of the path-norm of the distance from initialization (rather than the Frobenius norm of the weights themselves) and are obtained via a new peeling argument that respects an initialization-dependent constraint. A matching lower bound (tight up to a constant factor) is also derived, and empirical comparisons are presented to argue that the resulting bounds are non-vacuous for overparameterized networks.

Significance. If the peeling argument indeed yields bounds whose only dependence on the activation Lipschitz constant is linear (or can be normalized away) and whose path-norm term remains small in practice, the result would meaningfully strengthen the initialization-distance approach to explaining benign overfitting. The provision of both an upper bound and a nearly matching lower bound, together with empirical verification on overparameterized regimes, would constitute a concrete advance over prior Frobenius-norm analyses that are typically vacuous.
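For orientation, the generic mechanism by which an activation's Lipschitz constant enters Rademacher-complexity bounds is the contraction step; peeling arguments apply steps of this kind layer by layer while tracking the constraint set. The textbook form below is a sketch of that mechanism, not necessarily the paper's exact argument:

    \[
      \widehat{\mathfrak{R}}_S(\sigma \circ \mathcal{F})
      \;=\; \mathbb{E}_{\varepsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i\, \sigma\big(f(x_i)\big)\Big]
      \;\le\; L\, \widehat{\mathfrak{R}}_S(\mathcal{F})
      \qquad \text{for } L\text{-Lipschitz } \sigma.
    \]

If the hidden layer is peeled with a single step of this form, the bound's dependence on L is one multiplicative factor, which is precisely what major comment 1 below asks the authors to state explicitly.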

major comments (2)
  1. [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.
  2. [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.
minor comments (2)
  1. [Section 2] Notation for the path-norm of the distance from initialization should be introduced with a self-contained definition before its first use in the main theorem statement.
  2. [Section 5] The empirical section should report the precise data regime (number of samples, width, activation, initialization variance) and the baseline bounds against which non-vacuousness is claimed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us strengthen the presentation of our results. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.

    Authors: We thank the referee for highlighting this point. In the peeling argument of Section 3, the Rademacher complexity is bounded by a term that depends linearly on the Lipschitz constant L of the activation (specifically, a multiplicative factor of L appears after the peeling step, with no further dependence on width or initialization scale beyond the path-norm of the distance from initialization). This linear factor can be normalized by rescaling the loss function, preserving the non-vacuous character of the bounds in the overparameterized regime. We will add an explicit statement of this algebraic dependence immediately following the peeling step and in the statement of the main upper bound theorem. revision: yes

  2. Referee: [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.

    Authors: We agree that making the tightness claim fully explicit improves clarity. The lower bound in Theorem 4.1 matches the upper bound up to a universal constant factor (independent of width) in the regime of depth-2 networks with width W tending to infinity and standard Gaussian initialization with variance 1/W. We will revise the theorem statement and the accompanying discussion to include this explicit constant and regime description. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on a novel peeling technique and a direct path-norm definition.

full rationale

The paper derives initialization-dependent Rademacher complexity bounds for shallow networks by introducing a new peeling argument that controls complexity via the path-norm of the distance from initialization. This path-norm is defined directly from the network parameters and fixed initialization without reference to the target bound or any fitted quantity. No step reduces a claimed prediction to a self-citation chain, an ansatz imported from the authors' prior work, or a parameter fitted to the same data being bounded. The empirical section compares the resulting bounds to observed generalization but does not feed back into the derivation. The analysis is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The bounds rest on standard Lipschitz continuity of the activation and on the assumption that the path-norm distance from initialization is the right complexity measure; no explicit free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Activation functions are Lipschitz continuous.
    Stated as 'general Lipschitz activation functions' in the abstract; required for the complexity control.
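In symbols, the ledger's single axiom is the standard Lipschitz condition:

    \[
      |\sigma(a) - \sigma(b)| \;\le\; L\,|a - b| \qquad \text{for all } a, b \in \mathbb{R}.
    \]

ReLU and tanh both satisfy this with L = 1; the condition is what licenses the contraction-style complexity control used throughout.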

pith-pipeline@v0.9.0 · 5467 in / 1154 out tokens · 28075 ms · 2026-05-13T23:23:51.431376+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
