Pith · machine review for the scientific record

arxiv: 2604.00505 · v3 · submitted 2026-04-01 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Towards Initialization-dependent and Non-vacuous Generalization Bounds for Overparameterized Shallow Neural Networks

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 23:23 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI
keywords: generalization bounds · overparameterized networks · initialization-dependent · path norm · shallow neural networks · Lipschitz activations · benign overfitting

The pith

Initialization-dependent path-norm bounds provide non-vacuous generalization guarantees for overparameterized shallow neural networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper seeks to explain the good generalization of overparameterized neural networks by linking it to the small distance the model travels from its initial random weights during training. It replaces the usual Frobenius norm measure of this distance with a path-norm version that allows much tighter bounds even when the network is very wide. A new peeling technique is introduced to derive these bounds for networks using any Lipschitz continuous activation function, and a matching lower bound is given. The resulting estimates are shown to be non-vacuous on practical examples, offering a way to theoretically account for benign overfitting without vacuous predictions.
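To make the contrast concrete, here is a minimal numerical sketch (not the paper's code) of the two distance measures for a depth-2 network; the width, initialization scale, perturbation size, and the product-form path norm are all illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    d, m = 100, 10_000  # input dimension, hidden width (heavily overparameterized)

    # Gaussian initialization; outer layer scaled by 1/sqrt(m) as in common setups.
    W0 = rng.normal(0.0, 1.0 / np.sqrt(d), size=(m, d))
    v0 = rng.normal(0.0, 1.0 / np.sqrt(m), size=m)

    # Stand-in for training: every parameter moves a small amount from its start.
    W = W0 + rng.normal(0.0, 0.01 / np.sqrt(d), size=(m, d))
    v = v0 + rng.normal(0.0, 0.01 / np.sqrt(m), size=m)

    # Frobenius distance from initialization (the measure prior analyses used).
    frob_dist = np.sqrt(np.linalg.norm(W - W0, "fro") ** 2
                        + np.linalg.norm(v - v0) ** 2)

    # Path-norm-style distance: each neuron's movement weighted by its outgoing
    # weight (one plausible instantiation, assumed here for illustration).
    path_dist = np.sum(np.abs(v) * np.linalg.norm(W - W0, axis=1))

    print(f"Frobenius distance from init: {frob_dist:.4f}")
    print(f"path-norm distance from init: {path_dist:.4f}")

Figure 1 below reports the analogous comparison on trained MNIST networks, where the distance-based quantities stay well below the standard path norm and are insensitive to the model width.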

Core claim

We develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization and are derived via a new peeling technique that handles the difficulty introduced by the initialization-dependent constraint. We also develop a lower bound that is tight up to a constant factor, and we show through experiments that the bounds are non-vacuous for overparameterized networks.

What carries the argument

The path-norm of the distance from initialization, controlled via a new peeling technique that manages complexity under the initialization-dependent constraint.
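For concreteness, here is one standard way to write the relevant objects for a depth-2 network; the distance-from-initialization variant shown is an assumed illustrative form, since the paper's exact definition may differ:

    \[
      f_{W,v}(x) \;=\; \sum_{j=1}^{m} v_j\,\sigma\big(w_j^{\top} x\big),
      \qquad
      \|f\|_{\mathrm{path}} \;=\; \sum_{j=1}^{m} |v_j|\,\|w_j\|,
    \]
    \[
      \mathrm{dist}_{\mathrm{path}}\big(\theta, \theta^{(0)}\big)
      \;=\; \sum_{j=1}^{m} |v_j|\,\big\|w_j - w_j^{(0)}\big\|
      \qquad \text{(assumed variant)}.
    \]

Swapping \(\|w_j\|\) for \(\|w_j - w_j^{(0)}\|\) is what lets the measure stay small when training barely moves each neuron, even at very large width \(m\).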

If this is right

  • The bounds remain finite and informative even when the number of parameters greatly exceeds the number of training examples.
  • They apply to a broad class of activation functions that are Lipschitz continuous.
  • Empirical tests confirm that the bounds track actual generalization behavior without becoming vacuous.
  • The lower bound indicates that the upper bounds cannot be improved by more than a constant factor.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar path-norm based analyses could be attempted for deeper architectures to explain their generalization.
  • Training algorithms that minimize the path-norm distance might lead to better generalization in practice.
  • The approach highlights the importance of initialization scale in controlling effective capacity.

Load-bearing premise

The path-norm distance from initialization stays small enough, relative to the scale of the weights, that the peeling technique keeps the complexity bound tight.

What would settle it

If training an overparameterized network results in a large path-norm distance from initialization yet still achieves low test error, the bounds would fail to explain the generalization.
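A minimal sketch of such a test, under explicit assumptions (synthetic separable data, a depth-2 ReLU network, plain SGD, and the same illustrative path-norm variant as above; this does not reproduce the paper's protocol):

    import torch

    torch.manual_seed(0)
    n, d, m = 200, 20, 5000  # samples, input dim, width: m * (d + 1) >> n

    X = torch.randn(n, d)
    y = torch.sign(X[:, :1])          # linearly separable toy labels in {-1, +1}
    X_te = torch.randn(1000, d)
    y_te = torch.sign(X_te[:, :1])

    W = (torch.randn(m, d) / d ** 0.5).requires_grad_(True)
    v = (torch.randn(m, 1) / m ** 0.5).requires_grad_(True)
    W0 = W.detach().clone()

    opt = torch.optim.SGD([W, v], lr=0.1)
    for _ in range(500):
        opt.zero_grad()
        loss = torch.nn.functional.soft_margin_loss(torch.relu(X @ W.T) @ v, y)
        loss.backward()
        opt.step()

    with torch.no_grad():
        # Illustrative path-norm distance: outgoing weight times neuron movement.
        path_dist = (v.abs().squeeze() * (W - W0).norm(dim=1)).sum().item()
        preds = torch.sign(torch.relu(X_te @ W.T) @ v)
        test_err = (preds != y_te).float().mean().item()

    print(f"path-norm distance from init: {path_dist:.3f}")
    print(f"test error: {test_err:.3f}")

If runs of this kind showed the path-norm distance growing large while test error stayed near zero, the load-bearing premise above would be undercut.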

Figures

Figures reproduced from arXiv: 2604.00505 by Yufeng Xie, Yunwen Lei.

Figure 1
Figure 1: Behavior of $\frac{\|V\|_F}{n}\big(\sum_{j=1}^{m}\sum_{i=1}^{n}\gamma^{2}(x_i^{\top} w_j^{(0)})\big)^{1/2}$, $\|V\|_F\,(G_\gamma b_x \|W^{(0)}\|_\sigma)/\sqrt{n}$, and the path norm with respect to the number of hidden units $m$ on the MNIST dataset. The gray area represents the value range under different random trials. view at source ↗
Figure 2
Figure 2: Dominant terms in the generalization bounds summarized in Table 1, on the MNIST dataset. view at source ↗
Figure 3
Figure 3: Comparison of existing generalization bounds with all the terms and constants across … view at source ↗
read the original abstract

Overparameterized neural networks often show a benign overfitting property in the sense of achieving excellent generalization behavior despite the number of parameters exceeding the number of training examples. A promising direction to explain benign overfitting is to relate generalization to the norm of distance from initialization, motivated by the empirical observations that this distance is often significantly smaller than the norm itself. However, the existing initialization-dependent complexity analyses measure the distance from initialization by the Frobenius norm, and often imply vacuous bounds in practice for overparameterized models. In this paper, we develop initialization-dependent complexity bounds for shallow neural networks with general Lipschitz activation functions. Our bounds depend on the path-norm of the distance from initialization, which are derived by introducing a new peeling technique to handle the challenge along with the initialization-dependent constraint. We also develop a lower bound tight up to a constant factor. Finally, we conduct empirical comparisons and show that our generalization analysis implies non-vacuous bounds for overparameterized networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper develops initialization-dependent generalization bounds for shallow neural networks with general Lipschitz activations. The bounds are expressed in terms of the path-norm of the distance from initialization (rather than the Frobenius norm of the weights themselves) and are obtained via a new peeling argument that respects an initialization-dependent constraint. A matching lower bound (tight up to a constant factor) is also derived, and empirical comparisons are presented to argue that the resulting bounds are non-vacuous for overparameterized networks.

Significance. If the peeling argument indeed yields bounds whose only dependence on the activation Lipschitz constant is linear (or can be normalized away) and whose path-norm term remains small in practice, the result would meaningfully strengthen the initialization-distance approach to explaining benign overfitting. The provision of both an upper bound and a nearly matching lower bound, together with empirical verification on overparameterized regimes, would constitute a concrete advance over prior Frobenius-norm analyses that are typically vacuous.
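For orientation, the generic mechanism by which an activation's Lipschitz constant enters Rademacher-complexity bounds is the contraction step; peeling arguments apply steps of this kind layer by layer while tracking the constraint set. The textbook form below is a sketch of that mechanism, not necessarily the paper's exact argument:

    \[
      \widehat{\mathfrak{R}}_S(\sigma \circ \mathcal{F})
      \;=\; \mathbb{E}_{\varepsilon}\Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^{n} \varepsilon_i\, \sigma\big(f(x_i)\big)\Big]
      \;\le\; L\, \widehat{\mathfrak{R}}_S(\mathcal{F})
      \qquad \text{for } L\text{-Lipschitz } \sigma.
    \]

If the hidden layer is peeled with a single step of this form, the bound's dependence on L is one multiplicative factor, which is precisely what major comment 1 below asks the authors to state explicitly.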

major comments (2)
  1. [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.
  2. [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.
minor comments (2)
  1. [Section 2] Notation for the path-norm of the distance from initialization should be introduced with a self-contained definition before its first use in the main theorem statement.
  2. [Section 5] The empirical section should report the precise data regime (number of samples, width, activation, initialization variance) and the baseline bounds against which non-vacuousness is claimed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help us strengthen the presentation of our results. We address each major comment below and will revise the manuscript to incorporate the requested clarifications.

read point-by-point responses
  1. Referee: [Section 3] Section 3 (upper-bound derivation): the peeling step that converts the Rademacher complexity into a path-norm term must be inspected for its precise dependence on the activation Lipschitz constant L. If L appears only as a multiplicative factor independent of width and initialization scale, the non-vacuous claim holds; otherwise the bound reverts to being vacuous in the overparameterized regime where L is not tiny. The manuscript should explicitly state the algebraic dependence on L after the peeling step.

    Authors: We thank the referee for highlighting this point. In the peeling argument of Section 3, the Rademacher complexity is bounded by a term that depends linearly on the Lipschitz constant L of the activation (specifically, a multiplicative factor of L appears after the peeling step, with no further dependence on width or initialization scale beyond the path-norm of the distance from initialization). This linear factor can be normalized by rescaling the loss function, preserving the non-vacuous character of the bounds in the overparameterized regime. We will add an explicit statement of this algebraic dependence immediately following the peeling step and in the statement of the main upper bound theorem. revision: yes

  2. Referee: [Theorem 4.1] Theorem 4.1 (lower bound): the tightness claim 'up to a constant factor' should be accompanied by an explicit statement of the constant and the regime (width, depth=2, initialization variance) under which the lower bound matches the upper bound; without this, it is unclear whether the lower bound rules out substantially tighter analyses.

    Authors: We agree that making the tightness claim fully explicit improves clarity. The lower bound in Theorem 4.1 matches the upper bound up to a universal constant factor (independent of width) in the regime of depth-2 networks with width W tending to infinity and standard Gaussian initialization with variance 1/W. We will revise the theorem statement and the accompanying discussion to include this explicit constant and regime description. revision: yes

Circularity Check

0 steps flagged

No significant circularity; the derivation relies on a novel peeling technique and a direct path-norm definition.

full rationale

The paper derives initialization-dependent Rademacher complexity bounds for shallow networks by introducing a new peeling argument that controls complexity via the path-norm of the distance from initialization. This path-norm is defined directly from the network parameters and fixed initialization without reference to the target bound or any fitted quantity. No step reduces a claimed prediction to a self-citation chain, an ansatz imported from the authors' prior work, or a parameter fitted to the same data being bounded. The empirical section compares the resulting bounds to observed generalization but does not feed back into the derivation. The analysis is therefore self-contained against external mathematical benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The bounds rest on standard Lipschitz continuity of the activation and on the assumption that the path-norm distance from initialization is the right complexity measure; no explicit free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Activation functions are Lipschitz continuous.
    Stated as 'general Lipschitz activation functions' in the abstract; required for the complexity control.
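In symbols, the ledger's single axiom is the standard Lipschitz condition:

    \[
      |\sigma(a) - \sigma(b)| \;\le\; L\,|a - b| \qquad \text{for all } a, b \in \mathbb{R}.
    \]

ReLU and tanh both satisfy this with L = 1; the condition is what licenses the contraction-style complexity control used throughout.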

pith-pipeline@v0.9.0 · 5467 in / 1154 out tokens · 28075 ms · 2026-05-13T23:23:51.431376+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
