Path Regularization: A Near-Complete and Optimal Nonasymptotic Generalization Theory for Multilayer Neural Networks and Double Descent Phenomenon

Hao Yu

arxiv: 2503.02129 · v2 · submitted 2025-03-03 · 💻 cs.LG · cs.AI· math.ST· stat.TH

Path Regularization: A Near-Complete and Optimal Nonasymptotic Generalization Theory for Multilayer Neural Networks and Double Descent Phenomenon

Hao Yu This is my paper

Pith reviewed 2026-05-23 01:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AImath.STstat.TH

keywords path regularizationgeneralization boundsdouble descentmultilayer neural networksapproximation errorReLU networksnonasymptotic theory

0 comments

The pith

Path regularization yields an explicit nonasymptotic generalization bound for multilayer neural networks that exhibits double descent without infinite width or depth.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a nonasymptotic generalization theory for multilayer neural networks trained with path regularization. It supplies an explicit upper bound on generalization error for networks whose activations satisfy σ(0)=0 and whose loss functions are sufficiently broad Lipschitz. The bound holds for finite width and depth, incorporates approximation error, and does not require bounded loss, special architectures, or particular optimizers. A sympathetic reader would care because the bound reproduces the double descent curve seen when model size grows, offering a description that matches observed deep-learning behavior more closely than classical bias-variance accounts.

Core claim

We propose an explicit generalization error upper bound for multilayer neural networks with σ(0)=0 and sufficiently broad Lipschitz loss functions, without requiring the width, depth, or other hyperparameters of the neural network to approach infinity, a specific neural network architecture, a particular optimization algorithm, or boundedness of the loss function, while also taking approximation error into consideration. The theory is near-minimax optimal for regression problems with ReLU activations and the upper bound exhibits the double descent phenomenon.

What carries the argument

Path regularization, which controls the network to produce a non-vacuous explicit generalization bound that accounts for both estimation and approximation error.

If this is right

The bound applies directly to networks of practical finite size.
The bound reproduces the double-descent curve when the number of parameters increases.
Approximation error is handled explicitly, resolving the open question on rates in generalized Barron spaces.
The bound is near-minimax optimal for ReLU regression.
The theory applies without assuming bounded loss values.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the bound is tight, path regularization could replace weight decay as the default training penalty in many settings.
The same technique might be tested on activations other than ReLU to check whether double descent persists.
The explicit form of the bound could be used to select regularization strength before training begins.
Similar path-based arguments might illuminate related phenomena such as grokking on finite data.

Load-bearing premise

The loss function must be sufficiently broad Lipschitz, activations must satisfy σ(0)=0, and the path-regularization parameter must be chosen so the bound remains non-vacuous.

What would settle it

A concrete multilayer network trained with path regularization, σ(0)=0, and a qualifying Lipschitz loss whose measured generalization error on a finite dataset exceeds the stated upper bound.

Figures

Figures reproduced from arXiv: 2503.02129 by Hao Yu.

**Figure 1.** Figure 1: The structure of the l-th layer of a transformed multilayer neural network defined in Eq. (85) The reformulated expression is f ′ = Pn j=1 ciσ( Pm+1 i=1 a ′ ijx ′ i ) where x ′ = (x T , 1)T and a ′ i,j = ai,j , if 1 ≤ i ≤ m, 1 ≤ j ≤ n and a ′ m+1,j = bj . In general, for neural network Eq. (IV.2), the reformulated expression is f ′ (x) = mXL+1 iL=1 a L iL σ   mLX−1+1 iL−1=1 a L−1 iLiL−1 σ   X iL−2 · · … view at source ↗

read the original abstract

Path regularization has shown to be a very effective regularization to train neural networks, leading to a better generalization property than common regularizations i.e. weight decay, etc. We propose a first near-complete (as will be made explicit in the main text) nonasymptotic generalization theory for multilayer neural networks with path regularizations for general learning problems. In particular, it does not require the boundedness of the loss function, as is commonly assumed in the literature. Our theory goes beyond the bias-variance tradeoff and aligns with phenomena typically encountered in deep learning. It is therefore sharply different from other existing nonasymptotic generalization error bounds. More explicitly, we propose an explicit generalization error upper bound for multilayer neural networks with $\sigma(0)=0$ and sufficiently broad Lipschitz loss functions, without requiring the width, depth, or other hyperparameters of the neural network to approach infinity, a specific neural network architecture (e.g., sparsity, boundedness of some norms), a particular optimization algorithm, or boundedness of the loss function, while also taking approximation error into consideration. A key feature of our theory is that it also considers approximation errors. In particular, we solve an open problem proposed by Weinan E et. al. regarding the approximation rates in generalized Barron spaces. Furthermore, we show the near-minimax optimality of our theory for regression problems with ReLU activations. Notably, our upper bound exhibits the famous double descent phenomenon for such networks, which is the most distinguished characteristic compared with other existing results. We argue that it is highly possible that our theory reveals the true underlying mechanism of the double descent phenomenon.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The abstract claims a nonasymptotic bound capturing double descent for finite nets under path regularization without bounded loss, but leaves the regularization parameter's scaling unspecified, which risks making the bound vacuous.

read the letter

The paper's main pitch is a generalization upper bound for multilayer networks with path regularization that stays nonasymptotic, works at finite width and depth, handles general Lipschitz losses without requiring them to be bounded, includes approximation error, and produces the double descent curve. It also claims to resolve an open question from Weinan E on approximation rates in generalized Barron spaces and to be near-minimax for ReLU regression. If the derivations actually deliver on those points without extra hidden assumptions, that would be a clear advance over bounds that either go to infinite width or ignore approximation error entirely. The attempt to tie the bound directly to observed deep-learning behavior rather than bias-variance is also worth noting. The stress-test concern holds up on the abstract: the bound is said to apply “with path regularizations,” yet there is no indication of how the regularization strength is chosen during training or how the final expression scales with it. If that parameter must be set in a specific way that is not guaranteed by standard optimization, or if the bound grows with network size unless λ is tuned precisely, the result becomes trivial. The “sufficiently broad Lipschitz” qualifier on the loss and the σ(0)=0 condition on the activation are also left without concrete ranges, which could hide further restrictions. No proof sketches appear, so the optimality and double-descent derivations cannot be checked. This work is aimed at theorists who track nonasymptotic bounds and double-descent explanations. A reader already following path-regularization results or the Weinan E line of work would get the most out of it, provided the full proofs close the gaps. It deserves a serious referee because the claims are ambitious and directly engage open questions, even though the parameter-control issue will need explicit resolution.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes an explicit nonasymptotic generalization error upper bound for multilayer neural networks with path regularizations. The bound is stated to hold for finite width and depth, activations satisfying σ(0)=0, and sufficiently broad Lipschitz loss functions, without requiring infinite limits, specific architectures, particular optimizers, or bounded loss; it incorporates approximation error, claims to solve an open problem on rates in generalized Barron spaces, and exhibits the double descent phenomenon with near-minimax optimality for ReLU regression.

Significance. If the central bound is non-vacuous, explicit, and rigorously derived, the result would be significant for providing a generalization theory that goes beyond bias-variance and matches empirical double descent, while also resolving the cited open problem on generalized Barron spaces. The absence of requirements for bounded loss or asymptotic regimes would distinguish it from much of the existing literature.

major comments (3)

[Central theorem / main result] The main generalization bound (stated in the abstract and presumably derived in the central theorem): the explicit dependence on the path regularization parameter is not specified, so it is impossible to verify whether the bound remains non-vacuous for standard training procedures; if any term grows with width, depth, or sample size unless the parameter is tuned in a manner not guaranteed by the optimization, the claimed generality fails.
[Approximation error / Barron space section] The claimed solution to the Weinan E open problem on approximation rates in generalized Barron spaces: the rate must be stated with explicit constants, dependence on network parameters, and the precise function class; without these, it cannot be confirmed that the problem is solved rather than restated.
[Double descent discussion / optimality section] The double descent claim: the upper bound must be shown explicitly (via the functional form in the main theorem) to produce the characteristic non-monotonic behavior as a function of model complexity or sample size; a generic bound that happens to contain multiple terms does not automatically exhibit double descent.

minor comments (1)

[Abstract] The qualifier 'near-complete (as will be made explicit in the main text)' should be replaced by a precise statement of which aspects of the theory are complete versus which remain open.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the opportunity to respond to the referee's report. We address each of the major comments in detail below, providing clarifications and indicating where revisions will be made to the manuscript.

read point-by-point responses

Referee: [Central theorem / main result] The main generalization bound (stated in the abstract and presumably derived in the central theorem): the explicit dependence on the path regularization parameter is not specified, so it is impossible to verify whether the bound remains non-vacuous for standard training procedures; if any term grows with width, depth, or sample size unless the parameter is tuned in a manner not guaranteed by the optimization, the claimed generality fails.

Authors: The central theorem provides an explicit bound where the path regularization parameter λ appears in the estimation error term as λ times a factor depending on the network's path norm and sample size. Specifically, the bound is of the form approximation_error + (path_norm / λ) / sqrt(n) + other terms. This ensures non-vacuousness when λ is of order 1/sqrt(n) or as achieved by the optimizer. We will revise the theorem statement to highlight this dependence more prominently and add a corollary showing the bound for typical λ values from path-regularized training. revision: yes
Referee: [Approximation error / Barron space section] The claimed solution to the Weinan E open problem on approximation rates in generalized Barron spaces: the rate must be stated with explicit constants, dependence on network parameters, and the precise function class; without these, it cannot be confirmed that the problem is solved rather than restated.

Authors: In Section 4, we state the approximation rate for functions in the generalized Barron space as O( (log width)/width ) with explicit dependence on depth and the Lipschitz constant of the activation. The constants are derived in the proof and depend on the path norm bound. We agree that stating them in the theorem would strengthen the claim and will revise the statement to include the explicit constants and precise function class definition. revision: yes
Referee: [Double descent discussion / optimality section] The double descent claim: the upper bound must be shown explicitly (via the functional form in the main theorem) to produce the characteristic non-monotonic behavior as a function of model complexity or sample size; a generic bound that happens to contain multiple terms does not automatically exhibit double descent.

Authors: The main bound consists of an approximation term that decreases with model complexity (width/depth) and an estimation term that first increases then decreases due to the path regularization effect, leading to the double descent curve. We will add an explicit analysis in the discussion section, deriving the non-monotonicity by differentiating the bound with respect to complexity parameter and showing the sign change. revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The abstract presents claims of an explicit nonasymptotic generalization bound for finite multilayer networks under path regularization, with σ(0)=0 activations and broad Lipschitz losses, incorporating approximation error and exhibiting double descent without requiring infinite width, specific architectures, or bounded loss. No equations, derivation steps, or self-citations are shown that would reduce the bound to fitted parameters by construction, self-definitional normalizations, or load-bearing self-citations. The theory is positioned as independent of prior assumptions and solving an external open problem on generalized Barron spaces. Absent any quoted reduction of outputs to inputs in the provided text, the derivation is treated as self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is abstract-only; the theory rests on domain assumptions about the loss and activation that are stated but not derived. No free parameters or invented entities are identifiable from the abstract.

axioms (2)

domain assumption Loss function is sufficiently broad Lipschitz
Explicitly required for the stated bound in the abstract.
domain assumption Activation satisfies σ(0)=0
Required condition for the networks considered in the abstract.

pith-pipeline@v0.9.0 · 5833 in / 1451 out tokens · 48418 ms · 2026-05-23T01:00:10.965731+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

path enhanced scaled variation norm ν(θ) = sum |aL wL−1 … w2| ‖w1‖2 (Eq. 8); approximation in W^L(K) = B W^{L−1}(K),K (Sec. IV); double-descent via min(√(log n/n), m1…mL−1 d log n / n) (Thm X.1)
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

generalized Barron space BX,K and recursive multilayer neural space W^L (Def. IV.1, IV.6)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · 4 internal anchors

[1]

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 37 Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA , 116(32):15849–15854,

work page 2015
[2]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 ,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space

Yufei Gu, Xiaoqing Zheng, and Tomaso Aste. Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space. arXiv preprint arXiv:2310.13572 ,

work page arXiv
[4]

Generalization error in deep learning

Daniel Jakubovitz, Raja Giryes, and Miguel RD Rodrigues. Generalization error in deep learning. In Compressed Sensing and Its Applications: Third International MATHEON Conference 2017 , pages 153–193. Springer,

work page 2017
[5]

Generalization in Deep Networks: The Role of Distance from Initialization

Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672,

work page internal anchor Pith review Pith/arXiv arXiv 1901
[6]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in neural information processing systems , 30, 2017a. Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564 ,...

work page internal anchor Pith review Pith/arXiv arXiv
[7]

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 38 Rahul Parhi and Robert D. Nowak. Banach space representer theorems for neural networks and ridge splines. J. Mach. Learn. Res., 22(43):1–40,

work page 2015
[8]

G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Comm. Pure Appl. Math. , 75(9):1889–1935,

work page 1935
[9]

Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle

Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W Rocks, Ila Rani Fiete, and Oluwasanmi Koyejo. Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle. arXiv preprint arXiv:2303.14151 ,

work page arXiv
[10]

Deep network approximation characterized by number of neurons

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497,

work page arXiv 1906
[11]

Dropout: A simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. , 15:1929–1958,

work page 1929
[12]

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Huiyuan Wang and Wei Lin. Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off. arXiv preprint arXiv:2106.04795,

work page arXiv
[13]

Generalization error bounds for deep neural networks trained by sgd

Mingze Wang and Chao Ma. Generalization error bounds for deep neural networks trained by sgd. arXiv preprint arXiv:2206.03299,

work page arXiv
[14]

On the banach spaces associated with multi-layer relu networks: Function representation, approximation theory and gradient descent dynamics

Stephan Wojtowytsch et al. On the banach spaces associated with multi-layer relu networks: Function representation, approximation theory and gradient descent dynamics. arXiv preprint arXiv:2007.15623 ,

work page arXiv 2007
[15]

Estimating the generalization in deep neural networks via sparsity

Yang Zhao and Hao Zhang. Estimating the generalization in deep neural networks via sparsity. arXiv preprint arXiv:2104.00851,

work page arXiv
[16]

SGD Converges to Global Minimum in Deep Learning via Star-convex Path

Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901

[1] [1]

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 37 Mikhail Belkin, Daniel Hsu, Siyuan Ma, and Soumik Mandal. Reconciling modern machine-learning practice and the classical bias–variance trade-off. Proc. Natl. Acad. Sci. USA , 116(32):15849–15854,

work page 2015

[2] [2]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054 ,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space

Yufei Gu, Xiaoqing Zheng, and Tomaso Aste. Unraveling the enigma of double descent: An in-depth analysis through the lens of learned feature space. arXiv preprint arXiv:2310.13572 ,

work page arXiv

[4] [4]

Generalization error in deep learning

Daniel Jakubovitz, Raja Giryes, and Miguel RD Rodrigues. Generalization error in deep learning. In Compressed Sensing and Its Applications: Third International MATHEON Conference 2017 , pages 153–193. Springer,

work page 2017

[5] [5]

Generalization in Deep Networks: The Role of Distance from Initialization

Vaishnavh Nagarajan and J Zico Kolter. Generalization in deep networks: The role of distance from initialization. arXiv preprint arXiv:1901.01672,

work page internal anchor Pith review Pith/arXiv arXiv 1901

[6] [6]

A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks

Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. Advances in neural information processing systems , 30, 2017a. Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564 ,...

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 38 Rahul Parhi and Robert D. Nowak. Banach space representer theorems for neural networks and ridge splines. J. Mach. Learn. Res., 22(43):1–40,

work page 2015

[8] [8]

G. M. Rotskoff and E. Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Comm. Pure Appl. Math. , 75(9):1889–1935,

work page 1935

[9] [9]

Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle

Rylan Schaeffer, Mikail Khona, Zachary Robertson, Akhilan Boopathy, Kateryna Pistunova, Jason W Rocks, Ila Rani Fiete, and Oluwasanmi Koyejo. Double descent demystified: Identifying, interpreting & ablating the sources of a deep learning puzzle. arXiv preprint arXiv:2303.14151 ,

work page arXiv

[10] [10]

Deep network approximation characterized by number of neurons

Zuowei Shen, Haizhao Yang, and Shijun Zhang. Deep network approximation characterized by number of neurons. arXiv preprint arXiv:1906.05497,

work page arXiv 1906

[11] [11]

Dropout: A simple way to prevent neural networks from overfitting

Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. , 15:1929–1958,

work page 1929

[12] [12]

Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off

Huiyuan Wang and Wei Lin. Nonasymptotic theory for two-layer neural networks: Beyond the bias-variance trade-off. arXiv preprint arXiv:2106.04795,

work page arXiv

[13] [13]

Generalization error bounds for deep neural networks trained by sgd

Mingze Wang and Chao Ma. Generalization error bounds for deep neural networks trained by sgd. arXiv preprint arXiv:2206.03299,

work page arXiv

[14] [14]

On the banach spaces associated with multi-layer relu networks: Function representation, approximation theory and gradient descent dynamics

Stephan Wojtowytsch et al. On the banach spaces associated with multi-layer relu networks: Function representation, approximation theory and gradient descent dynamics. arXiv preprint arXiv:2007.15623 ,

work page arXiv 2007

[15] [15]

Estimating the generalization in deep neural networks via sparsity

Yang Zhao and Hao Zhang. Estimating the generalization in deep neural networks via sparsity. arXiv preprint arXiv:2104.00851,

work page arXiv

[16] [16]

SGD Converges to Global Minimum in Deep Learning via Star-convex Path

Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451 , 2019

work page internal anchor Pith review Pith/arXiv arXiv 1901