Generalization error bounds for two-layer neural networks with Lipschitz loss function
Pith reviewed 2026-05-10 19:06 UTC · model grok-4.3
The pith
Two-layer neural networks trained with a Lipschitz loss admit O(n^{-1/2}) generalization error bounds derived via Wasserstein estimates, without requiring the loss to be bounded.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generalization error bounds of order O(n^{-1/2}) hold for two-layer networks trained with a Lipschitz loss when test samples are independent, and bounds of order O(n^{-1/(d_in + d_out)}) hold without that independence; both follow from Wasserstein estimates on the discrepancy between the data law and its empirical measure, together with moment control on the stochastic-gradient iterates, and the explicit coefficients can be evaluated prior to training.
What carries the argument
Wasserstein distance between a probability distribution and its empirical measure, combined with moment bounds on the stochastic-gradient iterates.
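A minimal sketch, in our notation rather than the paper's, of how this machinery typically combines: by Kantorovich-Rubinstein duality, $W_1(\mu, \mu_n)$ is the supremum of $\int f \, d\mu - \int f \, d\mu_n$ over 1-Lipschitz $f$, so any loss $\ell(\theta, \cdot)$ that is $L_\ell$-Lipschitz in the data variable satisfies

$$\Bigl| \mathbb{E}_{z \sim \mu}[\ell(\theta, z)] - \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, z_i) \Bigr| \le L_\ell \, W_1(\mu, \mu_n),$$

even when $\theta$ depends on the sample. The moment bounds on the stochastic-gradient iterates then control the effective Lipschitz constant in expectation, and the two advertised rates follow from how fast $W_1(\mu, \mu_n)$ shrinks in each sampling regime.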
If this is right
- The bounds remain valid for unbounded loss functions provided the Lipschitz condition holds.
- The explicit constants allow a priori selection of network width or step size before training starts (see the sketch after this list).
- The same Wasserstein-plus-moment machinery produces rates under both independent and dependent sampling regimes.
- Simulations reported in the paper match the predicted scaling.
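To make the a priori usability concrete, here is a minimal Python sketch of evaluating a bound of the advertised shape before training; the function name and the placeholder coefficient are ours, not the paper's explicit constants.

def a_priori_bound(n, lipschitz_const, moment_const,
                   d_in=None, d_out=None, independent=True):
    # Hedged sketch: a bound of the form C * rate(n), where
    # lipschitz_const * moment_const stands in for the paper's
    # explicit coefficients, which we do not reproduce here.
    if independent:
        rate = n ** -0.5                        # dimension-free O(n^{-1/2})
    else:
        rate = n ** (-1.0 / (d_in + d_out))     # O(n^{-1/(d_in + d_out)})
    return lipschitz_const * moment_const * rate

# Compare candidate sample sizes before any training happens.
for n in (10_000, 100_000):
    print(n, a_priori_bound(n, lipschitz_const=1.0, moment_const=5.0))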
Where Pith is reading between the lines
- The dimension dependence in the non-independent case suggests that imposing weak dependence conditions on the data could recover faster rates.
- The explicit pre-training computability of the bounds makes them usable for model-selection or early-stopping rules.
- If analogous moment bounds can be proved for deeper architectures, the same Wasserstein argument would extend the results beyond two layers.
Load-bearing premise
The loss is Lipschitz continuous and the stochastic-gradient iterates satisfy explicit moment bounds.
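A hedged rendering of this premise in symbols (our notation): $|\ell(\theta, z) - \ell(\theta, z')| \le L_\ell \|z - z'\|$ for all $\theta$, and the stochastic-gradient iterates $(\theta_k)_{k \ge 0}$ satisfy, for some order $q \ge 1$ and an explicit constant $C_q$,

$$\sup_{k \ge 0} \mathbb{E}\bigl[ \|\theta_k\|^q \bigr] \le C_q < \infty.$$

The moment condition is what substitutes for boundedness of the loss: it keeps the bound's coefficients finite without capping the loss values themselves.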
What would settle it
Numerical runs in which the measured generalization gap for independent test data decays more slowly than n^{-1/2} as n grows would contradict the claimed rate.
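A hedged sketch of that falsification check in Python, with placeholder gap measurements standing in for actual runs: fit the decay exponent of the gap against n on a log-log scale and compare it to -1/2.

import numpy as np

ns = np.array([1_000, 4_000, 16_000, 64_000])    # sample sizes
gaps = np.array([0.052, 0.027, 0.013, 0.0068])   # placeholder measured gaps
# Least-squares slope of log(gap) against log(n); an exponent clearly
# above -0.5 (i.e., slower decay) would contradict the O(n^{-1/2}) claim.
slope, intercept = np.polyfit(np.log(ns), np.log(gaps), 1)
print(f"fitted decay exponent: {slope:.3f} (claimed: -0.5)")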
Original abstract
We derive generalization error bounds for the training of two-layer neural networks without assuming boundedness of the loss function, using Wasserstein distance estimates on the discrepancy between a probability distribution and its associated empirical measure, together with moment bounds for the associated stochastic gradient method. In the case of independent test data, we obtain a dimension-free rate of order $O(n^{-1/2} )$ on the $n$-sample generalization error, whereas without independence assumption, we derive a bound of order $O(n^{-1 / ( d_{\rm in}+d_{\rm out} )} )$, where $d_{\rm in}$, $d_{\rm out}$ denote input and output dimensions. Our bounds and their coefficients can be explicitly computed prior to the training of the model, and are confirmed by numerical simulations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives explicit generalization error bounds for two-layer neural networks with Lipschitz loss functions, without assuming bounded loss. The approach combines Wasserstein-1 distance estimates between the data measure and its empirical counterpart with moment bounds on the SGD trajectory. Under independent test data the bound is dimension-free of order O(n^{-1/2}); without independence the rate is O(n^{-1/(d_in + d_out)}). The coefficients are claimed to be computable before training and the bounds are supported by numerical experiments.
Significance. If the derivations hold, the work is significant for relaxing the bounded-loss assumption common in generalization theory while still obtaining explicit, pre-training bounds. The dimension-free rate under independence is practically relevant for high-dimensional data, and the use of standard Wasserstein and SGD-moment tools is coherent. Numerical confirmation strengthens the contribution.
Minor comments (3)
- §2.2, Assumption 2.1: the moment bounds on the SGD iterates are stated as holding but their verification for the two-layer architecture is only sketched; a self-contained lemma or reference to a prior result with matching constants would strengthen the claim.
- Figure 1 and Table 1: the plotted and tabulated bounds use a specific choice of Lipschitz constant L=1; it is unclear how sensitive the numerical agreement is to larger L, which should be clarified for readers who wish to apply the bounds.
- Notation: d_in and d_out are introduced in the abstract but first defined only in §3; adding a brief parenthetical in the abstract would improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the significance of our results (particularly the relaxation of the bounded-loss assumption and the dimension-free rate under independence), and the recommendation of minor revision. No specific major comments were provided in the report.
Circularity Check
No significant circularity
Full rationale
The derivation relies on external Wasserstein distance estimates between a probability measure and its empirical counterpart, combined with moment bounds on the SGD trajectory for the two-layer network. These are standard tools from optimal transport and stochastic approximation theory, applied to a Lipschitz loss without boundedness assumptions. The O(n^{-1/2}) rate under independent test data follows from scalar concentration inequalities under finite moments, while the slower rate without independence uses known finite-dimensional W_1 convergence rates. Neither rate is obtained by fitting parameters to the target generalization error or by self-referential definitions; the bounds are explicitly computable a priori and confirmed numerically, with no load-bearing self-citations or ansatz smuggling indicated in the provided text.
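For reference, the finite-dimensional convergence rate invoked here is, in its standard form [7]: for a measure $\mu$ on $\mathbb{R}^d$ with sufficiently many finite moments and $d > 2$,

$$\mathbb{E}\bigl[ W_1(\mu_n, \mu) \bigr] \le C \, n^{-1/d},$$

so taking $d = d_{\rm in} + d_{\rm out}$ for the joint input-output law is consistent with the slower rate quoted in the dependent case.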
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the loss function is Lipschitz continuous.
- Domain assumption: moment bounds hold for the iterates of the stochastic gradient method.
Reference graph
Works this paper leans on
- [1] G. Aminian, S.N. Cohen, and Ł. Szpruch. Mean-field analysis of generalization errors. Preprint arXiv:2306.11623, 2023.
- [2] S. Arora, S.S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322--332. PMLR, 2019.
- [3] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [4] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836--10846, 2019.
- [5] G.K. Dziugaite and D.M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint arXiv:1703.11008, 2017.
- [6]
- [7] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707--738, 2015.
- [8] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13--30, 1963.
- [9] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225--1234. PMLR, 2016.
- [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026--1034, 2015.
- [11] K. Kawaguchi, L.P. Kaelbling, and Y. Bengio. Generalization in deep learning. Preprint arXiv:1710.05468, 2017.
- [12] L.V. Kantorovich and G.S. Rubinshtein. On a space of completely additive functions. Vestnik Leningrad. Univ., 13(7):52--59, 1958.
- [13] A.T. Lopez and V. Jog. Generalization error bounds using Wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pages 1--5. IEEE, 2018.
- [14] S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Proceedings of the 32nd Annual Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1--77, 2019.
- [15] W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605--638. PMLR, 2018.
- [16] B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint arXiv:1707.09564, 2017.
- [17] G. Neu, G.K. Dziugaite, M. Haghifam, and D.M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Conference on Learning Theory, pages 3526--3545. PMLR, 2021.
- [18] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2018.
- [19] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376--1401. PMLR, 2015.
- [20] A. Pensia, V. Jog, and P.L. Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546--550. IEEE, 2018.
- [21] S. Park, U. Simsekli, and M.A. Erdogdu. Generalization bounds for stochastic gradient descent via localized ε-covers. In Advances in Neural Information Processing Systems, volume 35, pages 2790--2802, 2022.
- [22] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674--1703. PMLR, 2017.
- [23] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576--1602. World Scientific, 2010.
- [24] H. Wang, M. Diaz, J.C.S. Santos Filho, and F.P. Calmon. An information-theoretic view of generalization via Wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 577--581. IEEE, 2019.
- [25] P. Wang, Y. Lei, D. Wang, Y. Ying, and D.-X. Zhou. Generalization guarantees of gradient descent for shallow neural networks. Neural Computation, 37(1):1--45, 2025.
- [26]
- [27] Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364--2373. PMLR, 2022.