Generalization error bounds for two-layer neural networks with Lipschitz loss function
Pith reviewed 2026-05-10 19:06 UTC · model grok-4.3
The pith
Two-layer neural networks trained with a Lipschitz loss admit O(n^{-1/2}) generalization error bounds derived via Wasserstein estimates, without requiring the loss to be bounded.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Generalization error bounds of order O(n^{-1/2}) hold for two-layer networks trained with a Lipschitz loss when test samples are independent, and bounds of order O(n^{-1/(d_in + d_out)}) hold without that independence; both follow from Wasserstein estimates on the discrepancy between the data law and its empirical measure, together with moment control on the stochastic-gradient iterates, and the explicit coefficients can be evaluated prior to training.
What carries the argument
Wasserstein distance between a probability distribution and its empirical measure, combined with moment bounds on the stochastic-gradient iterates.
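A minimal sketch, in our notation rather than the paper's, of how this machinery typically combines: by Kantorovich-Rubinstein duality, $W_1(\mu, \mu_n)$ is the supremum of $\int f \, d\mu - \int f \, d\mu_n$ over 1-Lipschitz $f$, so any loss $\ell(\theta, \cdot)$ that is $L_\ell$-Lipschitz in the data variable satisfies

$$\Bigl| \mathbb{E}_{z \sim \mu}[\ell(\theta, z)] - \frac{1}{n} \sum_{i=1}^{n} \ell(\theta, z_i) \Bigr| \le L_\ell \, W_1(\mu, \mu_n),$$

even when $\theta$ depends on the sample. The moment bounds on the stochastic-gradient iterates then control the effective Lipschitz constant in expectation, and the two advertised rates follow from how fast $W_1(\mu, \mu_n)$ shrinks in each sampling regime.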
If this is right
- The bounds remain valid for unbounded loss functions provided the Lipschitz condition holds.
- The explicit constants allow a priori selection of network width or step size before training starts (see the sketch after this list).
- The same Wasserstein-plus-moment machinery produces rates under both independent and dependent sampling regimes.
- Simulations reported in the paper match the predicted scaling.
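To make the a priori usability concrete, here is a minimal Python sketch of evaluating a bound of the advertised shape before training; the function name and the placeholder coefficient are ours, not the paper's explicit constants.

def a_priori_bound(n, lipschitz_const, moment_const,
                   d_in=None, d_out=None, independent=True):
    # Hedged sketch: a bound of the form C * rate(n), where
    # lipschitz_const * moment_const stands in for the paper's
    # explicit coefficients, which we do not reproduce here.
    if independent:
        rate = n ** -0.5                        # dimension-free O(n^{-1/2})
    else:
        rate = n ** (-1.0 / (d_in + d_out))     # O(n^{-1/(d_in + d_out)})
    return lipschitz_const * moment_const * rate

# Compare candidate sample sizes before any training happens.
for n in (10_000, 100_000):
    print(n, a_priori_bound(n, lipschitz_const=1.0, moment_const=5.0))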
Where Pith is reading between the lines
- The dimension dependence in the non-independent case suggests that imposing weak dependence conditions on the data could recover faster rates.
- The explicit pre-training computability of the bounds makes them usable for model-selection or early-stopping rules.
- If analogous moment bounds can be proved for deeper architectures, the same Wasserstein argument would extend the results beyond two layers.
Load-bearing premise
The loss is Lipschitz continuous and the stochastic-gradient iterates satisfy explicit moment bounds.
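A hedged rendering of this premise in symbols (our notation): $|\ell(\theta, z) - \ell(\theta, z')| \le L_\ell \|z - z'\|$ for all $\theta$, and the stochastic-gradient iterates $(\theta_k)_{k \ge 0}$ satisfy, for some order $q \ge 1$ and an explicit constant $C_q$,

$$\sup_{k \ge 0} \mathbb{E}\bigl[ \|\theta_k\|^q \bigr] \le C_q < \infty.$$

The moment condition is what substitutes for boundedness of the loss: it keeps the bound's coefficients finite without capping the loss values themselves.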
What would settle it
Numerical runs in which the measured generalization gap for independent test data decays more slowly than n^{-1/2} as n grows would contradict the claimed rate.
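A hedged sketch of that falsification check in Python, with placeholder gap measurements standing in for actual runs: fit the decay exponent of the gap against n on a log-log scale and compare it to -1/2.

import numpy as np

ns = np.array([1_000, 4_000, 16_000, 64_000])    # sample sizes
gaps = np.array([0.052, 0.027, 0.013, 0.0068])   # placeholder measured gaps
# Least-squares slope of log(gap) against log(n); an exponent clearly
# above -0.5 (i.e., slower decay) would contradict the O(n^{-1/2}) claim.
slope, intercept = np.polyfit(np.log(ns), np.log(gaps), 1)
print(f"fitted decay exponent: {slope:.3f} (claimed: -0.5)")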
Original abstract
We derive generalization error bounds for the training of two-layer neural networks without assuming boundedness of the loss function, using Wasserstein distance estimates on the discrepancy between a probability distribution and its associated empirical measure, together with moment bounds for the associated stochastic gradient method. In the case of independent test data, we obtain a dimension-free rate of order $O(n^{-1/2} )$ on the $n$-sample generalization error, whereas without independence assumption, we derive a bound of order $O(n^{-1 / ( d_{\rm in}+d_{\rm out} )} )$, where $d_{\rm in}$, $d_{\rm out}$ denote input and output dimensions. Our bounds and their coefficients can be explicitly computed prior to the training of the model, and are confirmed by numerical simulations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript derives explicit generalization error bounds for two-layer neural networks with Lipschitz loss functions, without assuming bounded loss. The approach combines Wasserstein-1 distance estimates between the data measure and its empirical counterpart with moment bounds on the SGD trajectory. Under independent test data the bound is dimension-free of order O(n^{-1/2}); without independence the rate is O(n^{-1/(d_in + d_out)}). The coefficients are claimed to be computable before training and the bounds are supported by numerical experiments.
Significance. If the derivations hold, the work is significant for relaxing the bounded-loss assumption common in generalization theory while still obtaining explicit, pre-training bounds. The dimension-free rate under independence is practically relevant for high-dimensional data, and the use of standard Wasserstein and SGD-moment tools is coherent. Numerical confirmation strengthens the contribution.
Minor comments (3)
- §2.2, Assumption 2.1: the moment bounds on the SGD iterates are stated as holding but their verification for the two-layer architecture is only sketched; a self-contained lemma or reference to a prior result with matching constants would strengthen the claim.
- Figure 1 and Table 1: the plotted and tabulated bounds use a specific choice of Lipschitz constant L=1; it is unclear how sensitive the numerical agreement is to larger L, which should be clarified for readers who wish to apply the bounds.
- Notation: d_in and d_out are introduced in the abstract but first defined only in §3; adding a brief parenthetical in the abstract would improve readability.
Simulated Author's Rebuttal
We thank the referee for their positive summary, recognition of the significance of our results (particularly the relaxation of the bounded-loss assumption and the dimension-free rate under independence), and the recommendation of minor revision. No specific major comments were provided in the report.
Circularity Check
No significant circularity
Full rationale
The derivation relies on external Wasserstein distance estimates between a probability measure and its empirical counterpart, combined with moment bounds on the SGD trajectory for the two-layer network. These are standard tools from optimal transport and stochastic approximation theory, applied to a Lipschitz loss without boundedness assumptions. The O(n^{-1/2}) rate under independent test data follows from scalar concentration inequalities under finite moments, while the slower rate without independence uses known finite-dimensional W_1 convergence rates. Neither rate is obtained by fitting parameters to the target generalization error or by self-referential definitions; the bounds are explicitly computable a priori and confirmed numerically, with no load-bearing self-citations or ansatz smuggling indicated in the provided text.
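For reference, the finite-dimensional convergence rate invoked here is, in its standard form [7]: for a measure $\mu$ on $\mathbb{R}^d$ with sufficiently many finite moments and $d > 2$,

$$\mathbb{E}\bigl[ W_1(\mu_n, \mu) \bigr] \le C \, n^{-1/d},$$

so taking $d = d_{\rm in} + d_{\rm out}$ for the joint input-output law is consistent with the slower rate quoted in the dependent case.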
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: the loss function is Lipschitz continuous.
- Domain assumption: moment bounds hold for the iterates of the stochastic gradient method.
Reference graph
Works this paper leans on
- [1] G. Aminian, S.N. Cohen, and Ł. Szpruch. Mean-field analysis of generalization errors. Preprint arXiv:2306.11623, 2023.
- [2] S. Arora, S.S. Du, W. Hu, Z. Li, and R. Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322--332. PMLR, 2019.
- [3] Z. Allen-Zhu, Y. Li, and Y. Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in Neural Information Processing Systems, volume 32, 2019.
- [4] Y. Cao and Q. Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. Advances in Neural Information Processing Systems, 32:10836--10846, 2019.
- [5] G.K. Dziugaite and D.M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Preprint arXiv:1703.11008, 2017.
- [6]
- [7] N. Fournier and A. Guillin. On the rate of convergence in Wasserstein distance of the empirical measure. Probability Theory and Related Fields, 162(3):707--738, 2015.
- [8] W. Hoeffding. Probability inequalities for sums of bounded random variables. J. Amer. Statist. Assoc., 58(301):13--30, 1963.
- [9] M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225--1234. PMLR, 2016.
- [10] K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification. In Proceedings of the IEEE International Conference on Computer Vision, pages 1026--1034, 2015.
- [11] K. Kawaguchi, L.P. Kaelbling, and Y. Bengio. Generalization in deep learning. Preprint arXiv:1710.05468, 2017.
- [12] L.V. Kantorovich and G.S. Rubinshtein. On a space of completely additive functions. Vestnik Leningrad. Univ., 13(7):52--59, 1958.
- [13] A.T. Lopez and V. Jog. Generalization error bounds using Wasserstein distances. In 2018 IEEE Information Theory Workshop (ITW), pages 1--5. IEEE, 2018.
- [14] S. Mei, T. Misiakiewicz, and A. Montanari. Mean-field theory of two-layers neural networks: dimension-free bounds and kernel limit. In Proceedings of the 32nd Annual Conference on Learning Theory, volume 99 of Proceedings of Machine Learning Research, pages 1--77, 2019.
- [15] W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605--638. PMLR, 2018.
- [16] B. Neyshabur, S. Bhojanapalli, and N. Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. Preprint arXiv:1707.09564, 2017.
- [17] G. Neu, G.K. Dziugaite, M. Haghifam, and D.M. Roy. Information-theoretic generalization bounds for stochastic gradient descent. In Conference on Learning Theory, pages 3526--3545. PMLR, 2021.
- [18] B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2018.
- [19] B. Neyshabur, R. Tomioka, and N. Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376--1401. PMLR, 2015.
- [20] A. Pensia, V. Jog, and P.L. Loh. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pages 546--550. IEEE, 2018.
- [21] S. Park, U. Simsekli, and M.A. Erdogdu. Generalization bounds for stochastic gradient descent via localized ε-covers. In Advances in Neural Information Processing Systems, volume 35, pages 2790--2802, 2022.
- [22] M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In Conference on Learning Theory, pages 1674--1703. PMLR, 2017.
- [23] M. Rudelson and R. Vershynin. Non-asymptotic theory of random matrices: extreme singular values. In Proceedings of the International Congress of Mathematicians 2010 (ICM 2010), pages 1576--1602. World Scientific, 2010.
- [24] H. Wang, M. Diaz, J.C.S. Santos Filho, and F.P. Calmon. An information-theoretic view of generalization via Wasserstein distance. In 2019 IEEE International Symposium on Information Theory (ISIT), pages 577--581. IEEE, 2019.
- [25] P. Wang, Y. Lei, D. Wang, Y. Ying, and D.-X. Zhou. Generalization guarantees of gradient descent for shallow neural networks. Neural Computation, 37(1):1--45, 2025.
- [26]
- [27] Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364--2373. PMLR, 2022.