Universality in Deep Neural Networks: An approach via the Lindeberg exchange principle

Filippo Giovagnini; Marco Romito; Sotirios Kotitsas

arxiv: 2605.02771 · v1 · submitted 2026-05-04 · 🧮 math.PR · cs.LG · stat.ML

Pith reviewed 2026-05-08 18:29 UTC · model grok-4.3

keywords: deep neural networks · infinite width limit · Lindeberg principle · Wasserstein distance · Gaussian limit · universality · quantitative bounds · activation functions

The pith

Deep neural networks converge quantitatively to Gaussian limits at infinite width via layer-wise Gaussian swaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper bounds how closely the output of a deep fully connected neural network matches a Gaussian random variable when the layer widths become large. It does this by developing a version of the Lindeberg principle tailored to neural networks, which replaces the weights layer by layer with Gaussian ones while controlling the error in Wasserstein distance. The result is a general quantitative version of the universality phenomenon for deep networks, valid under regularity conditions on the activation function. Such bounds help explain why wide networks behave like Gaussian processes, and they supply rates for that approximation.

Core claim

The authors prove quantitative general bounds on the 2-Wasserstein distance between the network output and its infinite-width Gaussian limit. The proof relies on a Lindeberg principle for deep neural networks that successively replaces the weights on each layer by Gaussian random variables, under appropriate regularity assumptions on the activation function.
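For reference, the two objects in this claim can be written out. The 2-Wasserstein distance below is the standard definition. The network recursion is only an assumed, commonly used parameterization (no biases, 1/sqrt(fan-in) scaling, widths $n_0, \dots, n_L$) and may differ from the paper's exact conventions.

```latex
% Standard definition: 2-Wasserstein distance between laws \mu and \nu on R^d,
% where \Pi(\mu,\nu) is the set of couplings with marginals \mu and \nu.
W_2(\mu,\nu) = \Bigl( \inf_{\pi \in \Pi(\mu,\nu)} \int |u - v|^2 \, d\pi(u,v) \Bigr)^{1/2}

% Assumed parameterization (illustrative, not taken from the paper):
% i.i.d. mean-zero, unit-variance entries in each weight matrix W^{(\ell)}.
z^{(1)}(x) = \frac{1}{\sqrt{n_0}} W^{(1)} x, \qquad
z^{(\ell+1)}(x) = \frac{1}{\sqrt{n_\ell}} W^{(\ell+1)} \sigma\bigl(z^{(\ell)}(x)\bigr),
\quad \ell = 1, \dots, L
```

Under this reading, the claim concerns the distance between the law of the output $z^{(L+1)}(x)$ and the law of its Gaussian limit as the hidden widths grow.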

What carries the argument

Lindeberg principle for Deep Neural Networks, which successively replaces weights on each layer by Gaussian random variables to bound the distance to the Gaussian limit.
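The layer-wise swap can be illustrated numerically. The sketch below is a toy illustration, not the paper's proof: it builds hybrid networks whose first k weight matrices are Gaussian and whose remaining ones are Rademacher, then measures an empirical 2-Wasserstein distance between consecutive hybrids. The architecture, the tanh activation, the Rademacher choice, and all function names are assumptions made for the example. A finite-width Gaussian-weight network is also not the infinite-width limit itself, so this shows only the telescoping structure of the exchange.

```python
import numpy as np

def sample_weights(rng, shape, gaussian):
    """Draw i.i.d. mean-zero, unit-variance weights: Gaussian or Rademacher."""
    if gaussian:
        return rng.standard_normal(shape)
    return rng.choice([-1.0, 1.0], size=shape)

def hybrid_outputs(x, widths, n_gaussian_layers, n_samples, rng):
    """Scalar outputs of networks whose first `n_gaussian_layers` weight
    matrices are Gaussian and whose remaining ones are Rademacher."""
    dims = [x.shape[0], *widths, 1]
    out = np.empty(n_samples)
    for s in range(n_samples):
        h = x
        for layer in range(len(dims) - 1):
            W = sample_weights(rng, (dims[layer + 1], dims[layer]),
                               gaussian=layer < n_gaussian_layers)
            z = W @ h / np.sqrt(dims[layer])  # 1/sqrt(fan-in) scaling
            # tanh on hidden layers, linear read-out on the last layer
            h = np.tanh(z) if layer < len(dims) - 2 else z
        out[s] = h[0]
    return out

def empirical_w2(a, b):
    """2-Wasserstein distance between two equal-size 1-D samples
    (quantile coupling: sort both and match order statistics)."""
    return float(np.sqrt(np.mean((np.sort(a) - np.sort(b)) ** 2)))

rng = np.random.default_rng(0)
x = np.ones(8)
widths = [64, 64]
n_layers = len(widths) + 1  # number of weight matrices

# samples[k]: first k weight matrices Gaussian, the rest Rademacher
samples = [hybrid_outputs(x, widths, k, 2000, rng) for k in range(n_layers + 1)]
steps = [empirical_w2(samples[k], samples[k + 1]) for k in range(n_layers)]
total = empirical_w2(samples[0], samples[-1])

print("per-swap distances:", steps)
print("end-to-end distance:", total, "<= sum of steps:", sum(steps))
```

The end-to-end distance is controlled by the sum of the per-swap distances through the triangle inequality, which is the bookkeeping a layer-by-layer exchange relies on. The per-swap values here also include Monte Carlo sampling noise.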

If this is right

The 2-Wasserstein distance between the finite-width network and the Gaussian limit is bounded explicitly in terms of width and depth.
Convergence holds for general weight distributions at initialization, not only Gaussian ones.
The result applies to networks with fixed depth as width grows.
Regularity conditions on the activation ensure the quantitative bounds are valid.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This layer-by-layer exchange method might adapt to prove similar limits for convolutional or residual architectures.
The explicit rates could guide how initialization variances affect the speed of convergence to the Gaussian regime.
The technique offers a template for deriving quantitative universality statements in other iterated random mappings.

Load-bearing premise

The activation function must satisfy appropriate regularity assumptions so that the Lindeberg exchange yields quantitative control on the Wasserstein distance.

What would settle it

An activation function that satisfies the paper's regularity assumptions, yet for which the 2-Wasserstein distance to the Gaussian limit fails to approach zero as the widths grow to infinity, would contradict the bounds.

Original abstract

We consider the infinite-width limit of a fully connected deep neural network with general weights, and we prove quantitative general bounds on the $2$-Wasserstein distance between the network and its infinite-width Gaussian limit, under appropriate regularity assumptions on the activation function. Our main tool is a Lindeberg principle for Deep Neural Networks, which we use to successively replace the weights on each layer by Gaussian random variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper uses layer-by-layer Lindeberg exchanges to get explicit 2-Wasserstein bounds from finite-width deep nets to their Gaussian infinite-width limit.

The main point is that the authors apply the Lindeberg exchange principle successively across layers to bound the 2-Wasserstein distance between a deep network and its infinite-width Gaussian limit. This gives quantitative rates rather than just qualitative convergence under suitable conditions on the activation function. The approach works for general weight distributions in fully connected networks and builds the bound by replacing weights with Gaussians one layer at a time. This is new relative to the mean-field and NTK literature, which often stops at existence of the limit without explicit distances.

The paper does a straightforward job of stating the regularity assumptions up front and avoiding any circular reduction to quantities defined inside the proof. The strategy is direct and the citation pattern covers the standard references in this area without obvious omissions.

The soft spots are modest. The bounds require regularity on the activation, such as bounded derivatives or controlled growth, which rules out unsmoothed ReLU and similar non-smooth cases without extra work. Error terms will accumulate with depth, so the final constants may grow with the number of layers, though the paper does not claim optimality. A reader who already knows Lindeberg arguments will follow the steps without trouble, but the dependence on depth and the precise constants are not explored in detail.

This work is for researchers in probability and theoretical machine learning who study infinite-width limits and need rates for approximation or generalization results. It shows clear engagement with the problem and the existing tools. I would send it for peer review because the technical contribution is focused and the argument holds together on its own terms.

Referee Report

0 major / 3 minor

Summary. The manuscript proves quantitative bounds on the 2-Wasserstein distance between a finite-width fully connected deep neural network with general (non-Gaussian) weights and its infinite-width Gaussian limit. The argument proceeds by applying a Lindeberg exchange principle layer by layer, successively replacing the weights in each layer by independent Gaussians while controlling the accumulated error under suitable regularity assumptions on the activation function.

Significance. If the quantitative bounds hold with explicit dependence on width, depth, and activation regularity, the result supplies a flexible, non-asymptotic universality statement that strengthens existing qualitative Gaussian-limit theorems for wide networks. The layer-wise Lindeberg strategy is a clear technical strength, as it avoids mean-field or moment-matching reductions and directly yields Wasserstein control.

minor comments (3)

[Abstract and §1] The dependence of the final bound on network depth is not stated explicitly in the abstract or introduction; clarifying whether the error grows linearly, exponentially, or remains uniform in depth would strengthen the main theorem statement.
[§2] The regularity assumptions on the activation (e.g., Lipschitz constant, bounded third derivative) are invoked repeatedly but never collected in a single hypothesis list; a dedicated assumption block before the main theorem would improve readability.
[§4] No numerical illustration or simulation is provided to check the sharpness of the derived rates; even a small-scale Monte-Carlo comparison for a two-layer network would help readers assess practical relevance.
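A small-scale check of the kind suggested in the last comment might look like the sketch below. It is a toy under stated assumptions, not an experiment from the paper: a one-hidden-layer tanh network with Rademacher weights, a fixed input, and a reference Gaussian whose variance is this toy model's own classical CLT limit, computed by exact enumeration. The function names, widths, and sample size are hypothetical choices. The distance is a quantile-based estimate and carries Monte Carlo noise, so it says nothing definitive about the sharpness of the paper's rates.

```python
import itertools
from statistics import NormalDist

import numpy as np

def network_outputs(x, width, n_samples, rng):
    """Outputs of a one-hidden-layer tanh network with Rademacher weights."""
    d = x.shape[0]
    W = rng.choice([-1.0, 1.0], size=(n_samples, width, d))  # hidden weights
    v = rng.choice([-1.0, 1.0], size=(n_samples, width))     # output weights
    hidden = np.tanh(W @ x / np.sqrt(d))                     # shape (n_samples, width)
    return np.sum(v * hidden, axis=1) / np.sqrt(width)

def limit_variance(x):
    """E[tanh(S)^2] for S = <w, x>/sqrt(d), with w uniform on {-1, 1}^d,
    computed by enumerating all sign patterns (feasible for small d only)."""
    d = x.shape[0]
    signs = np.array(list(itertools.product([-1.0, 1.0], repeat=d)))
    return float(np.mean(np.tanh(signs @ x / np.sqrt(d)) ** 2))

def w2_to_gaussian(samples, sigma):
    """Approximate W2 between the empirical law of `samples` and N(0, sigma^2),
    matching sorted samples to Gaussian quantiles at midpoint levels."""
    n = samples.shape[0]
    gauss = NormalDist(0.0, sigma)
    quantiles = np.array([gauss.inv_cdf((i + 0.5) / n) for i in range(n)])
    return float(np.sqrt(np.mean((np.sort(samples) - quantiles) ** 2)))

rng = np.random.default_rng(0)
x = np.ones(4)
sigma = np.sqrt(limit_variance(x))
widths = [1, 4, 16, 64, 256]
distances = [w2_to_gaussian(network_outputs(x, n, 5000, rng), sigma)
             for n in widths]

for n, dist in zip(widths, distances):
    print(f"width {n:4d}: estimated W2 = {dist:.4f}")
```

At the largest widths the estimate is dominated by sampling error rather than by the true distance, so any comparison with a theoretical rate would need more samples or a debiased estimator.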

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of our manuscript and the recommendation for minor revision. We appreciate the recognition of the quantitative 2-Wasserstein bounds and the technical merits of the layer-wise Lindeberg exchange approach. As the report lists no specific major comments, we will incorporate minor revisions (such as any typographical corrections or minor clarifications) in the updated version.

Circularity Check

0 steps flagged

No significant circularity detected

The derivation applies the classical Lindeberg exchange principle in a layer-wise manner to obtain quantitative 2-Wasserstein bounds to the infinite-width Gaussian limit. This is a direct, first-principles use of a standard probabilistic tool under explicitly stated regularity conditions on the activation; no step reduces the target bound to a quantity defined by the paper itself, no self-citation is load-bearing for the central claim, and the argument does not rename or smuggle in prior results by construction. The chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on regularity assumptions on the activation function that are invoked but not specified in detail within the abstract; no free parameters or new entities are introduced.

axioms (1)

domain assumption: Regularity assumptions on the activation function
Required to apply the Lindeberg principle and obtain the stated Wasserstein bounds.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Cost.FunctionalEquation / Foundation.AlphaCoordinateFixation · washburn_uniqueness_aczel · unclear: the paper's regularity/rate is unrelated to J-cost or φ-calibration

Let $\sigma \in C_b^{3\cdot 2^{L-1}}(\mathbb{R})$ ... $W_2\bigl(z^{(L+1)}(x), Z^{(L+1)}(x)\bigr) \le C\Bigl(\frac{1}{\sqrt{n_L}} + \sum \frac{1}{\sqrt{n_k}}\Bigr)^{1/4}$
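To read off the width scaling from the quoted bound, one can specialize to equal widths. The short computation below is a sketch under two assumptions that the excerpt does not confirm: every width in the bound equals a common value $n$, and the sum has $m$ terms ($m$ is left symbolic because the index range is elided in the excerpt).

```latex
% Assumptions (illustrative): n_L = n_k = n for all k, and the sum has m terms.
C \Bigl( \frac{1}{\sqrt{n_L}} + \sum \frac{1}{\sqrt{n_k}} \Bigr)^{1/4}
  = C \Bigl( \frac{m + 1}{\sqrt{n}} \Bigr)^{1/4}
  = C \, (m + 1)^{1/4} \, n^{-1/8}
```

Under these assumptions the quoted bound decays like $n^{-1/8}$ in the common width. How $C$ depends on depth or on the activation is not visible from the excerpt.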

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
