
arXiv: 2605.24072 · v1 · pith:VQ74KQOYnew · submitted 2026-05-22 · stat.ML · cs.LG · math.PR

Optimal Non-Asymptotic Edgeworth Expansions for Multivariate Neural Network Outputs

Pith reviewed 2026-06-30 15:14 UTC · model grok-4.3

classification stat.ML · cs.LG · math.PR
keywords Edgeworth expansion · finite-width neural networks · total variation distance · Gaussian limit · cumulants · Bayesian posterior · non-asymptotic bounds · conditionally Gaussian vectors

The pith

Finite-width neural networks are approximated by Edgeworth expansions with total variation error of order n^{-m} in the width parameter n.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that finite-width fully connected neural networks with Gaussian-initialized weights deviate from their infinite-width Gaussian limit through non-vanishing higher-order cumulants. For a network evaluated at a finite number of inputs, multidimensional Edgeworth expansions of order 4m-1 approximate these deviations, and the total variation distance between the true output law and the approximation is bounded by a term of order n to the minus m. Matching lower bounds confirm the rate is sharp. The result requires the limiting Gaussian to have invertible covariance and a polynomially bounded activation function. It also applies more broadly to sequences of conditionally Gaussian vectors converging to such a Gaussian, and it quantifies the error incurred when an Edgeworth-expanded prior replaces the true prior in Bayesian posterior computations.

Core claim

Assuming that the corresponding Gaussian limit has an invertible covariance matrix and that the activation function is polynomially bounded, we establish a bound of order n^{-m} on the total variation distance between the law of the true network output and its Edgeworth approximation, with matching lower bounds.

What carries the argument

Multidimensional Edgeworth expansion of order 4m-1 applied to the finite collection of network outputs at fixed inputs.
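
For intuition, here is a one-dimensional sketch of why a conditionally Gaussian output has a fourth cumulant of order 1/n, and what the leading Edgeworth-type correction to the Gaussian density looks like. These are standard textbook computations in illustrative notation, not the paper's order 4m-1 expansion. The shallow model is an assumption made here: f_n(x) = n^{-1/2} \sum_j v_j \sigma(\langle w_j, x\rangle), with i.i.d. hidden weights w_j, i.i.d. standard Gaussian output weights v_j independent of them, and \sigma(\langle w_1, x\rangle) having a finite fourth moment (which a polynomially bounded activation guarantees under Gaussian weights).

```latex
% Conditionally Gaussian structure of the assumed shallow model
f_n(x) \mid (w_1,\dots,w_n) \sim \mathcal{N}(0, V_n), \qquad
V_n = \frac{1}{n}\sum_{j=1}^{n} \sigma(\langle w_j, x\rangle)^2 .

% Cumulants: odd ones vanish, the fourth is exactly of order 1/n
\kappa_2 = \mathbb{E}[V_n], \qquad \kappa_3 = 0, \qquad
\kappa_4 = \mathbb{E}[f_n^4] - 3\,\mathbb{E}[f_n^2]^2
         = 3\,\mathrm{Var}(V_n)
         = \frac{3}{n}\,\mathrm{Var}\!\big(\sigma(\langle w_1, x\rangle)^2\big).

% Leading correction to the Gaussian density \varphi_{\kappa_2} of variance \kappa_2
p_{f_n}(y) \approx \varphi_{\kappa_2}(y)
  \Big[1 + \frac{\kappa_4}{24\,\kappa_2^{2}}\,
  \mathrm{He}_4\!\big(y/\sqrt{\kappa_2}\big)\Big],
\qquad \mathrm{He}_4(u) = u^4 - 6u^2 + 3 .
```

The identity \mathbb{E}[f_n^4] = 3\,\mathbb{E}[V_n^2] follows by conditioning on the hidden weights and using the Gaussian fourth moment.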

If this is right

  • The approximation error between the network output distribution and its Edgeworth series is of order n to the minus m in total variation.
  • Matching lower bounds establish that no faster rate is possible in general under the stated conditions.
  • Replacing the network prior by its Edgeworth expansion produces a Bayesian posterior whose error is controlled at the same rate.
  • The same non-asymptotic bounds hold for any sequence of conditionally Gaussian random vectors that converge to a Gaussian with invertible covariance.
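
To see how a prior-level correction propagates to a posterior, the following toy sketch may help. It does not reproduce the paper's comparison (true prior versus its Edgeworth expansion); instead it compares a Gaussian prior with its leading Edgeworth-corrected version on a grid, for a scalar output. Everything here is an illustrative assumption: a shallow ReLU network at a unit-norm input (limiting variance 1/2, fourth cumulant 15/(4n), from the conditionally Gaussian identity kappa_4 = 3 Var(V)), a Gaussian likelihood, and the hypothetical names `posterior_tv_gap`, `y_obs` and `noise_var`.

```python
import numpy as np


def hermite_he4(u):
    """Probabilists' Hermite polynomial of degree 4."""
    return u**4 - 6.0 * u**2 + 3.0


def posterior_tv_gap(width, y_obs=1.0, noise_var=0.25):
    """TV distance between grid posteriors under a Gaussian prior and under
    its leading Edgeworth-corrected version (toy scalar model, assumed here)."""
    grid = np.linspace(-8.0, 8.0, 20001)
    dx = grid[1] - grid[0]
    k2 = 0.5                    # limiting variance E[relu(z)^2], z ~ N(0, 1)
    k4 = 15.0 / (4.0 * width)   # fourth cumulant 3 * Var(relu(z)^2) / width
    gauss = np.exp(-grid**2 / (2.0 * k2)) / np.sqrt(2.0 * np.pi * k2)
    correction = 1.0 + k4 / (24.0 * k2**2) * hermite_he4(grid / np.sqrt(k2))
    edgeworth = gauss * correction
    likelihood = np.exp(-(y_obs - grid) ** 2 / (2.0 * noise_var))
    # Unnormalized posteriors, then normalize with a Riemann sum on the grid.
    post_g = gauss * likelihood
    post_e = edgeworth * likelihood
    post_g /= post_g.sum() * dx
    post_e /= post_e.sum() * dx
    return 0.5 * np.abs(post_g - post_e).sum() * dx


if __name__ == "__main__":
    for width in (8, 16, 32, 64):
        print(f"width={width:3d}  posterior TV gap={posterior_tv_gap(width):.5f}")
```

In this toy setting the gap roughly halves each time the width doubles, which is the 1/n scaling one would expect from a correction proportional to the fourth cumulant. Widths of at least 4 keep the corrected prior non-negative in this model.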

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The result suggests that higher-order Edgeworth corrections could improve posterior calibration in Bayesian neural networks at modest extra cost when the limiting covariance is well-conditioned.
  • If the conditional-Gaussian property can be verified for other architectures, similar expansion rates may apply beyond fully connected layers.
  • The invertibility requirement implies that the finite set of evaluation points must span a space in which the limiting covariance is full rank, which may fail for highly correlated or low-dimensional inputs.
  • For concrete networks one could test whether increasing m yields the predicted improvement until cumulant estimation becomes the dominant error source.

Load-bearing premise

The Gaussian limit has an invertible covariance matrix.
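
A minimal sketch of how one might check this premise for a concrete set of inputs, assuming a shallow ReLU network with hidden weights w ~ N(0, I) and no biases (an assumption made here, not a setting taken from the paper). In that case the limiting covariance has the closed form of the degree-1 arc-cosine kernel, and its smallest eigenvalue indicates whether it is invertible. The function names are hypothetical, and inputs are assumed non-zero.

```python
import numpy as np


def relu_limit_covariance(inputs):
    """K[a, b] = E[relu(<w, x_a>) * relu(<w, x_b>)] for w ~ N(0, I).

    Closed form (degree-1 arc-cosine kernel):
    K = |x_a| |x_b| / (2 pi) * (sin(t) + (pi - t) * cos(t)),
    where t is the angle between x_a and x_b. Inputs must be non-zero.
    """
    x = np.asarray(inputs, dtype=float)
    norms = np.linalg.norm(x, axis=1)
    scale = np.outer(norms, norms)
    # Clip to guard against round-off pushing the cosine outside [-1, 1].
    cos_t = np.clip((x @ x.T) / scale, -1.0, 1.0)
    t = np.arccos(cos_t)
    return scale / (2.0 * np.pi) * (np.sin(t) + (np.pi - t) * cos_t)


def smallest_eigenvalue(cov):
    """Smallest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return float(np.linalg.eigvalsh(cov)[0])


if __name__ == "__main__":
    orthogonal = [[1.0, 0.0], [0.0, 1.0]]
    collinear = [[1.0, 0.0], [2.0, 0.0]]  # second input is a positive multiple
    print("orthogonal inputs:", smallest_eigenvalue(relu_limit_covariance(orthogonal)))
    print("collinear inputs: ", smallest_eigenvalue(relu_limit_covariance(collinear)))
```

Positively collinear inputs give a rank-one covariance in this model, so the premise fails for them; nearly collinear inputs give a small but positive smallest eigenvalue.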

What would settle it

Direct numerical computation of the total variation distance for a small fixed m, a concrete activation such as ReLU, and increasing network widths, checking whether the observed rate matches the stated upper and lower bounds of order n to the minus m.
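
Computing total variation distances numerically is delicate, so the sketch below checks a simpler proxy instead: the 1/n decay of the fourth cumulant that drives the leading correction. It is not the experiment described above. The toy model is an assumption made here: a shallow ReLU network evaluated at a single unit-norm input, f = n^{-1/2} \sum_j v_j relu(z_j) with i.i.d. standard Gaussian v_j and z_j. For this model the excess kurtosis is exactly 15/n, since E[relu(z)^2] = 1/2 and Var(relu(z)^2) = 5/4. The function name is hypothetical.

```python
import numpy as np


def excess_kurtosis_shallow_relu(width, num_samples, rng):
    """Monte Carlo excess kurtosis of a shallow ReLU network output.

    Assumed toy model: f = width**-0.5 * sum_j v_j * relu(z_j), with v_j and
    z_j i.i.d. N(0, 1); z_j plays the role of <w_j, x> at a unit-norm input.
    """
    z = rng.standard_normal((num_samples, width))  # hidden pre-activations
    v = rng.standard_normal((num_samples, width))  # output-layer weights
    f = (v * np.maximum(z, 0.0)).sum(axis=1) / np.sqrt(width)
    m2 = np.mean(f**2)  # f has mean zero, so raw moments suffice
    m4 = np.mean(f**4)
    return m4 / m2**2 - 3.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    for width in (4, 16, 32):
        est = excess_kurtosis_shallow_relu(width, 100_000, rng)
        print(f"width={width:3d}  estimated={est:.3f}  predicted={15.0 / width:.3f}")
```

Monte Carlo noise in the fourth-moment estimate does not shrink with the width while the target 15/n does, so the sample size has to grow as the width increases to keep the relative error fixed.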

Figures

Figures reproduced from arXiv: 2605.24072 by Lucia Celli.

Figure 1: Comparison between the estimated density of a neural network output (NN) as in [...]
Figure 2: The figure displays the signed pointwise error between the Monte Carlo kernel density estimate (KDE) of the output distribution of a shallow neural network [...]

Original abstract

Finite-width fully connected neural networks with Gaussian-initialized weights deviate from their infinite-width Gaussian limit, exhibiting non-vanishing higher-order cumulants. We approximate these deviations, for a neural network evaluated in a finite number of inputs, using multidimensional Edgeworth expansions of arbitrary order $4m-1$, with $m\in\mathbb{N}$. Assuming that the corresponding Gaussian limit has an invertible covariance matrix and that the activation function is polynomially bounded, we establish a bound of order $n^{-m}$ on the total variation distance between the law of the true network output and its Edgeworth approximation, with matching lower bounds. As an application, we quantify the error in Bayesian posterior distributions when the prior is replaced by its Edgeworth expansion. Our results are more general and also apply to sequences of conditionally Gaussian vectors converging to a Gaussian vector with invertible covariance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper establishes non-asymptotic bounds on the total variation distance between the distribution of outputs from finite-width fully connected neural networks (with Gaussian-initialized weights) and their multidimensional Edgeworth approximations of order 4m-1. Under the assumptions that the limiting Gaussian has invertible covariance and the activation is polynomially bounded, it proves an O(n^{-m}) upper bound with matching lower bounds. The results extend to conditionally Gaussian sequences converging to a Gaussian with invertible covariance and include an application quantifying the error when replacing a prior with its Edgeworth expansion in Bayesian posterior computations.

Significance. If the technical results hold, the work supplies the first optimal non-asymptotic Edgeworth rates tailored to neural-network outputs, together with matching lower bounds that establish sharpness. The explicit assumptions (invertible limiting covariance, polynomial activation bound) are stated clearly in the abstract, and the generalization to conditionally Gaussian vectors plus the Bayesian-posterior application add concrete utility. These elements strengthen the contribution beyond standard Edgeworth theory.

minor comments (3)
  1. The abstract states the order 4m-1 and the n^{-m} rate but does not indicate where in the manuscript the precise statement of the main theorem (including the dependence on m) appears; adding an explicit theorem label in the abstract would improve readability.
  2. The extension to conditionally Gaussian sequences is mentioned only briefly; a short dedicated subsection or remark clarifying how the NN case is recovered as a special instance would help readers trace the argument.
  3. Notation for the total-variation distance and the Edgeworth polynomial is introduced without an early reference to the precise definition used in the proofs; a short notational table or paragraph at the end of the introduction would reduce ambiguity.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation of minor revision. The provided summary accurately reflects the paper's contributions on non-asymptotic Edgeworth expansions for finite neural network outputs, including the O(n^{-m}) bounds with matching lower bounds under the stated assumptions, the extension to conditionally Gaussian sequences, and the Bayesian application.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives an O(n^{-m}) TV bound for Edgeworth approximations of order 4m-1 applied to finite-width NN outputs (and conditionally Gaussian sequences) under the explicit assumptions of invertible limiting covariance and polynomially bounded activations. The central result rests on standard Edgeworth expansion techniques plus these assumptions, with matching lower bounds stated directly; no step reduces by construction to a fitted input, self-definition, or load-bearing self-citation chain. The abstract and claim structure are independent of any internal renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on two domain assumptions stated in the abstract plus standard results from probability theory on Edgeworth expansions and total variation; no free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: the Gaussian limit has an invertible covariance matrix.
    Explicitly required in the abstract for the bound to hold.
  • Domain assumption: the activation function is polynomially bounded.
    Explicitly required in the abstract to control moments.

pith-pipeline@v0.9.1-grok · 5669 in / 1365 out tokens · 30490 ms · 2026-06-30T15:14:33.277192+00:00 · methodology

