Bayesian Inference with Shaped Deep Non-linear MLPs

Boris Hanin; Tianze Jiang

arXiv: 2605.30860 · v1 · pith:3E6SSU2Xnew · submitted 2026-05-29 · math.ST · cs.LG · math.PR · stat.TH

Pith reviewed 2026-06-28 20:39 UTC · model grok-4.3

classification: math.ST · cs.LG · math.PR · stat.TH

keywords: Bayesian inference · deep MLPs · model evidence · data-dependent kernels · Neural Covariance SDE · large width limits · effective depth

The pith

To first order in LP/N, Bayesian inference in deep non-linear MLPs reduces to a data-dependent kernel method, with a criterion for when depth raises model evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes Bayesian inference for deep MLPs in the regime where the training set size P, width N, layer count L, and input dimension N_0 may all be large while LP/N remains of order one. Using the Neural Covariance SDE, it derives that, to first order in LP/N, the predictive posterior matches the posterior of a kernel method whose kernel is fixed by the training data. It also identifies a simple condition on the data-generating process under which raising LP/N increases the Bayesian model evidence. This matters because the large-model and large-data limits do not commute, so this intermediate scaling clarifies when depth helps Bayesian performance without solving the full nonlinear dynamics.

Core claim

In the regime LP/N = Θ(1), the Bayesian predictive posterior of deep non-linear MLPs is, to first order in LP/N, equivalent to that of a data-dependent kernel method. A criterion on the data-generating process determines whether increasing LP/N raises the Bayesian model evidence. The results cover smooth and ReLU activations at arbitrary temperature.
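To make "equivalent to that of a data-dependent kernel method" concrete, here is a minimal sketch of the predictive posterior of generic kernel (Gaussian process) regression. It is not the paper's construction: the RBF kernel is a stand-in for the data-dependent kernel, which this summary does not specify, and treating the temperature as an additive noise variance on the Gram matrix is an assumption. The names `kernel_posterior` and `rbf` are hypothetical.

```python
import numpy as np


def rbf(A, B, lengthscale=1.0):
    """Placeholder RBF kernel between rows of A and rows of B (an assumption)."""
    sq_dist = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq_dist / lengthscale**2)


def kernel_posterior(K_train, K_cross, K_test, y, temperature):
    """Gaussian predictive posterior of kernel regression.

    K_train: (P, P) kernel on the training inputs
    K_cross: (M, P) kernel between test and training inputs
    K_test:  (M, M) kernel on the test inputs
    temperature: noise variance added to the diagonal (assumed role)
    """
    P = K_train.shape[0]
    # Cholesky factor of the regularised Gram matrix K + T * I
    chol = np.linalg.cholesky(K_train + temperature * np.eye(P))
    # alpha = (K + T * I)^{-1} y via two triangular systems
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y))
    mean = K_cross @ alpha
    # Posterior covariance: K_test - K_cross (K + T * I)^{-1} K_cross^T
    v = np.linalg.solve(chol, K_cross.T)
    cov = K_test - v.T @ v
    return mean, cov
```

Swapping `rbf` for a kernel built from the training data would change only the three Gram matrices passed in; the posterior formulas stay the same.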

What carries the argument

The first-order expansion in the effective depth LP/N of the Neural Covariance SDE, which governs the layer-by-layer evolution of activation covariances.
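The object the SDE describes can also be estimated by direct simulation. The sketch below propagates P inputs through a randomly initialized MLP and records the empirical activation covariance at each layer, following the forward pass recalled in the paper's appendix excerpt ($z^{\ell+1} = \sqrt{c/N}\, W^\ell \phi_s(z^\ell)$). The specific shaping, a leaky ReLU with slopes $1 + c_\pm/\sqrt{N}$, and the choice of $c$ so that $c\,\mathbb{E}[\phi_s(g)^2] = 1$ for standard Gaussian $g$, are assumptions modeled on Li et al. (2022) and are not stated on this page. `layer_covariances` is a hypothetical name.

```python
import numpy as np


def layer_covariances(X, width, depth, c_plus=0.0, c_minus=-1.0, seed=0):
    """Empirical activation covariances of a shaped MLP at initialization.

    X: (P, N0) array of inputs. Returns an array of shape (depth, P, P).
    """
    rng = np.random.default_rng(seed)
    n_inputs, n0 = X.shape
    # Assumed shaping: slopes approach 1 as the width grows
    s_plus = 1.0 + c_plus / np.sqrt(width)
    s_minus = 1.0 + c_minus / np.sqrt(width)
    # For g ~ N(0, 1): E[phi_s(g)^2] = (s_plus^2 + s_minus^2) / 2
    c = 2.0 / (s_plus**2 + s_minus**2)

    # First layer: z^1 = W^0 x / sqrt(N0), with i.i.d. Gaussian weights
    z = X @ rng.standard_normal((n0, width)) / np.sqrt(n0)
    grams = []
    for _ in range(depth):
        phi = s_plus * np.maximum(z, 0.0) + s_minus * np.minimum(z, 0.0)
        # Normalised Gram matrix of the post-activations at this layer
        grams.append(c * (phi @ phi.T) / width)
        # Next pre-activations: z^{l+1} = sqrt(c / N) W^l phi^l
        z = np.sqrt(c / width) * (phi @ rng.standard_normal((width, width)))
    return np.stack(grams)
```

Averaging these matrices over many seeds, at several depths and widths with the depth-to-width ratio held fixed, gives an empirical baseline against which an SDE prediction could be compared.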

Load-bearing premise

The Neural Covariance SDE framework continues to describe network behavior accurately when P, N, and L grow large together with LP/N held at order one.

What would settle it

Compute the Bayesian model evidence numerically for MLPs of several depths at fixed LP/N and check whether the change with depth matches the sign predicted by the data criterion.
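A sketch of the bookkeeping such an experiment would need, under stated assumptions. `widths_at_fixed_ratio` (a hypothetical helper) picks a width N for each depth L so that LP/N stays at a chosen value, and `log_evidence` evaluates the Gaussian log marginal likelihood $\log \mathcal{N}(y \mid 0, K + T I)$ for a given kernel matrix K. The latter is the evidence of a kernel model with the temperature T treated as a noise variance, so it is only a proxy: the evidence of a finite-width MLP has no such closed form and would have to be estimated, for example by sampling.

```python
import numpy as np


def widths_at_fixed_ratio(P, ratio, depths):
    """Map each depth L to a width N with L * P / N equal to `ratio` (rounded)."""
    return {L: max(1, round(L * P / ratio)) for L in depths}


def log_evidence(K, y, temperature):
    """Log marginal likelihood log N(y | 0, K + temperature * I)."""
    P = y.shape[0]
    chol = np.linalg.cholesky(K + temperature * np.eye(P))
    # Solve chol @ w = y, so that w @ w = y^T (K + T I)^{-1} y
    w = np.linalg.solve(chol, y)
    # log det(K + T I) = 2 * sum(log(diag(chol)))
    log_det_half = np.log(np.diag(chol)).sum()
    return -0.5 * (w @ w) - log_det_half - 0.5 * P * np.log(2.0 * np.pi)
```

Comparing `log_evidence` across kernels obtained at different depths, with widths chosen by `widths_at_fixed_ratio`, would give the sign of the change with depth at the kernel level.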

Original abstract

A central aim of deep learning theory is to characterize how neural networks make predictions in the regime of simultaneously large model and training set size. Since the limits of diverging number of model parameters and dataset size do not commute it is not clear a priori what limits exist. In this work, we shed new light on these questions by studying Bayesian inference in deep non-linear MLPs in the regime where the number of training samples ($P$), the input dimension ($N_0$), the hidden layer width ($N$), and the number of hidden layers ($L$) can all be large. We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where $LP/N\in\Theta(1)$, playing the role of an effective network depth. Our framework covers both smooth and ReLU activation functions and applies to arbitrary temperature. We find to first order in $LP/N$ a simple criterion for which data generating processes benefit from depth in the sense that larger $LP/N$ increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in $LP/N$, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper extends the Neural Covariance SDE to a joint L,P,N limit with LP/N fixed and extracts a first-order criterion for when depth raises Bayesian evidence plus a kernel equivalence for the posterior.

The core contribution is a first-order analysis in the effective depth parameter LP/N. For certain data-generating processes, increasing this parameter improves the model evidence, and the predictive posterior reduces to that of a data-dependent kernel method. Both results are derived under the Neural Covariance SDE framework and apply to smooth or ReLU activations at any temperature.
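Schematically, and in notation that is illustrative rather than the paper's, "first order in LP/N" can be read as a Taylor expansion of the log evidence $\log Z$ in the effective depth $\lambda = LP/N$:

```latex
\log Z(\lambda) \;=\; \log Z(0) \;+\; \lambda \,\partial_\lambda \log Z(\lambda)\Big|_{\lambda = 0} \;+\; O(\lambda^2),
\qquad \lambda = \frac{LP}{N}.
```

Under this reading, the depth criterion amounts to the sign of the first-order coefficient: when it is positive, a small increase in LP/N raises the evidence. Whether the paper organizes its expansion in exactly this form is an assumption.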

The paper does a clean job of setting up the non-commuting limits and stating the regime explicitly. Treating LP/N as the controlling parameter organizes the role of depth in a way that is easy to state. The derivation of the kernel equivalence is presented as new even though the end result echoes earlier physics work.

The main limitation is that the analysis assumes the Neural Covariance SDE from Li et al. 2022 continues to hold without correction when L, P, and N all diverge while LP/N stays of order one. No fresh error bound or convergence argument for this joint limit appears. If O(1) corrections or non-perturbative terms enter at that scaling, both the sign of the depth correction to the evidence and the claimed equivalence could shift. The first-order truncation is acknowledged, but the paper does not supply checks for when it breaks.

This is for theorists already working with SDE or infinite-width methods in Bayesian deep learning. Readers who want a concrete handle on when depth helps evidence will find the criterion useful. The work is coherent on its own terms and shows clear engagement with the literature, so it merits referee time even if the limit justification needs tightening.

Referee Report

1 major / 2 minor

Summary. The paper analyzes Bayesian inference for deep non-linear MLPs (smooth and ReLU activations) in the joint large-P, large-N, large-L regime with LP/N = Θ(1) as an effective depth parameter. Building on the Neural Covariance SDE of Li et al. (2022), it derives to first order in LP/N a criterion identifying data-generating processes for which increasing LP/N raises the Bayesian model evidence, and shows that the predictive posterior is equivalent to that of a data-dependent kernel method. The framework applies at arbitrary temperature.

Significance. If the first-order results hold, the work supplies a concrete, testable criterion for when depth improves Bayesian evidence and a simplification of the predictive posterior to kernel form, with a novel derivation of a prior physics result. The extension to ReLU activations and general temperature is a strength. The analysis is grounded in an explicit small-parameter expansion rather than heuristic limits.

major comments (1)

[Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.

minor comments (2)

[Introduction] Define the effective depth parameter LP/N explicitly in the first paragraph of the introduction rather than deferring to the abstract.
[Main derivation] Clarify whether the first-order truncation is uniform in the activation function (smooth vs. ReLU) or requires separate error estimates.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We respond point-by-point to the major comment below.

Referee: [Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.

Authors: We appreciate the referee pointing out that the joint limit with LP/N=Θ(1) is invoked by direct application of the Neural Covariance SDE from Li et al. (2022) without a fresh convergence proof or explicit error bound in the present manuscript. The setup and O(LP/N) expansion indeed rely on the SDE limit established in the cited work (under large-width assumptions) and then perform a perturbative expansion treating LP/N as the small parameter. We agree this assumption is load-bearing: an uncontrolled O(1) correction would invalidate the leading-order sign of the depth correction to the evidence and the kernel equivalence. Our contribution is the first-order perturbative analysis rather than a new rigorous limit theorem. To address the concern we will revise the manuscript to (i) restate the precise assumptions inherited from Li et al., (ii) explicitly note that all claims are to first order in LP/N with higher-order terms neglected, and (iii) add a short remark that a full joint-limit convergence analysis remains open. This makes the scope of the results transparent without altering the derivations.

Revision: partial

Circularity Check

1 step flagged

Central claims depend on unverified extension of Neural Covariance SDE to joint L,P,N limit with LP/N=Θ(1)

specific steps

Self-citation, load-bearing [Abstract]
"We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where LP/N∈Θ(1), playing the role of an effective network depth. ... We find to first order in LP/N a simple criterion for which data generating processes benefit from depth in the sense that larger LP/N increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in LP/N, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method."

The criterion and kernel equivalence are obtained by extending the cited SDE to the joint limit with LP/N fixed at order 1. No new justification is provided for why the SDE approximation holds without O(1) corrections when L scales as N/P; the results therefore reduce to the prior framework's validity in this regime.

full rationale

The paper's strongest claims—a criterion for depth benefiting model evidence and equivalence of the predictive posterior to a data-dependent kernel—are derived to first order in LP/N by invoking the Neural Covariance SDE framework. The analysis explicitly assumes this SDE remains valid in the simultaneous large-P, N, L regime without supplying an independent convergence argument or error bound for the joint limit. This matches the moderate circularity pattern of a load-bearing self-citation (or prior framework by overlapping authors) whose applicability to the new scaling is taken as given rather than re-derived or bounded. The central claims retain some independent content in the first-order expansion and the novel derivation step, preventing a higher score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond the assumed validity of the Neural Covariance SDE and the perturbative regime.

pith-pipeline@v0.9.1-grok · 5759 in / 1275 out tokens · 26571 ms · 2026-06-28T20:39:54.618004+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 19 canonical work pages · 8 internal anchors

[1]

High-dimensional dynamics of generalization error in neural networks

arXiv:1710.03667 [stat.ML]. url: https://arxiv.org/abs/1710.03667. [Bas+25] Federico Bassetti et al. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers
[2]

Reconciling modern machine-learning practice and the classical bias–variance trade-off

arXiv:2406.03260 [stat.ML]. url: https://arxiv.org/abs/2406.03260. [Bel+19] Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off”. In: Proceedings of the National Academy of Sciences 116.32 (July 2019), pp. 15849–15854. issn: 1091-6490. doi: 10.1073/pnas.1903070116. url: http://dx.doi.org/10.1073/pnas....

doi:10.1073/pnas.1903070116 · 2019
[3]

arXiv:2411.15267 [stat.ML]

arXiv:2411.15267 [stat.ML]. url: https://arxiv.org/abs/2411.15267. [Blu+15] Charles Blundell et al. Weight Uncertainty in Neural Networks
[4]

Weight Uncertainty in Neural Networks

arXiv:1505.05424 [stat.ML]. url: https://arxiv.org/abs/1505.05424. [BP22] Lucas Benigni and Sandrine Péché. Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks
[5]

Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime

arXiv:2201.04753 [math.PR]. url: https://arxiv.org/abs/2201.04753. [Cam+25] Francesco Camilli et al. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime
[6]

Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks

arXiv:2505.03577 [math.ST]. url: https://arxiv.org/abs/2505.03577. [Cho23] Clément Chouard. “Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks”. In: arXiv preprint arXiv:2306.05850 (2023). [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach. “On lazy training in differentiable programming”. In: Advances...

2023
[7]

Nonlinear Approximation and (Deep) ReLU Networks

arXiv:1905.02199 [cs.LG]. url: https://arxiv.org/abs/1905.02199. [Du+19] Simon S. Du et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks

1905
[8]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

arXiv:1810.02054 [cs.LG]. url: https://arxiv.org/abs/1810.02054. [El 10] Noureddine El Karoui. “The spectrum of kernel random matrices”. In: (2010). [FW20] Zhou Fan and Zhichao Wang. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle et al. Vol

2010
[9]

pp. 7710–7721. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf

Curran Associates, Inc., 2020, pp. 7710–7721. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf. [GG16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

2020
[10]

Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

arXiv:1506.02142 [stat.ML]. url: https://arxiv.org/abs/1506.02142. [Han19] Boris Hanin. “Universal function approximation by deep neural nets with bounded width and relu activations”. In: Mathematics 7.10 (2019), p

2019
[11]

Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies

[Han24] Boris Hanin. “Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies”. In: Journal of Machine Learning Research 25.267 (2024), pp. 1–58. url: http://jmlr.org/papers/v25/23-0643.html. [HDR19] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. “Training dynamics of deep networks using stochastic gradient descent via neural tan...

2024
[12]

Cedzich, J

arXiv:2503.07872 [math.PR]. url: https://arxiv.org/abs/2503.07872. [HN19a] Boris Hanin and Mihai Nica. “Finite depth and width corrections to the neural tangent kernel”. In: arXiv preprint arXiv:1909.05989 (2019). [HN19b] Boris Hanin and Mihai Nica. “Products of Many Large Random Matrices and Gradients in Deep Neural Networks”. In: Communications in Mathemat-...

doi:10.1007/s00220- 1909
[13]

Neural tangent kernel: Convergence and generalization in neural networks

arXiv:1806.07572 [cs.LG]. url: https://arxiv.org/abs/1806.07572. [Lee+17] Jaehoon Lee et al. “Deep neural networks as gaussian processes”. In: arXiv preprint arXiv:1711.00165 (2017). [Li+26] Mufan Li et al. Geometric Dyson Brownian Motions and the Free Log-Normal Limit for a Non-Square Product of Random Matrices

2017
[14]

The neural covariance SDE: Shaped infinite depth-and-width networks at initialization

arXiv:2310.12079 [stat.ML]. url: https://arxiv.org/abs/2310.12079. [LNR22] Mufan Li, Mihai Nica, and Dan Roy. “The neural covariance SDE: Shaped infinite depth-and-width networks at initialization”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 10795–10808. [LS21] Qianyi Li and Haim Sompolinsky. “Statistical Mechanics of Deep Line...

doi:10.1103/physrevx.11.031059 · 2022
[15]

A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit

arXiv:2306.17759 [stat.ML]. url: https://arxiv.org/abs/2306.17759. [Pac+23] R. Pacelli et al. “A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit”. In: Nature Machine Intelligence 5.12 (Dec. 2023), pp. 1497–1507. issn: 2522-5839. doi: 10.1038/s42256-023-00767-6. url: http://dx.doi.org/10.1038/s42256-023-...

doi:10.1038/s42256-023-00767-6 · 2023
[16]

Exponential expressivity in deep neural networks through transient chaos

arXiv:1606.05340 [stat.ML]. url: https://arxiv.org/abs/1606.05340. [PSW13] Nicholas G Polson, James G Scott, and Jesse Windle. “Bayesian inference for logistic models using Pólya–Gamma latent variables”. In: Journal of the American Statistical Association 108.504 (2013), pp. 1339–1349. [PW17] Jeffrey Pennington and Pratik Worah. “Nonlinear random matrix ...

2013
[17]

Deep Information Propagation

arXiv:1611.01232 [stat.ML]. url: https://arxiv.org/abs/1611.01232. [SCS88] H. Sompolinsky, A. Crisanti, and H. J. Sommers. “Chaos in Random Neural Networks”. In: Phys. Rev. Lett. 61 (3 July 1988), pp. 259–262. doi: 10.1103/PhysRevLett.61.259. url: https://link.aps.org/doi/10.1103/PhysRevLett.61.259. [SNR23] Inbar Seroussi, Gadi Naveh, and Zohar Ringel. “Sepa...

1988
[18]

Mean field analysis of neural networks: A law of large numbers

[SS20] Justin Sirignano and Konstantinos Spiliopoulos. “Mean field analysis of neural networks: A law of large numbers”. In: SIAM Journal on Applied Mathematics 80.2 (2020), pp. 725–752. [Tre23] Dario Trevisan. Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes

2020
[19]

Computing with Infinite Networks

arXiv:2312.11737 [math.ST]. url: https://arxiv.org/abs/2312.11737. [Wil96] Christopher Williams. “Computing with Infinite Networks”. In: Advances in Neural Information Processing Systems. Ed. by M.C. Mozer, M. Jordan, and T. Petsche. Vol
[20]

url: https://proceedings.neurips.cc/paper_files/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf

MIT Press, 1996. url: https://proceedings.neurips.cc/paper_files/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf. [WWF24] Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks

1996
[21]

Tensor Programs III: Neural Matrix Laws

arXiv:2402.10127 [stat.ML]. url: https://arxiv.org/abs/2402.10127. [Yan21] Greg Yang. Tensor Programs III: Neural Matrix Laws
[22]

Deep learning without shortcuts: Shaping the kernel with tailored rectifiers

arXiv:2009.10685 [cs.NE]. url: https://arxiv.org/abs/2009.10685. [ZBM22] Guodong Zhang, Aleksandar Botev, and James Martens. “Deep learning without shortcuts: Shaping the kernel with tailored rectifiers”. In: arXiv preprint arXiv:2203.08120 (2022). [Zha+17] Chiyuan Zhang et al. Understanding deep learning requires rethinking generalization

2009
[23]

Understanding deep learning requires rethinking generalization

arXiv:1611.03530 [cs.LG]. url: https://arxiv.org/abs/1611.03530.

A Computing the prior: Neural Covariance SDE. Let us first lay out the foundations of NSDE as we need. Recall again the forward pass (1): $z^{1} = \frac{1}{\sqrt{N_0}} W^{0} x$, $\phi^{\ell} = \phi_s(z^{\ell})$, $z^{\ell+1} = \sqrt{\frac{c}{N}}\, W^{\ell} \phi^{\ell}$, $y = z^{\mathrm{out}} = \sqrt{\frac{c}{N}}\, W^{\mathrm{out}} \phi^{L} \in \mathbb{R}$. Consider network weights at initialization with shaped activation $\phi_s$ in (1) on ...
