arXiv: 2605.30860 · v1 · pith:3E6SSU2Xnew · submitted 2026-05-29 · 🧮 math.ST · cs.LG · math.PR · stat.TH

Bayesian Inference with Shaped Deep Non-linear MLPs

Pith reviewed 2026-06-28 20:39 UTC · model grok-4.3

classification 🧮 math.ST · cs.LG · math.PR · stat.TH
keywords Bayesian inference · deep MLPs · model evidence · data-dependent kernels · Neural Covariance SDE · large width limits · effective depth

The pith

To first order in LP/N, Bayesian inference in deep non-linear MLPs reduces to a data-dependent kernel method, with a criterion for when depth raises model evidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes Bayesian inference for deep MLPs in the regime where training set size P, width N, layer count L, and input dimension are all large but LP/N remains order one. Using the Neural Covariance SDE, it derives that the predictive posterior matches the posterior of a kernel method whose kernel is fixed by the training data. It also identifies a simple condition on the data-generating process under which raising LP/N increases the Bayesian model evidence. A reader would care because the large-model and large-data limits fail to commute, so this intermediate scaling clarifies when depth helps Bayesian performance without solving the full nonlinear dynamics.

Core claim

In the regime where LP/N = Θ(1), the Bayesian predictive posterior of deep non-linear MLPs is, to first order in LP/N, equivalent to that of a data-dependent kernel method. A criterion on the data-generating process determines whether increasing LP/N raises the Bayesian model evidence. The results cover smooth and ReLU activations at arbitrary temperature.
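For readers who want the two kernel-method quantities in concrete form, here is a minimal sketch of a predictive posterior and a log model evidence for a generic kernel. It is not the paper's construction: `linear_kernel` is a placeholder rather than the data-dependent kernel the paper derives, the function names are hypothetical, and treating the `noise` variance as a stand-in for the temperature is an assumption.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(s)
            else:
                l[i][j] = s / l[j][j]
    return l


def solve_cholesky(l, b):
    """Solve (L L^T) x = b by forward, then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(l[i][k] * z[k] for k in range(i))) / l[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(l[k][i] * x[k] for k in range(i + 1, n))) / l[i][i]
    return x


def kernel_posterior(kernel, xs, ys, x_new, noise):
    """Posterior mean and variance at x_new, plus the log evidence of (xs, ys)."""
    p = len(xs)
    # Gram matrix K + noise * I on the training inputs.
    gram = [[kernel(xs[i], xs[j]) + (noise if i == j else 0.0) for j in range(p)]
            for i in range(p)]
    l = cholesky(gram)
    alpha = solve_cholesky(l, ys)                      # (K + noise I)^{-1} y
    k_new = [kernel(x, x_new) for x in xs]
    mean = sum(k * a for k, a in zip(k_new, alpha))
    v = solve_cholesky(l, k_new)                       # (K + noise I)^{-1} k_new
    var = kernel(x_new, x_new) - sum(k * w for k, w in zip(k_new, v))
    log_det = 2.0 * sum(math.log(l[i][i]) for i in range(p))
    log_evidence = (-0.5 * sum(y * a for y, a in zip(ys, alpha))
                    - 0.5 * log_det
                    - 0.5 * p * math.log(2.0 * math.pi))
    return mean, var, log_evidence


def linear_kernel(x, y):
    """Placeholder kernel (assumption): normalized inner product."""
    return sum(a * b for a, b in zip(x, y)) / len(x)
```

Comparing `log_evidence` across two candidate kernels on the same data is the kind of comparison the depth criterion concerns, although the kernels that would enter such a comparison are the paper's, not the placeholder used here.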

What carries the argument

The first-order expansion in the effective depth LP/N of the Neural Covariance SDE, which governs the layer-by-layer evolution of activation covariances.
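To make "layer-by-layer evolution of activation covariances" concrete, the sketch below samples one shaped MLP at initialization and returns the empirical covariance of its last hidden pre-activations. The recursion follows the forward pass quoted in the appendix fragment at the end of this page. Everything else is an assumption made for illustration: the shaped-ReLU form with slopes 1 + c±/√N, the choice of c so that the activation has unit second moment under a standard Gaussian, the default values of `c_plus` and `c_minus`, and all function names.

```python
import math
import random


def shape_factors(c_plus, c_minus, width):
    """Slopes s_+ and s_- of the shaped ReLU (assumed form: 1 + c / sqrt(width))."""
    root = math.sqrt(width)
    return 1.0 + c_plus / root, 1.0 + c_minus / root


def shaped_relu(x, s_plus, s_minus):
    """Piecewise-linear activation: slope s_plus for x > 0, slope s_minus for x < 0."""
    return s_plus * max(x, 0.0) + s_minus * min(x, 0.0)


def normalizing_constant(s_plus, s_minus):
    """c with 1/c = E[phi_s(g)^2] for standard Gaussian g, i.e. (s_+^2 + s_-^2) / 2."""
    return 2.0 / (s_plus ** 2 + s_minus ** 2)


def preactivation_covariance(inputs, depth, width, c_plus=0.0, c_minus=-1.0, seed=0):
    """Sample one network and return (1/N) <z^L_a, z^L_b> for all input pairs a, b."""
    rng = random.Random(seed)
    n0 = len(inputs[0])
    s_plus, s_minus = shape_factors(c_plus, c_minus, width)
    c = normalizing_constant(s_plus, s_minus)

    # First layer: z^1 = W^0 x / sqrt(N_0), with i.i.d. standard Gaussian weights.
    w0 = [[rng.gauss(0.0, 1.0) for _ in range(n0)] for _ in range(width)]
    zs = [[sum(w * xi for w, xi in zip(row, x)) / math.sqrt(n0) for row in w0]
          for x in inputs]

    # Hidden layers: z^{l+1} = sqrt(c / N) W^l phi_s(z^l).
    scale = math.sqrt(c / width)
    for _ in range(depth - 1):
        w = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(width)]
        phis = [[shaped_relu(z, s_plus, s_minus) for z in zvec] for zvec in zs]
        zs = [[scale * sum(wij * p for wij, p in zip(row, phi)) for row in w]
              for phi in phis]

    return [[sum(a * b for a, b in zip(za, zb)) / width for zb in zs] for za in zs]
```

Repeating the call over many seeds at a fixed ratio of depth to width gives a histogram of covariance entries, which is the finite-size object an SDE limit of this kind is meant to approximate.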

Load-bearing premise

The Neural Covariance SDE framework continues to describe network behavior accurately when P, N, and L grow large together with LP/N held at order one.

What would settle it

Compute the Bayesian model evidence numerically for MLPs of several depths at fixed LP/N and check whether the change with depth matches the sign predicted by the data criterion.
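Setting up such an experiment requires choosing widths so that the effective depth stays at a chosen value as the layer count changes. The helper below is a small sketch of that bookkeeping only; it does not compute any evidence, and the function names and the rounding rule are assumptions.

```python
def width_for_ratio(depth, num_samples, ratio):
    """Width N for which depth * num_samples / N is as close as possible to ratio."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return max(1, round(depth * num_samples / ratio))


def sweep(depths, num_samples, ratio):
    """One configuration per depth, each reporting the LP/N it actually achieves."""
    rows = []
    for depth in depths:
        width = width_for_ratio(depth, num_samples, ratio)
        rows.append({"L": depth, "N": width, "LP/N": depth * num_samples / width})
    return rows
```

Because N must be an integer, the achieved ratio can differ slightly from the target, so each row reports it explicitly.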

Original abstract

A central aim of deep learning theory is to characterize how neural networks make predictions in the regime of simultaneously large model and training set size. Since the limits of diverging number of model parameters and dataset size do not commute it is not clear a priori what limits exist. In this work, we shed new light on these questions by studying Bayesian inference in deep non-linear MLPs in the regime where the number of training samples ($P$), the input dimension ($N_0$), the hidden layer width ($N$), and the number of hidden layers ($L$) can all be large. We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where $LP/N\in\Theta(1)$, playing the role of an effective network depth. Our framework covers both smooth and ReLU activation functions and applies to arbitrary temperature. We find to first order in $LP/N$ a simple criterion for which data generating processes benefit from depth in the sense that larger $LP/N$ increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in $LP/N$, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper analyzes Bayesian inference for deep non-linear MLPs (smooth and ReLU activations) in the joint large-P, large-N, large-L regime with LP/N = Θ(1) as an effective depth parameter. Building on the Neural Covariance SDE of Li et al. (2022), it derives to first order in LP/N a criterion identifying data-generating processes for which increasing LP/N raises the Bayesian model evidence, and shows that the predictive posterior is equivalent to that of a data-dependent kernel method. The framework applies at arbitrary temperature.

Significance. If the first-order results hold, the work supplies a concrete, testable criterion for when depth improves Bayesian evidence and a simplification of the predictive posterior to kernel form, with a novel derivation of a prior physics result. The extension to ReLU activations and general temperature is a strength. The analysis is grounded in an explicit small-parameter expansion rather than heuristic limits.

major comments (1)
  1. [Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.
minor comments (2)
  1. [Introduction] Define the effective depth parameter LP/N explicitly in the first paragraph of the introduction rather than deferring to the abstract.
  2. [Main derivation] Clarify whether the first-order truncation is uniform in the activation function (smooth vs. ReLU) or requires separate error estimates.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on the manuscript. We respond point-by-point to the major comment below.

Point-by-point responses
  1. Referee: [Introduction / Neural Covariance SDE application] The extension of the Neural Covariance SDE (Li et al., 2022) to the joint L,P,N→∞ limit with LP/N=Θ(1) is invoked without a new convergence argument or error bound (see the setup paragraph citing Li et al. and the subsequent derivation of the O(LP/N) expansion). This assumption is load-bearing for both the sign of the depth correction to the model evidence and the claimed equivalence of the predictive posterior to a data-dependent kernel method; an O(1) correction in the joint limit would alter the leading-order claims.

    Authors: We appreciate the referee pointing out that the joint limit with LP/N=Θ(1) is invoked by direct application of the Neural Covariance SDE from Li et al. (2022) without a fresh convergence proof or explicit error bound in the present manuscript. The setup and O(LP/N) expansion indeed rely on the SDE limit established in the cited work (under large-width assumptions) and then perform a perturbative expansion treating LP/N as the small parameter. We agree this assumption is load-bearing: an uncontrolled O(1) correction would invalidate the leading-order sign of the depth correction to the evidence and the kernel equivalence. Our contribution is the first-order perturbative analysis rather than a new rigorous limit theorem. To address the concern we will revise the manuscript to (i) restate the precise assumptions inherited from Li et al., (ii) explicitly note that all claims are to first order in LP/N with higher-order terms neglected, and (iii) add a short remark that a full joint-limit convergence analysis remains open. This makes the scope of the results transparent without altering the derivations.

    Revision: partial

Circularity Check

1 step flagged

Central claims depend on unverified extension of Neural Covariance SDE to joint L,P,N limit with LP/N=Θ(1)

specific steps
  1. load-bearing self-citation [Abstract]
    "We build on the Neural Covariance SDE (Li et al., 2022) to analyze predictive posteriors in the regime where LP/N∈Θ(1), playing the role of an effective network depth. ... We find to first order in LP/N a simple criterion for which data generating processes benefit from depth in the sense that larger LP/N increases the Bayesian model evidence. We also give a novel derivation of a prior result from the physics literature that at least to first order in LP/N, the Bayesian predictive posterior is remarkably simple and is simply equivalent to that of a data-dependent kernel method."

    The criterion and kernel equivalence are obtained by extending the cited SDE to the joint limit with LP/N fixed at order 1. No new justification is provided for why the SDE approximation holds without O(1) corrections when L scales with P/N; the results therefore reduce to the prior framework's validity in this regime.

full rationale

The paper's strongest claims—a criterion for depth benefiting model evidence and equivalence of the predictive posterior to a data-dependent kernel—are derived to first order in LP/N by invoking the Neural Covariance SDE framework. The analysis explicitly assumes this SDE remains valid in the simultaneous large-P, N, L regime without supplying an independent convergence argument or error bound for the joint limit. This matches the moderate circularity pattern of a load-bearing self-citation (or prior framework by overlapping authors) whose applicability to the new scaling is taken as given rather than re-derived or bounded. The central claims retain some independent content in the first-order expansion and the novel derivation step, preventing a higher score.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities beyond the assumed validity of the Neural Covariance SDE and the perturbative regime.

pith-pipeline@v0.9.1-grok · 5759 in / 1275 out tokens · 26571 ms · 2026-06-28T20:39:54.618004+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 19 canonical work pages · 8 internal anchors

  1. [1]

    High-dimensional dynamics of generalization error in neural networks

    arXiv:1710.03667 [stat.ML]. url: https://arxiv.org/abs/1710.03667. [Bas+25] Federico Bassetti et al. Feature learning in finite-width Bayesian deep linear networks with multiple outputs and convolutional layers

  2. [2]

    Reconciling modern machine-learning practice and the classical bias–variance trade-off

    arXiv:2406.03260 [stat.ML]. url: https://arxiv.org/abs/2406.03260. [Bel+19] Mikhail Belkin et al. “Reconciling modern machine-learning practice and the classical bias–variance trade-off”. In: Proceedings of the National Academy of Sciences 116.32 (July 2019), pp. 15849–15854. issn: 1091-6490. doi: 10.1073/pnas.1903070116. url: http://dx.doi.org/10.1073/pnas....

  3. [3]

    arXiv:2411.15267 [stat.ML]

    arXiv:2411.15267 [stat.ML]. url: https://arxiv.org/abs/2411.15267. [Blu+15] Charles Blundell et al. Weight Uncertainty in Neural Networks

  4. [4]

    Weight Uncertainty in Neural Networks

    arXiv:1505.05424 [stat.ML]. url: https://arxiv.org/abs/1505.05424. [BP22] Lucas Benigni and Sandrine Péché. Largest Eigenvalues of the Conjugate Kernel of Single-Layered Neural Networks

  5. [5]

    Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime

    arXiv:2201.04753 [math.PR]. url: https://arxiv.org/abs/2201.04753. [Cam+25] Francesco Camilli et al. Information-theoretic reduction of deep neural networks to linear models in the overparametrized proportional regime

  6. [6]

    Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks

    arXiv:2505.03577 [math.ST]. url: https://arxiv.org/abs/2505.03577. [Cho23] Clément Chouard. “Deterministic equivalent of the conjugate kernel matrix associated to artificial neural networks”. In: arXiv preprint arXiv:2306.05850 (2023). [COB19] Lenaic Chizat, Edouard Oyallon, and Francis Bach. “On lazy training in differentiable programming”. In: Advances...

  7. [7]

    Nonlinear Approximation and (Deep) ReLU Networks

    arXiv:1905.02199 [cs.LG]. url: https://arxiv.org/abs/1905.02199. [Du+19] Simon S. Du et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks

  8. [8]

    Gradient Descent Provably Optimizes Over-parameterized Neural Networks

    arXiv:1810.02054 [cs.LG]. url: https://arxiv.org/abs/1810.02054. [El 10] Noureddine El Karoui. “The spectrum of kernel random matrices”. In: (2010). [FW20] Zhou Fan and Zhichao Wang. “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks”. In: Advances in Neural Information Processing Systems. Ed. by H. Larochelle et al. Vol

  9. [9]

    Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks

    Curran Associates, Inc., 2020, pp. 7710–7721. url: https://proceedings.neurips.cc/paper_files/paper/2020/file/572201a4497b0b9f02d4f279b09ec30d-Paper.pdf. [GG16] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

  10. [10]

    Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning

    arXiv:1506.02142 [stat.ML]. url: https://arxiv.org/abs/1506.02142. [Han19] Boris Hanin. “Universal function approximation by deep neural nets with bounded width and relu activations”. In: Mathematics 7.10 (2019), p

  11. [11]

    Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies

    [Han24] Boris Hanin. “Random Fully Connected Neural Networks as Perturbatively Solvable Hierarchies”. In: Journal of Machine Learning Research 25.267 (2024), pp. 1–58. url: http://jmlr.org/papers/v25/23-0643.html. [HDR19] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. “Training dynamics of deep networks using stochastic gradient descent via neural tan...

  12. [12]

    Cedzich, J

    arXiv:2503.07872 [math.PR]. url: https://arxiv.org/abs/2503.07872. [HN19a] Boris Hanin and Mihai Nica. “Finite depth and width corrections to the neural tangent kernel”. In: arXiv preprint arXiv:1909.05989 (2019). [HN19b] Boris Hanin and Mihai Nica. “Products of Many Large Random Matrices and Gradients in Deep Neural Networks”. In: Communications in Mathemat-...

  13. [13]

    Neural tangent kernel: Convergence and generalization in neural networks

    arXiv:1806.07572 [cs.LG]. url: https://arxiv.org/abs/1806.07572. [Lee+17] Jaehoon Lee et al. “Deep neural networks as gaussian processes”. In: arXiv preprint arXiv:1711.00165 (2017). [Li+26] Mufan Li et al. Geometric Dyson Brownian Motions and the Free Log-Normal Limit for a Non-Square Product of Random Matrices

  14. [14]

    The neural covariance SDE: Shaped infinite depth-and-width networks at initialization

    arXiv:2310.12079 [stat.ML]. url: https://arxiv.org/abs/2310.12079. [LNR22] Mufan Li, Mihai Nica, and Dan Roy. “The neural covariance SDE: Shaped infinite depth-and-width networks at initialization”. In: Advances in Neural Information Processing Systems 35 (2022), pp. 10795–10808. [LS21] Qianyi Li and Haim Sompolinsky. “Statistical Mechanics of Deep Line...

  15. [15]

    A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit

    arXiv:2306.17759 [stat.ML]. url: https://arxiv.org/abs/2306.17759. [Pac+23] R. Pacelli et al. “A statistical mechanics framework for Bayesian deep neural networks beyond the infinite-width limit”. In: Nature Machine Intelligence 5.12 (Dec. 2023), pp. 1497–1507. issn: 2522-5839. doi: 10.1038/s42256-023-00767-6. url: http://dx.doi.org/10.1038/s42256-023-...

  16. [16]

    Exponential expressivity in deep neural networks through transient chaos

    arXiv:1606.05340 [stat.ML]. url: https://arxiv.org/abs/1606.05340. [PSW13] Nicholas G Polson, James G Scott, and Jesse Windle. “Bayesian inference for logistic models using Pólya–Gamma latent variables”. In: Journal of the American statistical Association 108.504 (2013), pp. 1339–1349. [PW17] Jeffrey Pennington and Pratik Worah. “Nonlinear random matrix ...

  17. [17]

    Deep Information Propagation

    arXiv:1611.01232 [stat.ML]. url: https://arxiv.org/abs/1611.01232. [SCS88] H. Sompolinsky, A. Crisanti, and H. J. Sommers. “Chaos in Random Neural Networks”. In: Phys. Rev. Lett. 61 (3 July 1988), pp. 259–262. doi: 10.1103/PhysRevLett.61.259. url: https://link.aps.org/doi/10.1103/PhysRevLett.61.259. [SNR23] Inbar Seroussi, Gadi Naveh, and Zohar Ringel. “Sepa...

  18. [18]

    Mean field analysis of neural networks: A law of large numbers

    [SS20] Justin Sirignano and Konstantinos Spiliopoulos. “Mean field analysis of neural networks: A law of large numbers”. In: SIAM Journal on Applied Mathematics 80.2 (2020), pp. 725–752. [Tre23] Dario Trevisan. Wide Deep Neural Networks with Gaussian Weights are Very Close to Gaussian Processes

  19. [19]

    Computing with Infinite Networks

    arXiv:2312.11737 [math.ST]. url: https://arxiv.org/abs/2312.11737. [Wil96] Christopher Williams. “Computing with Infinite Networks”. In: Advances in Neural Information Processing Systems. Ed. by M.C. Mozer, M. Jordan, and T. Petsche. Vol

  20. [20]

    https://proceedings.neurips.cc/paper_files/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf

    MIT Press, 1996. url: https://proceedings.neurips.cc/paper_files/paper/1996/file/ae5e3ce40e0404a45ecacaaf05e5f735-Paper.pdf. [WWF24] Zhichao Wang, Denny Wu, and Zhou Fan. Nonlinear spiked covariance matrices and signal propagation in deep neural networks

  21. [21]

    Tensor Programs III: Neural Matrix Laws

    arXiv:2402.10127 [stat.ML]. url: https://arxiv.org/abs/2402.10127. [Yan21] Greg Yang. Tensor Programs III: Neural Matrix Laws

  22. [22]

    Deep learning without shortcuts: Shaping the kernel with tailored rectifiers

    arXiv:2009.10685 [cs.NE]. url: https://arxiv.org/abs/2009.10685. [ZBM22] Guodong Zhang, Aleksandar Botev, and James Martens. “Deep learning without shortcuts: Shaping the kernel with tailored rectifiers”. In: arXiv preprint arXiv:2203.08120 (2022). [Zha+17] Chiyuan Zhang et al. Understanding deep learning requires rethinking generalization

  23. [23]

    Understanding deep learning requires rethinking generalization

    arXiv:1611.03530 [cs.LG]. url: https://arxiv.org/abs/1611.03530. A Computing the prior: Neural Covariance SDE. Let us first lay out the foundations of NSDE as we need. Recall again the forward pass (1): $z^1 = \frac{1}{\sqrt{N_0}} W^0 x$, $\phi^\ell = \phi_s(z^\ell)$, $z^{\ell+1} = \sqrt{\frac{c}{N}}\, W^\ell \phi^\ell$, $y = z^{\mathrm{out}} = \sqrt{\frac{c}{N}}\, W^{\mathrm{out}} \phi^L \in \mathbb{R}$. Consider network weights at initialization with shaped activation $\phi_s$ in (1) on ...