Collective Kernel EFT for Pre-activation ResNets
Pith reviewed 2026-05-10 08:11 UTC · model grok-4.3
The pith
Pre-activation ResNets admit an exact stochastic recursion for the empirical kernel G that Gaussian approximations convert into ODEs for the mean, covariance, and 1/n correction, yet the G-only closure fails for the correction term.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Exploiting the exact conditional Gaussianity of residual increments yields an exact stochastic recursion for the empirical kernel G in finite-width pre-activation ResNets. Applying Gaussian approximations to this recursion produces a continuous-depth ODE system for the mean kernel K0, the kernel covariance V4, and the 1/n mean correction K1,EFT that arises as a one-loop tadpole correction. The mean kernel equation stays accurate across depths, but the V4 equation accumulates an O(1) residual error driven by approximation errors in the G-only transport term, and the K1,EFT equation fails because the source closure exhibits a systematic mismatch already at initialization.
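To make the conditional-Gaussianity step concrete, here is a minimal single-input sketch. It is not the paper's recursion: the update rule h_next = h + sqrt(dt/n) W relu(h), the ReLU nonlinearity, the weight scale sigma_w, and all function names are assumptions chosen for illustration. Conditioned on the current layer, the residual increment is exactly Gaussian, so one step of G can be sampled from two scalars (G and the sigma-kernel Sigma) without forming the weight matrix.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def step_direct(h, dt, sigma_w, rng):
    """One residual layer with an explicit n x n weight matrix; returns the new G."""
    n = h.size
    W = sigma_w * rng.standard_normal((n, n))
    h_new = h + np.sqrt(dt / n) * (W @ relu(h))
    return h_new @ h_new / n

def step_scalar(G, Sigma, n, dt, sigma_w, rng):
    """Same one-step law for G, sampled from scalars only.

    Given h, the increment has i.i.d. N(0, dt * s2) entries, s2 = sigma_w^2 * Sigma.
    Split it into the part along h (a) and the part orthogonal to h (perp).
    """
    s2 = sigma_w**2 * Sigma
    a = np.sqrt(s2) * rng.standard_normal()   # projection onto h / |h|
    perp = s2 * rng.chisquare(n - 1)          # squared norm of the orthogonal part
    return G + 2.0 * np.sqrt(dt * G / n) * a + (dt / n) * (a**2 + perp)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, dt, sigma_w = 32, 0.5, 1.0             # illustrative values only
    h = rng.standard_normal(n)
    G, Sigma = h @ h / n, relu(h) @ relu(h) / n
    direct = np.array([step_direct(h, dt, sigma_w, rng) for _ in range(2000)])
    scalar = np.array([step_scalar(G, Sigma, n, dt, sigma_w, rng) for _ in range(2000)])
    print("mean:", direct.mean(), scalar.mean(), "exact:", G + dt * sigma_w**2 * Sigma)
    print("std: ", direct.std(), scalar.std())
```

In this toy setting the next step needs the new Sigma, which depends on the full vector h and not on G alone; that is the point at which a G-only description has to introduce a closure.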
What carries the argument
The G-only closure hierarchy that reduces the state space to the empirical kernel G and its moments, enabling both the exact recursion and the subsequent approximate ODEs for its mean and fluctuations.
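As a sketch of what a G-only closure can mean, consider a toy single-input version of the setup. Everything here is an illustrative assumption (the update rule, the step size $\delta t$, the weight variance $\sigma_w^2$, and the symbol $\Sigma^\ell$ for the sigma-kernel); the paper's own equations are not reproduced on this page, and this is only one natural reading of the closure.

```latex
% Toy pre-activation residual update (assumed form, single input, width n)
h^{\ell+1} = h^{\ell} + \sqrt{\tfrac{\delta t}{n}}\, W^{\ell} \phi(h^{\ell}),
\qquad W^{\ell}_{ij} \sim \mathcal{N}(0, \sigma_w^2)

% Empirical kernel and sigma-kernel
G^{\ell} = \tfrac{1}{n} \sum_{i=1}^{n} (h^{\ell}_i)^2,
\qquad
\Sigma^{\ell} = \tfrac{1}{n} \sum_{i=1}^{n} \phi(h^{\ell}_i)^2

% Exact conditional mean (the increment is Gaussian given h^\ell)
\mathbb{E}\!\left[ G^{\ell+1} \mid h^{\ell} \right]
  = G^{\ell} + \delta t\, \sigma_w^2\, \Sigma^{\ell}

% G-only closure: replace \Sigma^\ell by its Gaussian expectation at variance G^\ell
\Sigma^{\ell} \;\approx\; \mathbb{E}_{z \sim \mathcal{N}(0, G^{\ell})}\!\left[ \phi(z)^2 \right]
```

The first three lines are exact in this toy model. Only the last line is an approximation, and it is the step that discards the sigma-kernel as an independent variable.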
If this is right
- The mean kernel K0 can be tracked accurately at arbitrary depth using the derived ODE.
- The covariance V4 develops accumulating O(1) errors at finite depth from the G-only transport approximation.
- The 1/n correction K1,EFT is unreliable because its source closure breaks immediately.
- Any G-only state-space reduction will inherit the same source-closure limitation for higher-order corrections.
Where Pith is reading between the lines
- Including the sigma-kernel in the state space would likely remove the source mismatch and restore accuracy for the 1/n term.
- The exact recursion provides a benchmark that can test whether other kernel approximations in residual networks suffer similar closure failures.
- Architectures that share the same residual-increment structure may exhibit analogous limitations when only the kernel is retained in the state.
- The initialization mismatch suggests that even static finite-width corrections require a larger state space from the outset.
Load-bearing premise
The K1,EFT equation relies on the G-only source closure for the 1/n correction being accurate, and the paper's own numerics show that this closure is already violated at initialization.
What would settle it
Direct simulation of pre-activation ResNets at initialization that compares the measured source term for the 1/n correction against the closed expression; a systematic discrepancy confirms the closure is invalid.
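A check of this kind could be organized as below. This is a sketch only: the paper's actual source term and its closed expression are not given on this page, so the code substitutes a stand-in pair for a single input with an assumed Gaussian first layer. The stand-in "measured" quantity is the sigma-kernel Sigma, and the stand-in "closed" expression is the hypothetical G-only value G/2 (the ReLU Gaussian expectation at variance G). The function name and parameters are illustrative.

```python
import numpy as np

def closure_gap_at_init(n, trials, g0=1.0, seed=0):
    """Sample (measured - closed) for a stand-in source term at layer 0.

    measured: Sigma = mean(relu(h)^2) over the n units
    closed:   G / 2 with G = mean(h^2), i.e. E[relu(z)^2] for z ~ N(0, G)
    """
    rng = np.random.default_rng(seed)
    h = np.sqrt(g0) * rng.standard_normal((trials, n))  # assumed Gaussian first layer
    G = (h**2).mean(axis=1)
    Sigma = (np.maximum(h, 0.0) ** 2).mean(axis=1)
    return Sigma - 0.5 * G

if __name__ == "__main__":
    for n in (32, 128, 512):
        gap = closure_gap_at_init(n, trials=4000)
        stderr = gap.std(ddof=1) / np.sqrt(gap.size)
        # A systematic mismatch would show up as |mean| well above a few stderr.
        print(f"n={n:4d}  mean gap={gap.mean():+.4f} +/- {stderr:.4f}  "
              f"n*var={n * gap.var():.3f}")
```

For this stand-in the mean gap vanishes by sign symmetry and only fluctuations of order 1/n remain (n times the variance approaches 3 g0^2 / 4), so it does not reproduce the paper's reported mismatch; it only shows the shape of the test. Substituting the paper's source term and closure for the two quantities labeled measured and closed would give the decisive comparison.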
Original abstract
In finite-width deep neural networks, the empirical kernel $G$ evolves stochastically across layers. We develop a collective kernel effective field theory (EFT) for pre-activation ResNets based on a $G$-only closure hierarchy and diagnose its finite validity window. Exploiting the exact conditional Gaussianity of residual increments, we derive an exact stochastic recursion for $G$. Applying Gaussian approximations systematically yields a continuous-depth ODE system for the mean kernel $K_0$, the kernel covariance $V_4$, and the $1/n$ mean correction $K_{1,\mathrm{EFT}}$, which emerges diagrammatically as a one-loop tadpole correction. Numerically, $K_0$ remains accurate at all depths. However, the $V_4$ equation residual accumulates to an $O(1)$ error at finite time, primarily driven by approximation errors in the $G$-only transport term. Furthermore, $K_{1,\mathrm{EFT}}$ fails due to the breakdown of the source closure, which exhibits a systematic mismatch even at initialization. These findings highlight the limitations of $G$-only state-space reduction and suggest extending the state space to incorporate the sigma-kernel.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript develops a collective kernel effective field theory (EFT) for pre-activation ResNets. Exploiting the exact conditional Gaussianity of residual increments, it derives an exact stochastic recursion for the empirical kernel G. Systematic Gaussian approximations then produce a continuous-depth ODE system governing the mean kernel K0, the kernel covariance V4, and the 1/n mean correction K1,EFT (interpreted diagrammatically as a one-loop tadpole). Numerical experiments show that K0 remains accurate across depths, whereas V4 accumulates O(1) residuals driven primarily by errors in the G-only transport term, and K1,EFT fails outright because the source closure for the 1/n term is mismatched even at initialization. The work diagnoses the finite validity window of the G-only closure hierarchy and proposes extending the state space to include the sigma-kernel.
Significance. The exact recursion for G, derived directly from the network architecture and conditional Gaussianity, is a clear technical strength. The paper's self-diagnosis of where the Gaussian approximations break down (quantifying the O(1) accumulation in V4 and the initialization-level failure of the source closure for K1,EFT) adds substantial value by delineating the limitations of G-only state-space reductions in finite-width kernel theories. This provides a concrete roadmap for more accurate extensions rather than over-claiming the approximations' reach.
major comments (2)
- [derivation of the ODE system and source closure] The G-only source closure for the 1/n correction term (used to close the hierarchy for K1,EFT) is shown to mismatch even at initialization; this is load-bearing for the claim that K1,EFT can be obtained as a controlled one-loop correction, since the mismatch drives the outright failure of that equation. A more detailed expansion of the closure assumption (e.g., which moments are neglected) would clarify whether the breakdown is fundamental to the G-only reduction or fixable by higher-order terms.
- [numerical diagnosis of V4 residuals] The V4 equation residual is reported to accumulate to O(1) primarily from the G-only transport term approximation. Because this error is O(1) rather than perturbative, it undermines the utility of the continuous-depth ODE for V4 at finite depths; the manuscript should quantify the separate contributions of each Gaussian approximation (transport vs. others) to isolate the dominant source.
minor comments (3)
- [abstract and numerical results] The abstract states that K0 remains accurate 'at all depths' but provides no error bars or quantitative thresholds; the numerical section should report these explicitly (e.g., relative error vs. depth and width) to support the accuracy claim.
- [introduction] Notation for K1,EFT should be introduced with a brief comparison to other 1/n corrections appearing in the kernel-EFT literature to avoid confusion for readers.
- [figures] Figure captions for the numerical experiments should specify the exact ResNet depths, widths, and initialization variances used, as these directly affect the observed validity window of the approximations.
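The kind of table requested in the first minor comment could be produced along these lines. The sketch uses a toy single-input pre-activation ResNet with ReLU and an assumed update h_next = h + sqrt(dt/n) W relu(h); the reference curve is the infinite-width recursion K_next = K (1 + dt sigma_w^2 / 2) for that toy model, not the paper's K0 equation. Function names, widths, depths, and trial counts are placeholders.

```python
import numpy as np

def simulate_kernel(n, depth, dt, sigma_w=1.0, g0=1.0, rng=None):
    """Empirical kernel G^l = |h^l|^2 / n for l = 0..depth in one random network."""
    rng = np.random.default_rng() if rng is None else rng
    h = np.sqrt(g0) * rng.standard_normal(n)        # assumed Gaussian first layer
    G = np.empty(depth + 1)
    G[0] = h @ h / n
    for l in range(depth):
        W = sigma_w * rng.standard_normal((n, n))
        h = h + np.sqrt(dt / n) * (W @ np.maximum(h, 0.0))
        G[l + 1] = h @ h / n
    return G

def mean_kernel_relu(depth, dt, sigma_w=1.0, g0=1.0):
    """Infinite-width mean kernel of the toy model (ReLU: E[relu(z)^2] = K / 2)."""
    return g0 * (1.0 + 0.5 * dt * sigma_w**2) ** np.arange(depth + 1)

def relative_error_table(widths, depth, dt, trials, seed=0):
    """Return {n: (rel_err, rel_stderr)}, each an array over layers 0..depth."""
    rng = np.random.default_rng(seed)
    K = mean_kernel_relu(depth, dt)
    out = {}
    for n in widths:
        runs = np.stack([simulate_kernel(n, depth, dt, rng=rng) for _ in range(trials)])
        mean = runs.mean(axis=0)
        stderr = runs.std(axis=0, ddof=1) / np.sqrt(trials)  # Monte Carlo error bar
        out[n] = ((mean - K) / K, stderr / K)
    return out

if __name__ == "__main__":
    table = relative_error_table(widths=(32, 64, 128), depth=10, dt=0.1, trials=200)
    for n, (err, bar) in table.items():
        print(f"n={n:4d}  depth 10: rel. error {err[-1]:+.4f} +/- {bar[-1]:.4f}")
```

Reporting the error bar next to each relative error makes it possible to tell a genuine finite-width deviation from sampling noise at a given depth and width.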
Simulated Author's Rebuttal
We thank the referee for the thorough review and constructive feedback. The comments help clarify the presentation of our diagnostic results on the limitations of the G-only collective kernel EFT. We respond to each major comment below and will incorporate revisions accordingly.
Point-by-point responses
Referee: [derivation of the ODE system and source closure] The G-only source closure for the 1/n correction term (used to close the hierarchy for K1,EFT) is shown to mismatch even at initialization; this is load-bearing for the claim that K1,EFT can be obtained as a controlled one-loop correction, since the mismatch drives the outright failure of that equation. A more detailed expansion of the closure assumption (e.g., which moments are neglected) would clarify whether the breakdown is fundamental to the G-only reduction or fixable by higher-order terms.
Authors: We agree that a more detailed exposition of the source-closure assumptions is warranted to characterize the breakdown. In the revised manuscript, we will expand the relevant section (and add an appendix if necessary) to list explicitly the moments neglected in the G-only closure for the 1/n term. This expansion will show that the mismatch originates from the absence of sigma-kernel correlations, which the G-only state space does not capture. The breakdown is therefore fundamental to the G-only reduction and cannot be repaired by higher-order terms within the current framework, which motivates our proposed extension to include the sigma-kernel.
Revision: yes
Referee: [numerical diagnosis of V4 residuals] The V4 equation residual is reported to accumulate to O(1) primarily from the G-only transport term approximation. Because this error is O(1) rather than perturbative, it undermines the utility of the continuous-depth ODE for V4 at finite depths; the manuscript should quantify the separate contributions of each Gaussian approximation (transport vs. others) to isolate the dominant source.
Authors: We appreciate this suggestion for strengthening the numerical diagnosis. Our current analysis attributes the O(1) accumulation primarily to the G-only transport term, based on the structure of the approximations. To address the request, we will perform and report an additional decomposition in the revision, comparing the full residual against residuals obtained with individual Gaussian approximations selectively disabled (e.g., transport only, or variance closures only). This will isolate the contributions quantitatively and confirm the dominant role of the transport term, while also highlighting the non-perturbative nature of the error.
Revision: yes
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper starts from the exact conditional Gaussianity of residual increments (a direct consequence of the pre-activation ResNet architecture) to derive an exact stochastic recursion for the empirical kernel G. It then applies explicit Gaussian approximations to close the hierarchy and obtain the continuous-depth ODE system for K0, V4, and the 1/n correction K1,EFT. The manuscript itself diagnoses the breakdown of the G-only source closure (mismatch already at initialization) and the O(1) accumulation in the V4 residual, without any parameter fitting to target quantities, self-definitional loops, or load-bearing self-citations. All reported quantities follow from the stated architecture and approximations rather than reducing to their own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Exact conditional Gaussianity of residual increments
- ad hoc to paper: Gaussian approximations for the G-only closure hierarchy