pith. machine review for the scientific record.

arxiv: 2605.06563 · v1 · submitted 2026-05-07 · 💻 cs.LG · hep-th

Recognition: unknown

Criticality and Saturation in Orthogonal Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:22 UTC · model grok-4.3

classification 💻 cs.LG hep-th
keywords orthogonal initialization · finite-width expansion · neural network statistics · recursion relations · tensor stability · activation functions · criticality · saturation

The pith

Orthogonal weight initializations stabilize the finite-width correction tensors of neural networks as depth increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit recursion relations that track the evolution of finite-width correction tensors layer by layer when weights are initialized orthogonally. These relations show that, for activation functions with a vanishing fixed point, the tensors approach a stable value after sufficiently many layers, in contrast to the growth seen with independent random weights. A reader would care because the recursions supply a direct mathematical reason for the empirical advantage of orthogonal initializations in deep training. The same relations also extend diagrammatic techniques to all orders of the inverse-width expansion.
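The layer-by-layer structure being extended is easy to picture. The sketch below is an orientation aid only, not the paper's finite-width relations: it iterates the standard infinite-width, single-input kernel recursion for a tanh network by Gauss-Hermite quadrature, with the constants C_W and C_b and the parameterization assumed for illustration. The paper's contribution is the layer-wise recursions for the 1/width correction tensors that ride on top of this backbone.

```python
# A minimal orientation sketch (not the paper's equations): the standard
# infinite-width, single-input NNGP kernel recursion
#     K^(l+1) = C_b + C_W * E_{z ~ N(0, K^(l))}[tanh(z)^2],
# evaluated by Gauss-Hermite quadrature.
import numpy as np

def gauss_expect(f, K, n_quad=80):
    """E_{z ~ N(0, K)}[f(z)] via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    return np.sum(w * f(np.sqrt(2.0 * K) * x)) / np.sqrt(np.pi)

def kernel_recursion(K0, depth, C_W=1.0, C_b=0.0):
    """Iterate the single-input kernel for a tanh network; returns K^(0..depth)."""
    Ks = [K0]
    for _ in range(depth):
        Ks.append(C_b + C_W * gauss_expect(lambda z: np.tanh(z) ** 2, Ks[-1]))
    return np.array(Ks)

if __name__ == "__main__":
    # For tanh, C_W = 1 is the critical choice: the kernel relaxes toward the
    # vanishing fixed point K* = 0 only slowly, while C_W < 1 collapses it
    # exponentially and C_W > 1 drives it to a nonzero fixed point.
    for C_W in (0.9, 1.0, 1.1):
        K = kernel_recursion(K0=0.5, depth=30, C_W=C_W)
        print(f"C_W = {C_W}: K at layers 1, 10, 30 ->", K[[1, 10, 30]].round(4))
```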

Core claim

We derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of Feynman diagrams for the corresponding recursions which are valid to all orders in 1/width. We show explicitly that the recursions reproduce the stability of the finite-width tensors observed for activation functions with vanishing fixed point. Numerical solutions of the recursions and their large-depth expansions agree with Monte-Carlo estimates from network ensembles.

What carries the argument

Layer-wise recursion relations for the tensors in the 1/width expansion of network statistics under orthogonal initialization, which track moment evolution across layers and produce saturation at large depth.

If this is right

  • The recursions allow computation of network statistics at arbitrary depth without full ensemble simulation.
  • Stability of the tensors holds specifically when the activation function has a vanishing fixed point.
  • The diagrammatic extension covers all orders in the inverse-width expansion for the orthogonal case (the pairing combinatorics it organizes is sketched after this list).
  • The derived relations close the theoretical account of why orthogonal weights prevent divergence of finite-width corrections in deep networks.
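On the diagrammatic point, the bookkeeping being extended organizes Wick-type pairings of external labels, with the orthogonal ensemble reweighting reconnections of those pairings by Weingarten factors. The sketch below only enumerates the (2m - 1)!! perfect pairings of 2m labels, i.e. the Gaussian combinatorial skeleton; it implements none of the paper's Feynman rules or Weingarten weights.

```python
# Sketch of the combinatorial skeleton behind the diagrams: enumerate the
# (2m - 1)!! perfect pairings of 2m external labels (Wick/Gaussian pairings).
# The orthogonal ensemble reweights reconnections of these pairings with
# Weingarten factors; that part is not implemented here.
from math import prod

def pairings(labels):
    """Yield every perfect pairing of an even-length tuple of labels."""
    if not labels:
        yield ()
        return
    first, rest = labels[0], labels[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub in pairings(remaining):
            yield ((first, partner),) + sub

def n_pairings(m):
    """(2m - 1)!!, the number of perfect pairings of 2m labels."""
    return prod(range(1, 2 * m, 2))

if __name__ == "__main__":
    for m in (1, 2, 3):
        count = sum(1 for _ in pairings(tuple(range(2 * m))))
        print(f"2m = {2 * m}: {count} pairings (expected {n_pairings(m)})")
```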

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The recursions could be solved analytically to predict the depth at which saturation begins for given width and activation.
  • Similar recursion structures might appear under other structured initializations that preserve norm.
  • The method supplies a route to study how criticality conditions interact with finite-width effects in deeper architectures.

Load-bearing premise

The leading terms of the inverse-width power series continue to dominate the network behavior even when depth becomes large.

What would settle it

The claim would be undermined if numerical iteration of the derived recursions yielded tensors growing without bound with depth while direct Monte-Carlo sampling from finite-width orthogonal networks produced saturating tensors.
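On the Monte-Carlo side of that comparison, the sketch below shows one way such an ensemble estimate might be set up: forward a single input through many independently initialized finite-width tanh networks and track a crude single-input proxy for the quartic vertex layer by layer. The proxy (n times the ensemble variance of the per-layer kernel estimate, minus twice its squared mean), the width, depth, ensemble size, and normalization conventions are all illustrative assumptions, not the paper's estimator.

```python
# Sketch: ensemble Monte Carlo for finite-width tanh networks, orthogonal vs.
# i.i.d. Gaussian weights at C_W = 1. The quartic-vertex proxy
#     V_hat(l) = n * Var_ensemble[k_hat(l)] - 2 * Mean_ensemble[k_hat(l)]^2,
# with k_hat(l) = mean_i z_i(l)^2 for one fixed input, is an illustrative
# stand-in for the paper's finite-width tensors, not its actual estimator.
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    """Haar-distributed orthogonal matrix via QR with sign correction."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def sample_weight(n, C_W, orthogonal):
    if orthogonal:
        return np.sqrt(C_W) * haar_orthogonal(n)
    return rng.standard_normal((n, n)) * np.sqrt(C_W / n)

def layer_kernels(x, depth, C_W, orthogonal):
    """k_hat(l) = mean_i z_i(l)^2 for one network realization."""
    n = x.size
    z = sample_weight(n, C_W, orthogonal) @ x            # z^(1) = W^(1) x
    kernels = [np.mean(z ** 2)]
    for _ in range(depth - 1):                           # z^(l+1) = W^(l+1) tanh(z^(l))
        z = sample_weight(n, C_W, orthogonal) @ np.tanh(z)
        kernels.append(np.mean(z ** 2))
    return np.array(kernels)

def quartic_proxy(n=50, depth=10, n_nets=600, C_W=1.0, orthogonal=True):
    x = rng.standard_normal(n)                           # one fixed input, O(1) components
    k = np.stack([layer_kernels(x, depth, C_W, orthogonal) for _ in range(n_nets)])
    return n * k.var(axis=0) - 2.0 * k.mean(axis=0) ** 2

if __name__ == "__main__":
    for orthogonal in (True, False):
        v = quartic_proxy(orthogonal=orthogonal)
        label = "orthogonal" if orthogonal else "i.i.d. Gaussian"
        print(f"{label}: V proxy at layers 1, 5, 10 ->", v[[0, 4, 9]].round(3))
```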

Figures

Figures reproduced from arXiv: 2605.06563 by Jan E. Gerken, Max Guillen.

Figure 1: Gradient stability. Monte Carlo estimates of the NTK Θ^(ℓ)_{αβ} for a tanh orthogonal network (n = 50, L = 30) across varying C_W, with the critical case C_W = 1 shown in the center. Means are computed over 600 initializations; shaded regions denote standard errors (typically not visible). Results are consistent with Theorem 5.1.
Figure 2: Orthogonal saturation. (a) Normalized NTK tensors from Monte Carlo simulations of tanh orthogonal networks (n = 50, L = 10) at criticality, compared with theoretical predictions. Means are computed over 600 initializations; shaded regions denote standard deviations (typically not visible), showing quantitative agreement. (b) Large-ℓ expansions of the NNGP and quartic vertex V, compared with exact solutions…
Figure 3: The NNGP and NTK. Solution of the single-input recursion relations (6) and (11). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (141) and (11), while orange boxes represent the large-ℓ expansions (157) and (159), evaluated at integer ℓ. The tensor magnitudes remain smaller than their Gaussian counterparts and exhibit early-layer saturation.
Figure 4: The NTK mixed tensors. Solution of the single-input recursion relations (86) and (87). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (86) and (87), while orange boxes represent the large-ℓ expansions (160) and (161), evaluated at integer ℓ. The tensor magnitudes remain smaller than their Gaussian counterparts and exhibit early-layer saturation.
Figure 5: The NTK variance tensors. Solution of the single-input recursion relations (88) and (89). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (88) and (89), while orange boxes represent the large-ℓ expansions (162) and (163), evaluated at integer ℓ. The magnitude of the tensors remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 6: The dNTK tensors. Solutions of the single-input recursion relations (90) and (91). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (90) and (91), while orange boxes represent the large-ℓ expansions (164) and (165), evaluated at integer ℓ. The magnitude of the tensors remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 7: The dINTK tensor. Solution of the single-input recursion relation (92). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solution of (92), while orange boxes represent the large-ℓ expansion (166), evaluated at integer ℓ. The magnitude of the tensor remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 8: The ddIINTK tensors. Solutions to the single-input recursion relations (93), (94), and (95). We consider a tanh network with width n = 50 and depth L = 10, with inputs drawn from (0, 1). Blue points show the exact solutions of (93), (94), and (95), while orange boxes represent the corresponding large-ℓ expansions (167), (168), and (169), evaluated at integer ℓ. The tensor magnitude remains below its Gaussian counterpart and exhibits early-layer saturation.
Figure 9: Orthogonal saturation. Exact solutions of the single-input recursion relations at order 1/n (see Theorems 4.1 and 4.2). We consider a tanh network of width n = 50 with inputs drawn from the real interval (0, 1). All tensors attain smaller values than their Gaussian counterparts and exhibit early-layer saturation. These predictions are consistent with the empirical results of [25].
Figure 10: The sextic vertex. Solution to the single-input recursion relation (133) at order 1/n². We consider a tanh network with width n = 50 and depth L = 30, with inputs drawn from (0, 1). Blue points show the exact solution of (133), while orange boxes represent the large-ℓ expansion (170), evaluated at integer ℓ. The tensor magnitude remains below its Gaussian counterpart and exhibits early-layer saturation…
Figure 11: Variance stability. Components of the Monte Carlo estimate of the NNGP K_{αβ} for a tanh MLP, shown as a function of layer depth ℓ for identical and distinct inputs, across three values of C_W^(ℓ). Hidden layers have width 50. Means are computed over 600 initializations for both non-critical (left, right) and critical (middle) cases. Error bars are shown in all panels (see text).
Figure 12: Stability beyond the perturbative regime. Comparison of the diagonal components of the Monte Carlo estimate, single-input exact solution, and large-ℓ expansion for the NNGP K and quartic vertex V in a tanh MLP with orthogonal initialization. Hidden layers have width 50; means are computed over 600 initializations. (a) The NNGP estimates are in quantitative agreement with the exact solution at both small and large ℓ…
Figure 13: Stability of the four-point cumulant V4 at criticality. Selected components of the Monte Carlo estimate V4 for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 14: Instability of the four-point cumulant V4 away from criticality. Selected components of the Monte Carlo estimate V4 are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 15: Stability of the D tensor at criticality. Selected components of the Monte Carlo estimate D for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 16: Instability of the D tensor away from criticality. Selected components of the Monte Carlo estimate D are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 17: Stability of the F tensor at criticality. Selected components of the Monte Carlo estimate F for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 18: Instability of the F tensor away from criticality. Selected components of the Monte Carlo estimate F are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 19: Stability of the A tensor at criticality. Selected components of the Monte Carlo estimate A for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 20: Instability of the A tensor away from criticality. Selected components of the Monte Carlo estimate A are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 21: Stability of the B tensor at criticality. Selected components of the Monte Carlo estimate B for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 22: Instability of the B tensor away from criticality. Selected components of the Monte Carlo estimate B are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 23: Stability beyond the perturbative regime. Comparison of the diagonal components of the Monte Carlo estimate, single-input exact solution, and large-ℓ expansion for the NTK tensors D and F in a tanh MLP with orthogonal initialization. Hidden layers have width 50; means are computed over 600 initializations. (a) The tensor D estimates are in quantitative agreement with the exact solution at both small and large ℓ…
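Several of the captions above quote an asymptotic power-law fit that starts at a layer ℓ_start. A minimal version of such a fit is a least-squares straight line in log-log coordinates over the tail ℓ ≥ ℓ_start, as sketched below; the paper's exact fitting procedure and error treatment may differ.

```python
# Sketch: asymptotic power-law fit y ~ a * l**b over the tail l >= l_start,
# as a least-squares line in log-log coordinates. The captions above quote
# fits of this form; the paper's exact fitting choices may differ.
import numpy as np

def power_law_fit(layers, values, l_start=10):
    """Fit values = a * layers**b on layers >= l_start; return (a, b)."""
    mask = (layers >= l_start) & (values > 0)
    slope, intercept = np.polyfit(np.log(layers[mask]), np.log(values[mask]), deg=1)
    return np.exp(intercept), slope

if __name__ == "__main__":
    # Synthetic tail data, just to exercise the fit (not data from the paper).
    layers = np.arange(1, 101)
    noise = 1.0 + 0.05 * np.random.default_rng(1).standard_normal(layers.size)
    values = 0.5 * layers ** -1.4 * noise
    a, b = power_law_fit(layers, values)
    print(f"fit: y ≈ {a:.3f} * l^{b:.2f}")
```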
read the original abstract

It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d.-case which are valid to all orders in $1/\mathrm{width}$. Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives explicit layer-wise recursion relations for the leading finite-width correction tensors in the statistics of deep neural networks initialized with orthogonal weights. It extends Feynman diagram techniques (previously for i.i.d. Gaussian weights) to the orthogonal case to all orders in 1/width, obtains closed recursions for the tensors, and shows that these recursions reproduce the large-depth stability of the tensors observed empirically for activations with vanishing fixed points. The theoretical predictions are validated by agreement between numerical solutions of the recursions (and their large-depth analytic expansions) and Monte-Carlo estimates from finite-width network ensembles.

Significance. If the derivations and perturbative control hold, the work supplies the missing theoretical account for why orthogonal initialization stabilizes finite-width corrections at large depth (in contrast to i.i.d. Gaussian initialization), thereby closing a documented gap. The explicit recursions and diagram extension are reusable tools; the Monte-Carlo match provides direct empirical support for the central claim.

major comments (2)
  1. [large-depth expansions and validation sections] The central claim that the derived recursions explain the observed stability rests on the assumption that the leading 1/width tensors remain dominant as depth L grows large. No explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion is provided (see the large-depth analysis and the statement that 'the recursions reproduce the stability'). If higher-order terms accumulate or the effective expansion parameter grows with L via diagram combinatorics, the leading-tensor stability would not suffice to explain the finite-width behavior.
  2. [derivation of recursion relations] The recursion relations are stated to close at leading order after incorporating the orthogonal constraint. However, it is not shown whether orthogonality-induced correlations at finite width can feed back into the leading tensors at depths where the expansion parameter is no longer parametrically small (see the derivation of the layer-wise recursions and the diagram extension).
minor comments (2)
  1. [preliminaries] Notation for the tensors (e.g., the precise definition of the leading correction objects) should be introduced once with an explicit equation reference rather than relying on prior diagram papers.
  2. [experimental validation] The Monte-Carlo validation would benefit from reporting the range of widths and depths tested and the number of independent network realizations per point to allow assessment of statistical error.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive major comments. We address each point below, providing clarifications on the scope of our perturbative analysis and indicating revisions where they strengthen the presentation without altering the central claims.

read point-by-point responses
  1. Referee: [large-depth expansions and validation sections] The central claim that the derived recursions explain the observed stability rests on the assumption that the leading 1/width tensors remain dominant as depth L grows large. No explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion is provided (see the large-depth analysis and the statement that 'the recursions reproduce the stability'). If higher-order terms accumulate or the effective expansion parameter grows with L via diagram combinatorics, the leading-tensor stability would not suffice to explain the finite-width behavior.

    Authors: We agree that the analysis is perturbative in 1/width and that an explicit uniform-in-L remainder bound is not derived. The manuscript establishes closed recursions for the leading-order tensors, obtains their large-depth analytic expansions, and demonstrates quantitative agreement with Monte-Carlo estimates from finite-width ensembles. This agreement holds across the depths examined, indicating that higher-order contributions do not destabilize the leading tensors in practice. In the revised version we will add an explicit caveat in the large-depth section acknowledging the absence of a rigorous truncation bound and clarifying that the explanatory power for observed stability rests on the combination of exact leading-order closure and empirical validation. revision: partial

  2. Referee: [derivation of recursion relations] The recursion relations are stated to close at leading order after incorporating the orthogonal constraint. However, it is not shown whether orthogonality-induced correlations at finite width can feed back into the leading tensors at depths where the expansion parameter is no longer parametrically small (see the derivation of the layer-wise recursions and the diagram extension).

    Authors: The Feynman-diagram extension incorporates the orthogonal constraints order by order in 1/width. At leading order the diagram rules ensure that orthogonality-induced correlations are absorbed into the recursion kernels without introducing feedback from higher-order diagrams into the leading tensors. This decoupling follows from the structure of the orthogonal ensemble and holds at every depth because the recursion is derived by collecting all diagrams that contribute at O(1/width). We will insert a short clarifying paragraph in the derivation section that explicitly states this decoupling and references the diagram rules that prevent higher-order leakage into the leading tensors. revision: yes

standing simulated objections not resolved
  • Absence of an explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion

Circularity Check

0 steps flagged

Derivation of orthogonal recursions independent of stability observation; minor self-citation to prior diagrams

full rationale

The paper derives explicit layer-wise recursion relations for the finite-width tensors directly from the orthogonal weight initialization constraint combined with the 1/width perturbative expansion. These recursions are then solved to reproduce the large-depth stability previously observed for activations with vanishing fixed points, with results matching Monte-Carlo ensemble estimates. This chain does not reduce to a fitted input or self-definition by the paper's equations. A self-citation exists to recently-introduced Feynman diagrams for the i.i.d. case, which is extended here, but the orthogonal recursions constitute new independent content and are externally validated against simulations rather than relying on the citation as load-bearing support.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the validity of the 1/width perturbative expansion around infinite width and on the assumption that the leading correction tensors dominate the depth-dependent behavior. No new free parameters or invented entities are introduced; the recursions follow from the orthogonal constraint applied to the existing expansion.

axioms (2)
  • domain assumption The finite-width expansion in powers of 1/width is a valid asymptotic description of network statistics for large but finite width.
    Invoked when claiming the recursions capture the observed stability.
  • domain assumption Activation functions have a vanishing fixed point (average output zero when input is zero).
    Required for the stability result; stated as the case where stability holds (see the sketch below).
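To make the second axiom concrete: in the effective-theory language, a "vanishing fixed point" means the single-input kernel recursion has a fixed point at K* = 0, which happens when the activation satisfies σ(0) = 0 (tanh does; the logistic sigmoid does not). A minimal numerical check, assuming the standard kernel map g(K) = C_W · E_{z~N(0,K)}[σ(z)²] with zero bias variance:

```python
# Sketch: check whether K* = 0 is a fixed point of the single-input kernel map
#     g(K) = C_W * E_{z ~ N(0, K)}[sigma(z)^2]    (bias variance C_b = 0 assumed).
# For sigma(0) = 0 (tanh), g(K) -> 0 as K -> 0; for the logistic sigmoid,
# sigma(0) = 1/2 and g(0) > 0, so there is no vanishing fixed point.
import numpy as np

def kernel_map(sigma, K, C_W=1.0, n_quad=80):
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    z = np.sqrt(2.0 * K) * x
    return C_W * np.sum(w * sigma(z) ** 2) / np.sqrt(np.pi)

if __name__ == "__main__":
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    for name, sigma in (("tanh", np.tanh), ("logistic", logistic)):
        print(name, [round(kernel_map(sigma, K), 4) for K in (1e-6, 1e-3, 1e-1)])
```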

pith-pipeline@v0.9.0 · 5523 in / 1431 out tokens · 31297 ms · 2026-05-08T12:22:12.420973+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages

  1. [1]

    Understanding the Difficulty of Training Deep Feedforward Neural Networks

    Xavier Glorot and Yoshua Bengio. “Understanding the Difficulty of Training Deep Feedforward Neural Networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Mar. 2010, pp. 249–256

  2. [2]

    Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

    Kaiming He et al. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. In:Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034. arXiv:1502.01852

  3. [3]

    Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    Andrew M. Saxe, James L. McClelland, and Surya Ganguli.Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. Feb. 2014. arXiv:1312.6120

  4. [4]

    All You Need Is a Good Init

    Dmytro Mishkin and Jiri Matas. “All You Need Is a Good Init”. In:International Conference on Learning Representations

  5. [5]

    arXiv, Feb. 2016. arXiv:1511.06422

  6. [6]

    Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

    Wei Hu, Lechao Xiao, and Jeffrey Pennington. “Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks”. In:International Conference on Learning Representations. Sept. 2019. arXiv:2001.05992

  7. [7]

    Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice

    Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. “Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice”. In:Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. arXiv:1711.04735

  8. [8]

    The Emergence of Spectral Universality in Deep Net- works

    Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. “The Emergence of Spectral Universality in Deep Net- works”. In:Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. PMLR, Mar. 2018, pp. 1924–1932. arXiv:1802.09979

  9. [9]

    Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

    Lechao Xiao et al. “Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks”. In:Proceedings of the 35th International Conference on Machine Learning. PMLR, July 2018, pp. 5393–5402

  10. [10]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clement Hongler. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. In:Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018. arXiv:1806.07572

  11. [11]

    On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

    Wei Huang, Weitao Du, and Richard Yi Da Xu. “On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization”. In:Twenty-Ninth International Joint Conference on Artificial Intelligence. Vol. 3. Aug. 2021, pp. 2577– 2583.doi:10.24963/ijcai.2021/355. arXiv:2004.05867

  12. [12]

    The Principles of Deep Learning Theory

    Daniel A. Roberts and Sho Yaida. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge: Cambridge University Press, 2022. isbn: 978-1-316-51933-2. doi: 10.1017/9781009023405. arXiv:2106.10165

  13. [13]

    Asymptotic Behavior of Group Integrals in the Limit of Infinite Rank

    Don Weingarten. “Asymptotic Behavior of Group Integrals in the Limit of Infinite Rank”. In: Journal of Mathematical Physics 19.5 (May 1978), pp. 999–1001. issn: 0022-2488. doi: 10.1063/1.523807

  14. [14]

    Finite-Width Neural Tangent Kernels from Feynman Diagrams

    Max Guillen, Philipp Misof, and Jan E. Gerken. Finite-Width Neural Tangent Kernels from Feynman Diagrams. Aug. 2025. doi: 10.48550/arXiv.2508.11522. arXiv:2508.11522

  15. [15]

    Tiled Convolutional Neural Networks

    Jiquan Ngiam et al. “Tiled Convolutional Neural Networks”. In:Advances in Neural Information Processing Systems. Vol. 23. Curran Associates, Inc., 2010

  16. [16]

    Non-Gaussian Processes and Neural Networks at Finite Widths

    Sho Yaida. “Non-Gaussian Processes and Neural Networks at Finite Widths”. In:Proceedings of The First Mathematical and Scientific Machine Learning Conference. PMLR, Aug. 2020, pp. 165–192. arXiv:1910.00019

  17. [17]

    Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators

    Anindita Maiti, Keegan Stoner, and James Halverson. “Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators”. In:Machine Learning in Pure Mathematics and Theoretical Physics. Chap. Chapter 8, pp. 293–330.doi:10.1142/9781800613706_0008. arXiv:2106.00694

  18. [18]

    Structures of Neural Network Effective Theories

    Ian Banta et al. “Structures of Neural Network Effective Theories”. In:Physical Review D109.10 (May 2024), p. 105007. doi:10.1103/PhysRevD.109.105007. arXiv:2305.02334

  19. [19]

    A Solvable Model of Neural Scaling Laws

    Alexander Maloney, Daniel A. Roberts, and James Sully. A Solvable Model of Neural Scaling Laws. Oct. 2022. arXiv:2210.16859

  20. [20]

    Neural Scaling Laws from Large-N Field Theory: Solvable Model beyond the Ridgeless Limit

    Zhengkang Zhang. “Neural Scaling Laws from Large-N Field Theory: Solvable Model beyond the Ridgeless Limit”. In: Machine Learning: Science and Technology 6.2 (Apr. 2025), p. 025010. issn: 2632-2153. doi: 10.1088/2632-2153/adc872. arXiv:2405.19398

  21. [21]

    Neural Networks and Quantum Field Theory

    James Halverson, Anindita Maiti, and Keegan Stoner. “Neural Networks and Quantum Field Theory”. In:Machine Learning: Science and Technology2.3 (Sept. 2021), p. 035002.issn: 2632-2153.doi:10.1088/2632-2153/abeca3. arXiv:2008.08601

  22. [22]

    The Edge of Chaos: Quantum Field Theory and Deep Neural Networks

    Kevin Grosvenor and Ro Jefferson. “The Edge of Chaos: Quantum Field Theory and Deep Neural Networks”. In:SciPost Physics12.3 (Mar. 2022), p. 081.issn: 2542-4653.doi:10.21468/SciPostPhys.12.3.081. arXiv:2109.13247

  23. [23]

    Neural Network Field Theories: Non-Gaussianity, Actions, and Locality

    Mehmet Demirtas et al. “Neural Network Field Theories: Non-Gaussianity, Actions, and Locality”. In:Machine Learning: Science and Technology5.1 (Jan. 2024), p. 015002.issn: 2632-2153.doi:10.1088/2632-2153/ad17d3. arXiv:2307.03223

  24. [24]

    Integration with Respect to the Haar Measure on Unitary, Orthogonal and Symplectic Groups

    Benoît Collins and Piotr Śniady. “Integration with Respect to the Haar Measure on Unitary, Orthogonal and Symplectic Groups”. In: Communications in Mathematical Physics 264.3 (2006), pp. 773–795. doi: 10.1007/s00220-006-1554-3

  25. [25]

    On Some Properties of Orthogonal Weingarten Functions

    Benoît Collins and Sho Matsumoto. “On some properties of orthogonal Weingarten functions”. In: Journal of Mathematical Physics 50.11 (2009), p. 113516. doi: 10.1063/1.3251304

  26. [26]

    Feature Learning and Generalization in Deep Networks with Orthogonal Weights

    Hannah Day, Yonatan Kahn, and Daniel A. Roberts. “Feature Learning and Generalization in Deep Networks with Orthogonal Weights”. In: Machine Learning: Science and Technology 6.3 (Aug. 2025), p. 035027. issn: 2632-2153. doi: 10.1088/2632-2153/adf278. arXiv:2310.07765

    In the 1 𝑛-expansion ofW [2], only the terms of order 1 𝑛4 and 1 𝑛3 contribute nontrivially. The 1 𝑛4 term arises when all neural indices are distinct − 1 𝑛2 ℓ ⟨𝜎 (ℓ) 𝛼1 𝜎 (ℓ) 𝛼3 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼2 𝜎 (ℓ) 𝛼4 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼5 𝜎 (ℓ) 𝛼6 ⟩𝐾 (ℓ) + ⟨𝜎 (ℓ) 𝛼1 𝜎 (ℓ) 𝛼4 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼2 𝜎 (ℓ) 𝛼3 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼5 𝜎 (ℓ) 𝛼6 ⟩𝐾 (ℓ) 38 +⟨𝜎 (ℓ) 𝛼3 𝜎 (ℓ) 𝛼5 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼4...