pith. machine review for the scientific record.

arxiv: 2605.06563 · v1 · submitted 2026-05-07 · 💻 cs.LG · hep-th

Recognition: unknown

Criticality and Saturation in Orthogonal Neural Networks

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 12:22 UTC · model grok-4.3

classification 💻 cs.LG hep-th
keywords orthogonal initialization · finite-width expansion · neural network statistics · recursion relations · tensor stability · activation functions · criticality · saturation

The pith

Orthogonal weight initializations stabilize the finite-width correction tensors of neural networks as depth increases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives explicit recursion relations that track the evolution of finite-width correction tensors layer by layer when weights are initialized orthogonally. These relations show that, for activation functions with a vanishing fixed point, the tensors approach a stable value after sufficiently many layers, in contrast to the growth seen with independent random weights. A reader would care because the recursions supply a direct mathematical reason for the empirical advantage of orthogonal initializations in deep training. The same relations also extend diagrammatic techniques to all orders of the inverse-width expansion.
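The layer-by-layer structure being extended is easy to picture. The sketch below is an orientation aid only, not the paper's finite-width relations: it iterates the standard infinite-width, single-input kernel recursion for a tanh network by Gauss-Hermite quadrature, with the constants C_W and C_b and the parameterization assumed for illustration. The paper's contribution is the layer-wise recursions for the 1/width correction tensors that ride on top of this backbone.

```python
# A minimal orientation sketch (not the paper's equations): the standard
# infinite-width, single-input NNGP kernel recursion
#     K^(l+1) = C_b + C_W * E_{z ~ N(0, K^(l))}[tanh(z)^2],
# evaluated by Gauss-Hermite quadrature.
import numpy as np

def gauss_expect(f, K, n_quad=80):
    """E_{z ~ N(0, K)}[f(z)] via Gauss-Hermite quadrature."""
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    return np.sum(w * f(np.sqrt(2.0 * K) * x)) / np.sqrt(np.pi)

def kernel_recursion(K0, depth, C_W=1.0, C_b=0.0):
    """Iterate the single-input kernel for a tanh network; returns K^(0..depth)."""
    Ks = [K0]
    for _ in range(depth):
        Ks.append(C_b + C_W * gauss_expect(lambda z: np.tanh(z) ** 2, Ks[-1]))
    return np.array(Ks)

if __name__ == "__main__":
    # For tanh, C_W = 1 is the critical choice: the kernel relaxes toward the
    # vanishing fixed point K* = 0 only slowly, while C_W < 1 collapses it
    # exponentially and C_W > 1 drives it to a nonzero fixed point.
    for C_W in (0.9, 1.0, 1.1):
        K = kernel_recursion(K0=0.5, depth=30, C_W=C_W)
        print(f"C_W = {C_W}: K at layers 1, 10, 30 ->", K[[1, 10, 30]].round(4))
```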

Core claim

We derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of Feynman diagrams for the corresponding recursions which are valid to all orders in 1/width. We show explicitly that the recursions reproduce the stability of the finite-width tensors observed for activation functions with vanishing fixed point. Numerical solutions of the recursions and their large-depth expansions agree with Monte-Carlo estimates from network ensembles.

What carries the argument

Layer-wise recursion relations for the tensors in the 1/width expansion of network statistics under orthogonal initialization, which track moment evolution across layers and produce saturation at large depth.

If this is right

  • The recursions allow computation of network statistics at arbitrary depth without full ensemble simulation.
  • Stability of the tensors holds specifically when the activation function has a vanishing fixed point.
  • The diagrammatic extension covers all orders in the inverse-width expansion for the orthogonal case (the pairing combinatorics it organizes is sketched after this list).
  • The derived relations close the theoretical account of why orthogonal weights prevent divergence of finite-width corrections in deep networks.
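On the diagrammatic point, the bookkeeping being extended organizes Wick-type pairings of external labels, with the orthogonal ensemble reweighting reconnections of those pairings by Weingarten factors. The sketch below only enumerates the (2m - 1)!! perfect pairings of 2m labels, i.e. the Gaussian combinatorial skeleton; it implements none of the paper's Feynman rules or Weingarten weights.

```python
# Sketch of the combinatorial skeleton behind the diagrams: enumerate the
# (2m - 1)!! perfect pairings of 2m external labels (Wick/Gaussian pairings).
# The orthogonal ensemble reweights reconnections of these pairings with
# Weingarten factors; that part is not implemented here.
from math import prod

def pairings(labels):
    """Yield every perfect pairing of an even-length tuple of labels."""
    if not labels:
        yield ()
        return
    first, rest = labels[0], labels[1:]
    for i, partner in enumerate(rest):
        remaining = rest[:i] + rest[i + 1:]
        for sub in pairings(remaining):
            yield ((first, partner),) + sub

def n_pairings(m):
    """(2m - 1)!!, the number of perfect pairings of 2m labels."""
    return prod(range(1, 2 * m, 2))

if __name__ == "__main__":
    for m in (1, 2, 3):
        count = sum(1 for _ in pairings(tuple(range(2 * m))))
        print(f"2m = {2 * m}: {count} pairings (expected {n_pairings(m)})")
```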

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The recursions could be solved analytically to predict the depth at which saturation begins for given width and activation.
  • Similar recursion structures might appear under other structured initializations that preserve norm.
  • The method supplies a route to study how criticality conditions interact with finite-width effects in deeper architectures.

Load-bearing premise

The leading terms of the inverse-width power series continue to dominate the network behavior even when depth becomes large.

What would settle it

The claim would be undermined if numerical iteration of the derived recursions yielded tensors growing without bound with depth while direct Monte-Carlo sampling from finite-width orthogonal networks produced saturating tensors.
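On the Monte-Carlo side of that comparison, the sketch below shows one way such an ensemble estimate might be set up: forward a single input through many independently initialized finite-width tanh networks and track a crude single-input proxy for the quartic vertex layer by layer. The proxy (n times the ensemble variance of the per-layer kernel estimate, minus twice its squared mean), the width, depth, ensemble size, and normalization conventions are all illustrative assumptions, not the paper's estimator.

```python
# Sketch: ensemble Monte Carlo for finite-width tanh networks, orthogonal vs.
# i.i.d. Gaussian weights at C_W = 1. The quartic-vertex proxy
#     V_hat(l) = n * Var_ensemble[k_hat(l)] - 2 * Mean_ensemble[k_hat(l)]^2,
# with k_hat(l) = mean_i z_i(l)^2 for one fixed input, is an illustrative
# stand-in for the paper's finite-width tensors, not its actual estimator.
import numpy as np

rng = np.random.default_rng(0)

def haar_orthogonal(n):
    """Haar-distributed orthogonal matrix via QR with sign correction."""
    q, r = np.linalg.qr(rng.standard_normal((n, n)))
    return q * np.sign(np.diag(r))

def sample_weight(n, C_W, orthogonal):
    if orthogonal:
        return np.sqrt(C_W) * haar_orthogonal(n)
    return rng.standard_normal((n, n)) * np.sqrt(C_W / n)

def layer_kernels(x, depth, C_W, orthogonal):
    """k_hat(l) = mean_i z_i(l)^2 for one network realization."""
    n = x.size
    z = sample_weight(n, C_W, orthogonal) @ x            # z^(1) = W^(1) x
    kernels = [np.mean(z ** 2)]
    for _ in range(depth - 1):                           # z^(l+1) = W^(l+1) tanh(z^(l))
        z = sample_weight(n, C_W, orthogonal) @ np.tanh(z)
        kernels.append(np.mean(z ** 2))
    return np.array(kernels)

def quartic_proxy(n=50, depth=10, n_nets=600, C_W=1.0, orthogonal=True):
    x = rng.standard_normal(n)                           # one fixed input, O(1) components
    k = np.stack([layer_kernels(x, depth, C_W, orthogonal) for _ in range(n_nets)])
    return n * k.var(axis=0) - 2.0 * k.mean(axis=0) ** 2

if __name__ == "__main__":
    for orthogonal in (True, False):
        v = quartic_proxy(orthogonal=orthogonal)
        label = "orthogonal" if orthogonal else "i.i.d. Gaussian"
        print(f"{label}: V proxy at layers 1, 5, 10 ->", v[[0, 4, 9]].round(3))
```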

Figures

Figures reproduced from arXiv: 2605.06563 by Jan E. Gerken, Max Guillen.

Figure 1: Gradient stability. Monte Carlo estimates of the NTK Θ^(ℓ)_{αβ} for a tanh orthogonal network (n = 50, L = 30) across varying C_W, with the critical case C_W = 1 shown in the center. Means are computed over 600 initializations; shaded regions denote standard errors (typically not visible). Results are consistent with Theorem 5.1.
Figure 2: Orthogonal saturation. (a) Normalized NTK tensors from Monte Carlo simulations of tanh orthogonal networks (n = 50, L = 10) at criticality, compared with theoretical predictions. Means are computed over 600 initializations; shaded regions denote standard deviations (typically not visible), showing quantitative agreement. (b) Large-ℓ expansions of the NNGP and quartic vertex V, compared with exact solutions…
Figure 3: The NNGP and NTK. Solution of the single-input recursion relations (6) and (11). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (141) and (11), while orange boxes represent the large-ℓ expansions (157) and (159), evaluated at integer ℓ. The tensor magnitudes remain smaller than their Gaussian counterparts and exhibit early-layer saturation.
Figure 4: The NTK mixed tensors. Solution of the single-input recursion relations (86) and (87). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (86) and (87), while orange boxes represent the large-ℓ expansions (160) and (161), evaluated at integer ℓ. The tensor magnitudes remain smaller than their Gaussian counterparts and exhibit early-layer saturation.
Figure 5: The NTK variance tensors. Solution of the single-input recursion relations (88) and (89). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (88) and (89), while orange boxes represent the large-ℓ expansions (162) and (163), evaluated at integer ℓ. The magnitude of the tensors remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 6: The dNTK tensors. Solutions of the single-input recursion relations (90) and (91). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solutions of (90) and (91), while orange boxes represent the large-ℓ expansions (164) and (165), evaluated at integer ℓ. The magnitude of the tensors remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 7: The dINTK tensor. Solution of the single-input recursion relation (92). We consider a tanh network of width n = 50 and depth L = 10 with inputs drawn from (0, 1). Blue points denote the exact solution of (92), while orange boxes represent the large-ℓ expansion (166), evaluated at integer ℓ. The magnitude of the tensor remains smaller than its Gaussian counterpart and exhibits early-layer saturation.
Figure 8: The ddIINTK tensors. Solutions to the single-input recursion relations (93), (94), and (95). We consider a tanh network with width n = 50 and depth L = 10, with inputs drawn from (0, 1). Blue points show the exact solutions of (93), (94), and (95), while orange boxes represent the corresponding large-ℓ expansions (167), (168), and (169), evaluated at integer ℓ. The tensor magnitude remains below its Gaussian counterpart and exhibits early-layer saturation.
Figure 9: Orthogonal saturation. Exact solutions of the single-input recursion relations at order 1/n (see Theorems 4.1 and 4.2). We consider a tanh network of width n = 50 with inputs drawn from the real interval (0, 1). All tensors attain smaller values than their Gaussian counterparts and exhibit early-layer saturation. These predictions are consistent with the empirical results of [25].
Figure 10: The sextic vertex. Solution to the single-input recursion relation (133) at order 1/n². We consider a tanh network with width n = 50 and depth L = 30, with inputs drawn from (0, 1). Blue points show the exact solution of (133), while orange boxes represent the large-ℓ expansion (170), evaluated at integer ℓ. The tensor magnitude remains below its Gaussian counterpart and exhibits early-layer saturation…
Figure 11: Variance stability. Components of the Monte Carlo estimate of the NNGP K_{αβ} for a tanh MLP, shown as a function of layer depth ℓ for identical and distinct inputs, across three values of C_W^(ℓ). Hidden layers have width 50. Means are computed over 600 initializations for both non-critical (left, right) and critical (middle) cases. Error bars are shown in all panels (see text).
Figure 12: Stability beyond the perturbative regime. Comparison of the diagonal components of the Monte Carlo estimate, single-input exact solution, and large-ℓ expansion for the NNGP K and quartic vertex V in a tanh MLP with orthogonal initialization. Hidden layers have width 50; means are computed over 600 initializations. (a) The NNGP estimates are in quantitative agreement with the exact solution at both small and large ℓ…
Figure 13: Stability of the four-point cumulant V4 at criticality. Selected components of the Monte Carlo estimate V4 for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 14: Instability of the four-point cumulant V4 away from criticality. Selected components of the Monte Carlo estimate V4 are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 15: Stability of the D tensor at criticality. Selected components of the Monte Carlo estimate D for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 16: Instability of the D tensor away from criticality. Selected components of the Monte Carlo estimate D are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 17: Stability of the F tensor at criticality. Selected components of the Monte Carlo estimate F for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 18: Instability of the F tensor away from criticality. Selected components of the Monte Carlo estimate F are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 19: Stability of the A tensor at criticality. Selected components of the Monte Carlo estimate A for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 20: Instability of the A tensor away from criticality. Selected components of the Monte Carlo estimate A are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 21: Stability of the B tensor at criticality. Selected components of the Monte Carlo estimate B for a tanh MLP are shown as a function of layer depth ℓ at the critical value C_W^(ℓ) = 1. Hidden layers have width 50. An asymptotic power-law fit is shown in orange, with the fit starting at ℓ_start. Estimates are obtained from N_net = 600 initializations, with means and error bars computed over N_stats = 10 repetitions (see text).
Figure 22: Instability of the B tensor away from criticality. Selected components of the Monte Carlo estimate B are shown for C_W^(ℓ) < 1 (left) and C_W^(ℓ) > 1 (right). Estimates are computed from N_net = 600 initializations, with means and error bars obtained from N_stats = 10 repetitions (see text).
Figure 23: Stability beyond the perturbative regime. Comparison of the diagonal components of the Monte Carlo estimate, single-input exact solution, and large-ℓ expansion for the NTK tensors D and F in a tanh MLP with orthogonal initialization. Hidden layers have width 50; means are computed over 600 initializations. (a) The tensor D estimates are in quantitative agreement with the exact solution at both small and large ℓ…
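Several of the captions above quote an asymptotic power-law fit that starts at a layer ℓ_start. A minimal version of such a fit is a least-squares straight line in log-log coordinates over the tail ℓ ≥ ℓ_start, as sketched below; the paper's exact fitting procedure and error treatment may differ.

```python
# Sketch: asymptotic power-law fit y ~ a * l**b over the tail l >= l_start,
# as a least-squares line in log-log coordinates. The captions above quote
# fits of this form; the paper's exact fitting choices may differ.
import numpy as np

def power_law_fit(layers, values, l_start=10):
    """Fit values = a * layers**b on layers >= l_start; return (a, b)."""
    mask = (layers >= l_start) & (values > 0)
    slope, intercept = np.polyfit(np.log(layers[mask]), np.log(values[mask]), deg=1)
    return np.exp(intercept), slope

if __name__ == "__main__":
    # Synthetic tail data, just to exercise the fit (not data from the paper).
    layers = np.arange(1, 101)
    noise = 1.0 + 0.05 * np.random.default_rng(1).standard_normal(layers.size)
    values = 0.5 * layers ** -1.4 * noise
    a, b = power_law_fit(layers, values)
    print(f"fit: y ≈ {a:.3f} * l^{b:.2f}")
```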
read the original abstract

It has been known for a long time that initializing weight matrices to be orthogonal instead of having i.i.d. Gaussian components can improve training performance. This phenomenon can be analyzed using finite-width corrections, where the infinite-width statistics are supplemented by a power series in $1/\mathrm{width}$. In particular, recent empirical results by Day et al. show that the tensors appearing in this treatment stabilize for large depth, as opposed to the tensors of i.i.d.-initialized networks. In this article, we derive explicit layer-wise recursion relations for the tensors appearing in the finite-width expansion of the network statistics in the case of orthogonal initializations. We also provide an extension of recently-introduced Feynman diagrams for the corresponding recursions in the i.i.d.-case which are valid to all orders in $1/\mathrm{width}$. Finally, we show explicitly that the recursions we derive reproduce the stability of the finite-width tensors which was observed for activation functions with vanishing fixed point. This work therefore provides a theoretical explanation for the stability of nonlinear networks of finite width initialized with orthogonal weights, closing a long-standing gap in the literature. We validate our theoretical results experimentally by showing that numerical solutions of our recursion relations and their analytical large-depth expansions agree excellently with Monte-Carlo estimates from network ensembles.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper derives explicit layer-wise recursion relations for the leading finite-width correction tensors in the statistics of deep neural networks initialized with orthogonal weights. It extends Feynman diagram techniques (previously for i.i.d. Gaussian weights) to the orthogonal case to all orders in 1/width, obtains closed recursions for the tensors, and shows that these recursions reproduce the large-depth stability of the tensors observed empirically for activations with vanishing fixed points. The theoretical predictions are validated by agreement between numerical solutions of the recursions (and their large-depth analytic expansions) and Monte-Carlo estimates from finite-width network ensembles.

Significance. If the derivations and perturbative control hold, the work supplies the missing theoretical account for why orthogonal initialization stabilizes finite-width corrections at large depth (in contrast to i.i.d. Gaussian initialization), thereby closing a documented gap. The explicit recursions and diagram extension are reusable tools; the Monte-Carlo match provides direct empirical support for the central claim.

major comments (2)
  1. [large-depth expansions and validation sections] The central claim that the derived recursions explain the observed stability rests on the assumption that the leading 1/width tensors remain dominant as depth L grows large. No explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion is provided (see the large-depth analysis and the statement that 'the recursions reproduce the stability'). If higher-order terms accumulate or the effective expansion parameter grows with L via diagram combinatorics, the leading-tensor stability would not suffice to explain the finite-width behavior.
  2. [derivation of recursion relations] The recursion relations are stated to close at leading order after incorporating the orthogonal constraint. However, it is not shown whether orthogonality-induced correlations at finite width can feed back into the leading tensors at depths where the expansion parameter is no longer parametrically small (see the derivation of the layer-wise recursions and the diagram extension).
minor comments (2)
  1. [preliminaries] Notation for the tensors (e.g., the precise definition of the leading correction objects) should be introduced once with an explicit equation reference rather than relying on prior diagram papers.
  2. [experimental validation] The Monte-Carlo validation would benefit from reporting the range of widths and depths tested and the number of independent network realizations per point to allow assessment of statistical error.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive major comments. We address each point below, providing clarifications on the scope of our perturbative analysis and indicating revisions where they strengthen the presentation without altering the central claims.

read point-by-point responses
  1. Referee: [large-depth expansions and validation sections] The central claim that the derived recursions explain the observed stability rests on the assumption that the leading 1/width tensors remain dominant as depth L grows large. No explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion is provided (see the large-depth analysis and the statement that 'the recursions reproduce the stability'). If higher-order terms accumulate or the effective expansion parameter grows with L via diagram combinatorics, the leading-tensor stability would not suffice to explain the finite-width behavior.

    Authors: We agree that the analysis is perturbative in 1/width and that an explicit uniform-in-L remainder bound is not derived. The manuscript establishes closed recursions for the leading-order tensors, obtains their large-depth analytic expansions, and demonstrates quantitative agreement with Monte-Carlo estimates from finite-width ensembles. This agreement holds across the depths examined, indicating that higher-order contributions do not destabilize the leading tensors in practice. In the revised version we will add an explicit caveat in the large-depth section acknowledging the absence of a rigorous truncation bound and clarifying that the explanatory power for observed stability rests on the combination of exact leading-order closure and empirical validation. revision: partial

  2. Referee: [derivation of recursion relations] The recursion relations are stated to close at leading order after incorporating the orthogonal constraint. However, it is not shown whether orthogonality-induced correlations at finite width can feed back into the leading tensors at depths where the expansion parameter is no longer parametrically small (see the derivation of the layer-wise recursions and the diagram extension).

    Authors: The Feynman-diagram extension incorporates the orthogonal constraints order by order in 1/width. At leading order the diagram rules ensure that orthogonality-induced correlations are absorbed into the recursion kernels without introducing feedback from higher-order diagrams into the leading tensors. This decoupling follows from the structure of the orthogonal ensemble and holds at every depth because the recursion is derived by collecting all diagrams that contribute at O(1/width). We will insert a short clarifying paragraph in the derivation section that explicitly states this decoupling and references the diagram rules that prevent higher-order leakage into the leading tensors. revision: yes

standing simulated objections not resolved
  • Absence of an explicit remainder bound or uniform-in-L control on the truncation error of the finite-width expansion

Circularity Check

0 steps flagged

Derivation of orthogonal recursions independent of stability observation; minor self-citation to prior diagrams

full rationale

The paper derives explicit layer-wise recursion relations for the finite-width tensors directly from the orthogonal weight initialization constraint combined with the 1/width perturbative expansion. These recursions are then solved to reproduce the large-depth stability previously observed for activations with vanishing fixed points, with results matching Monte-Carlo ensemble estimates. This chain does not reduce to a fitted input or self-definition by the paper's equations. A self-citation exists to recently-introduced Feynman diagrams for the i.i.d. case, which is extended here, but the orthogonal recursions constitute new independent content and are externally validated against simulations rather than relying on the citation as load-bearing support.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The work rests on the validity of the 1/width perturbative expansion around infinite width and on the assumption that the leading correction tensors dominate the depth-dependent behavior. No new free parameters or invented entities are introduced; the recursions follow from the orthogonal constraint applied to the existing expansion.

axioms (2)
  • domain assumption The finite-width expansion in powers of 1/width is a valid asymptotic description of network statistics for large but finite width.
    Invoked when claiming the recursions capture the observed stability.
  • domain assumption Activation functions have a vanishing fixed point (average output zero when input is zero).
    Required for the stability result; stated as the case where stability holds (see the sketch below).
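To make the second axiom concrete: in the effective-theory language, a "vanishing fixed point" means the single-input kernel recursion has a fixed point at K* = 0, which happens when the activation satisfies σ(0) = 0 (tanh does; the logistic sigmoid does not). A minimal numerical check, assuming the standard kernel map g(K) = C_W · E_{z~N(0,K)}[σ(z)²] with zero bias variance:

```python
# Sketch: check whether K* = 0 is a fixed point of the single-input kernel map
#     g(K) = C_W * E_{z ~ N(0, K)}[sigma(z)^2]    (bias variance C_b = 0 assumed).
# For sigma(0) = 0 (tanh), g(K) -> 0 as K -> 0; for the logistic sigmoid,
# sigma(0) = 1/2 and g(0) > 0, so there is no vanishing fixed point.
import numpy as np

def kernel_map(sigma, K, C_W=1.0, n_quad=80):
    x, w = np.polynomial.hermite.hermgauss(n_quad)
    z = np.sqrt(2.0 * K) * x
    return C_W * np.sum(w * sigma(z) ** 2) / np.sqrt(np.pi)

if __name__ == "__main__":
    logistic = lambda z: 1.0 / (1.0 + np.exp(-z))
    for name, sigma in (("tanh", np.tanh), ("logistic", logistic)):
        print(name, [round(kernel_map(sigma, K), 4) for K in (1e-6, 1e-3, 1e-1)])
```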

pith-pipeline@v0.9.0 · 5523 in / 1431 out tokens · 31297 ms · 2026-05-08T12:22:12.420973+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 22 canonical work pages

  1. [1]

    Understanding the Difficulty of Training Deep Feedforward Neural Networks

    Xavier Glorot and Yoshua Bengio. “Understanding the Difficulty of Training Deep Feedforward Neural Networks”. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings, Mar. 2010, pp. 249–256

  2. [2]

    Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification

    Kaiming He et al. “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification”. In:Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1026–1034. arXiv:1502.01852

  3. [3]

    Exact solutions to the nonlinear dynamics of learning in deep linear neural networks

    Andrew M. Saxe, James L. McClelland, and Surya Ganguli.Exact Solutions to the Nonlinear Dynamics of Learning in Deep Linear Neural Networks. Feb. 2014. arXiv:1312.6120

  4. [4]

    All You Need Is a Good Init

    Dmytro Mishkin and Jiri Matas. “All You Need Is a Good Init”. In:International Conference on Learning Representations

  5. [5]

    arXiv, Feb. 2016. arXiv:1511.06422

  6. [6]

    Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks

    Wei Hu, Lechao Xiao, and Jeffrey Pennington. “Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks”. In:International Conference on Learning Representations. Sept. 2019. arXiv:2001.05992

  7. [7]

    Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice

    Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. “Resurrecting the Sigmoid in Deep Learning through Dynamical Isometry: Theory and Practice”. In:Advances in Neural Information Processing Systems. Vol. 30. Curran Associates, Inc., 2017. arXiv:1711.04735

  8. [8]

    The Emergence of Spectral Universality in Deep Net- works

    Jeffrey Pennington, Samuel Schoenholz, and Surya Ganguli. “The Emergence of Spectral Universality in Deep Net- works”. In:Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics. PMLR, Mar. 2018, pp. 1924–1932. arXiv:1802.09979

  9. [9]

    Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks

    Lechao Xiao et al. “Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks”. In:Proceedings of the 35th International Conference on Machine Learning. PMLR, July 2018, pp. 5393–5402

  10. [10]

    Neural tangent kernel: Convergence and generalization in neural networks

    Arthur Jacot, Franck Gabriel, and Clement Hongler. “Neural Tangent Kernel: Convergence and Generalization in Neural Networks”. In:Advances in Neural Information Processing Systems. Vol. 31. Curran Associates, Inc., 2018. arXiv:1806.07572

  11. [11]

    On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

    Wei Huang, Weitao Du, and Richard Yi Da Xu. “On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization”. In:Twenty-Ninth International Joint Conference on Artificial Intelligence. Vol. 3. Aug. 2021, pp. 2577– 2583.doi:10.24963/ijcai.2021/355. arXiv:2004.05867

  12. [12]

    The Principles of Deep Learning Theory

    Daniel A. Roberts and Sho Yaida. The Principles of Deep Learning Theory: An Effective Theory Approach to Understanding Neural Networks. Cambridge: Cambridge University Press, 2022. isbn: 978-1-316-51933-2. doi: 10.1017/9781009023405. arXiv:2106.10165

  13. [13]

    Asymptotic Behavior of Group Integrals in the Limit of Infinite Rank

    Don Weingarten. “Asymptotic Behavior of Group Integrals in the Limit of Infinite Rank”. In: Journal of Mathematical Physics 19.5 (May 1978), pp. 999–1001. issn: 0022-2488. doi: 10.1063/1.523807

  14. [14]

    Finite-Width Neural Tangent Kernels from Feynman Diagrams

    Max Guillen, Philipp Misof, and Jan E. Gerken. Finite-Width Neural Tangent Kernels from Feynman Diagrams. Aug. 2025. doi: 10.48550/arXiv.2508.11522. arXiv:2508.11522

  15. [15]

    Tiled Convolutional Neural Networks

    Jiquan Ngiam et al. “Tiled Convolutional Neural Networks”. In:Advances in Neural Information Processing Systems. Vol. 23. Curran Associates, Inc., 2010

  16. [16]

    Non-Gaussian Processes and Neural Networks at Finite Widths

    Sho Yaida. “Non-Gaussian Processes and Neural Networks at Finite Widths”. In:Proceedings of The First Mathematical and Scientific Machine Learning Conference. PMLR, Aug. 2020, pp. 165–192. arXiv:1910.00019

  17. [17]

    Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators

    Anindita Maiti, Keegan Stoner, and James Halverson. “Symmetry-via-Duality: Invariant Neural Network Densities from Parameter-Space Correlators”. In:Machine Learning in Pure Mathematics and Theoretical Physics. Chap. Chapter 8, pp. 293–330.doi:10.1142/9781800613706_0008. arXiv:2106.00694

  18. [18]

    Structures of Neural Network Effective Theories

    Ian Banta et al. “Structures of Neural Network Effective Theories”. In:Physical Review D109.10 (May 2024), p. 105007. doi:10.1103/PhysRevD.109.105007. arXiv:2305.02334

  19. [19]

    A Solvable Model of Neural Scaling Laws

    Alexander Maloney, Daniel A. Roberts, and James Sully. A Solvable Model of Neural Scaling Laws. Oct. 2022. arXiv:2210.16859

  20. [20]

    Neural Scaling Laws from Large-N Field Theory: Solvable Model beyond the Ridgeless Limit

    Zhengkang Zhang. “Neural Scaling Laws from Large-N Field Theory: Solvable Model beyond the Ridgeless Limit”. In: Machine Learning: Science and Technology 6.2 (Apr. 2025), p. 025010. issn: 2632-2153. doi: 10.1088/2632-2153/adc872. arXiv:2405.19398

  21. [21]

    Neural Networks and Quantum Field Theory

    James Halverson, Anindita Maiti, and Keegan Stoner. “Neural Networks and Quantum Field Theory”. In:Machine Learning: Science and Technology2.3 (Sept. 2021), p. 035002.issn: 2632-2153.doi:10.1088/2632-2153/abeca3. arXiv:2008.08601

  22. [22]

    The Edge of Chaos: Quantum Field Theory and Deep Neural Networks

    Kevin Grosvenor and Ro Jefferson. “The Edge of Chaos: Quantum Field Theory and Deep Neural Networks”. In:SciPost Physics12.3 (Mar. 2022), p. 081.issn: 2542-4653.doi:10.21468/SciPostPhys.12.3.081. arXiv:2109.13247

  23. [23]

    Neural Network Field Theories: Non-Gaussianity, Actions, and Locality

    Mehmet Demirtas et al. “Neural Network Field Theories: Non-Gaussianity, Actions, and Locality”. In:Machine Learning: Science and Technology5.1 (Jan. 2024), p. 015002.issn: 2632-2153.doi:10.1088/2632-2153/ad17d3. arXiv:2307.03223

  24. [24]

    Integration with Respect to the Haar Measure on Unitary, Orthogonal and Symplectic Groups

    Benoît Collins and Piotr Śniady. “Integration with Respect to the Haar Measure on Unitary, Orthogonal and Symplectic Groups”. In: Communications in Mathematical Physics 264.3 (2006), pp. 773–795. doi: 10.1007/s00220-006-1554-3

  25. [25]

    On Some Properties of Orthogonal Weingarten Functions

    Benoît Collins and Sho Matsumoto. “On some properties of orthogonal Weingarten functions”. In: Journal of Mathematical Physics 50.11 (2009), p. 113516. doi: 10.1063/1.3251304

  26. [26]

    Feature Learning and Generalization in Deep Networks with Orthogonal Weights

    Hannah Day, Yonatan Kahn, and Daniel A. Roberts. “Feature Learning and Generalization in Deep Networks with Orthogonal Weights”. In: Machine Learning: Science and Technology 6.3 (Aug. 2025), p. 035027. issn: 2632-2153. doi: 10.1088/2632-2153/adf278. arXiv:2310.07765

    In the 1 𝑛-expansion ofW [2], only the terms of order 1 𝑛4 and 1 𝑛3 contribute nontrivially. The 1 𝑛4 term arises when all neural indices are distinct − 1 𝑛2 ℓ ⟨𝜎 (ℓ) 𝛼1 𝜎 (ℓ) 𝛼3 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼2 𝜎 (ℓ) 𝛼4 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼5 𝜎 (ℓ) 𝛼6 ⟩𝐾 (ℓ) + ⟨𝜎 (ℓ) 𝛼1 𝜎 (ℓ) 𝛼4 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼2 𝜎 (ℓ) 𝛼3 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼5 𝜎 (ℓ) 𝛼6 ⟩𝐾 (ℓ) 38 +⟨𝜎 (ℓ) 𝛼3 𝜎 (ℓ) 𝛼5 ⟩𝐾 (ℓ) ⟨𝜎 (ℓ) 𝛼4...