Approximating Simple ReLU Networks based on Spectral Decomposition of Fisher Information

Junichi Takeuchi; Ka Long Keith Ho; Yoshinari Takeishi

arXiv: 2505.17907 · v2 · submitted 2025-05-23 · stat.ML · cs.LG

Pith reviewed 2026-05-19 13:43 UTC · model grok-4.3

classification stat.ML cs.LG

keywords Fisher information · ReLU networks · spherical harmonics · neural tangent kernel · spectral analysis · Mercer decomposition · two-layer neural networks · eigenvalue concentration

The pith

In two-layer ReLU networks with random hidden weights, 97.7 percent of the Fisher information trace concentrates in the first three eigenspaces corresponding to spherical harmonics of order at most 2.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the Fisher information matrices of two-layer ReLU networks that have random weights in the hidden layer. It starts from the known fact that the eigenvalues concentrate on a few eigenspaces, with the first three accounting for 97.7 percent of the total trace regardless of the number of parameters. The authors then identify these eigenspaces with the spaces of spherical harmonic functions of order at most 2. The finding connects to the Mercer decomposition of the neural tangent kernel for these networks. This matters because it indicates that the key statistical properties of the network are captured in a low-dimensional function space whose dimension depends on the input dimension but not on the network width.

Core claim

The central claim is an identification. For two-layer ReLU networks with random hidden weights, the Fisher information matrix is already known to concentrate its eigenvalues in the first three eigenspaces, which account for 97.7% of the trace independently of the number of parameters. The paper shows that the function spaces corresponding to these eigenspaces consist of the spherical harmonics of order at most 2.
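A back-of-the-envelope check of where a figure close to 97.7% can come from. This is a heuristic sketch and not the paper's derivation. It assumes that the relevant kernel is the order-1 arc-cosine kernel E[σ(w·x)σ(w'·x)] for standard Gaussian x, and it uses the high-dimensional rule of thumb that the trace share of degree-k harmonics approaches the t^k power-series coefficient of the normalized kernel h(t), with t = cos θ. The names `normalized_kernel`, `coeffs`, and `truncated_series` are illustrative.

```python
import math

def normalized_kernel(t: float) -> float:
    """h(t) = k(t) / k(1) for the order-1 arc-cosine kernel, t = cos(angle)."""
    return (math.sqrt(1.0 - t * t) + (math.pi - math.acos(t)) * t) / math.pi

# Power series of h around t = 0, worked out by hand (an assumption of this sketch):
# h(t) = (1/pi) * (1 + (pi/2) t + t^2/2 + t^4/24 + ...)
coeffs = {0: 1 / math.pi, 1: 0.5, 2: 1 / (2 * math.pi), 4: 1 / (24 * math.pi)}

def truncated_series(t: float) -> float:
    """Series for h(t) truncated after the t^4 term."""
    return sum(a * t**k for k, a in coeffs.items())

# Heuristic share of degrees 0, 1 and 2 in the high-dimensional limit.
low_order_share = coeffs[0] + coeffs[1] + coeffs[2]
print(f"low-order share: {low_order_share:.4f}")
print(f"series vs exact at t=0.3: {truncated_series(0.3):.6f} vs {normalized_kernel(0.3):.6f}")
```

Under these assumptions the three low-order coefficients sum to 1/2 + 3/(2π) ≈ 0.9775, which is in line with the 97.7% quoted in the claim.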

What carries the argument

The spectral decomposition of the Fisher information matrix, which isolates the dominant eigenspaces and maps them onto the spherical harmonic functions of orders 0, 1, and 2.

If this is right

The effective dimension of the network's statistical model is bounded by the dimension of low-order spherical harmonics.
This concentration explains why the Fisher matrix properties do not depend on the width of the network.
The result provides an explicit basis for approximating the network using only quadratic and lower spherical harmonics.
It links the Fisher information spectrum directly to the eigenfunctions in the Mercer expansion of the neural tangent kernel.
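To make the "low-dimensional" statement concrete, the sketch below counts the dimension of the spaces of spherical harmonics of degree 0, 1, and 2 in input dimension d, using the standard dimension formula for harmonic homogeneous polynomials. It only counts harmonics; whether the paper's eigenspaces have exactly these multiplicities is not checked here. The function names are illustrative.

```python
from math import comb

def harmonic_dim(d: int, k: int) -> int:
    """Dimension of the space of degree-k spherical harmonics on the unit sphere in R^d (d >= 2)."""
    if k == 0:
        return 1
    if k == 1:
        return d
    # Homogeneous polynomials of degree k minus those of degree k - 2.
    return comb(k + d - 1, d - 1) - comb(k + d - 3, d - 1)

def low_order_dim(d: int, max_degree: int = 2) -> int:
    """Total dimension of harmonics of degree 0..max_degree."""
    return sum(harmonic_dim(d, k) for k in range(max_degree + 1))

for d in (3, 10, 100):
    print(d, low_order_dim(d))  # equals d * (d + 3) // 2, for example 65 when d = 10
```

The count grows like d²/2 with the input dimension and has no dependence on the number of hidden units.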

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This suggests that optimization or sampling in these networks primarily operates in a low-order harmonic subspace, which could lead to better initialization strategies.
Similar spectral analysis might apply to other activation functions or network depths, potentially revealing analogous low-order structures.
One testable extension is to verify the concentration percentage for networks with non-random weights or different random distributions.

Load-bearing premise

The hidden-layer weights are drawn randomly from a fixed distribution and the network consists of exactly two layers with ReLU activations.

What would settle it

For a concrete two-layer ReLU network with randomly chosen hidden weights, compute its Fisher information matrix, extract the leading eigenvectors, and test whether they align with the spherical harmonics of order at most 2 while summing to about 97.7 percent of the matrix trace.
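A small-scale version of this test can be sketched in pure Python for the order-0 direction only. The setup is an assumption of this sketch: the Fisher matrix is taken as J = E[X Xᵀ] with X_i = σ(W_i·x), hidden weights W_i are standard Gaussian, and inputs are drawn uniformly from the sphere of radius √d (by positive homogeneity of ReLU this gives the same second moments as standard Gaussian inputs). The candidate direction has entries ‖W_i‖/√d, following the definition of v^(0) in the paper's Appendix A. The code reports the Rayleigh quotient of that direction as a fraction of the trace; it does not compute a full eigendecomposition.

```python
import math
import random

def v0_trace_share(d: int, m: int, n_inputs: int, seed: int = 0) -> float:
    """Monte Carlo estimate of (v0' J v0 / v0' v0) / trace(J)."""
    rng = random.Random(seed)
    W = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    sq_norms = [sum(w * w for w in row) for row in W]
    v0 = [math.sqrt(s / d) for s in sq_norms]      # v0_i = ||W_i|| / sqrt(d)
    v0_sq = sum(v * v for v in v0)
    trace_J = 0.5 * sum(sq_norms)                  # E[relu(w.x)^2] = ||w||^2 / 2
    acc = 0.0
    for _ in range(n_inputs):
        g = [rng.gauss(0.0, 1.0) for _ in range(d)]
        scale = math.sqrt(d / sum(c * c for c in g))
        x = [scale * c for c in g]                 # uniform on the sphere of radius sqrt(d)
        proj = 0.0
        for v, row in zip(v0, W):
            pre = sum(wi * xi for wi, xi in zip(row, x))
            if pre > 0.0:                          # ReLU
                proj += v * pre
        acc += proj * proj
    rayleigh = (acc / n_inputs) / v0_sq
    return rayleigh / trace_J

share = v0_trace_share(d=6, m=400, n_inputs=200)
print(f"trace share along v0: {share:.3f}")
```

Repeating the same computation for the order-1 and order-2 candidate directions, and for several widths m, would give the width-independence check described above.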

Figures

Figures reproduced from arXiv: 2505.17907 by Junichi Takeuchi, Ka Long Keith Ho, Yoshinari Takeishi.

**Figure 1.** For approximate eigenvectors v and their limiting functions F from Theorems 3.1 to 3.5, we show the values of X⊤v against the theoretical values F(x). The top left shows the case of v^(0) (Group 1), top right shows the case of v^(l) (Group 2), bottom left shows the case of v^(γ) (Group 3), and the bottom right shows the case of v^(α,β) (Group 3). The plots shown are generated with d = 10 and m = 100000. re…
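The top-left panel can be imitated numerically for a single input. The sketch below compares an empirical value of X⊤v^(0) with the limit (√d / 2π) · B(d/2, 1/2) · ‖x‖ that the paper's Appendix A states for Theorem 3.1. Two things are assumptions of this sketch: X_i = σ(W_i·x) with standard Gaussian W_i, and a 1/m normalization of the sum so that it converges as m grows. The helper names and the test input `x` are illustrative.

```python
import math
import random

def beta(a: float, b: float) -> float:
    """Beta function via the gamma function."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def limit_F0(x):
    """Limiting value sqrt(d)/(2 pi) * B(d/2, 1/2) * ||x||."""
    d = len(x)
    norm_x = math.sqrt(sum(c * c for c in x))
    return math.sqrt(d) / (2.0 * math.pi) * beta(d / 2.0, 0.5) * norm_x

def empirical_Xv0(x, m: int, seed: int = 0) -> float:
    """(1/m) * sum_i relu(W_i . x) * ||W_i|| / sqrt(d) for Gaussian W_i."""
    rng = random.Random(seed)
    d = len(x)
    total = 0.0
    for _ in range(m):
        w = [rng.gauss(0.0, 1.0) for _ in range(d)]
        pre = sum(wi * xi for wi, xi in zip(w, x))
        if pre > 0.0:  # ReLU
            total += pre * math.sqrt(sum(wi * wi for wi in w)) / math.sqrt(d)
    return total / m

x = [0.3 * (i + 1) for i in range(10)]  # arbitrary input with d = 10
est = empirical_Xv0(x, m=50_000)
ref = limit_F0(x)
print(f"empirical {est:.4f}  vs  limit {ref:.4f}")
```

Sweeping over many inputs x and plotting `est` against `ref` would give a scatter of the same kind as the panel described in the caption.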

Original abstract

Properties of Fisher information matrices of 2-layer neural ReLU networks with random hidden weights are studied. For these networks, it is known that the eigenvalue distribution highly concentrates on several eigenspaces approximately. In particular, the eigenvalues for the first three eigenspaces account for 97.7% of the trace of the Fisher information matrix, independently of the number of parameters. In this paper, we identify the function spaces which correspond to those major eigenspaces. This function space consists of the spherical harmonic functions whose orders are not greater than 2. This result relates to the Mercer decomposition of the neural tangent kernels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives an explicit map from the top eigenspaces of the Fisher information in random two-layer ReLU nets to spherical harmonics of order at most 2, with a claimed 97.7% trace concentration that holds under the random-weight assumption.

The central contribution is the identification of the dominant eigenspaces with spherical harmonics of order ≤2. The authors start from the known concentration of the Fisher matrix eigenvalues for two-layer ReLU networks whose hidden weights are drawn from a fixed random distribution, then use the Mercer expansion of the associated neural tangent kernel to label those eigenspaces concretely as low-order harmonics. That step supplies a function-space description that was not spelled out in the earlier NTK literature they cite.

Referee Report

2 major / 2 minor

Summary. The manuscript analyzes the Fisher information matrix of two-layer ReLU networks whose hidden-layer weights are drawn from a fixed random distribution. It establishes that the eigenvalue spectrum concentrates on the first three eigenspaces, which together account for 97.7% of the trace independently of the total number of parameters, and identifies the corresponding function spaces with spherical harmonics of order at most 2. The identification is obtained through the Mercer decomposition of the neural tangent kernel induced by the random ReLU network.

Significance. If the stated concentration and harmonic identification hold under the paper's assumptions, the result supplies an explicit, low-dimensional characterization of the dominant directions in the Fisher metric for this simple architecture. This could support reduced-order approximations or analyses of curvature in the random-weight regime and strengthens the link between NTK spectral theory and information geometry for ReLU networks.

major comments (2)

[§3] §3 (main theorem on trace concentration): the claim that the 97.7% trace fraction is independent of the number of parameters is stated without an explicit limit statement. The derivation appears to rely on averaging over the random hidden weights or the infinite-width regime; a finite-width counter-example or a precise statement of the asymptotic regime is needed to support the parameter-count independence asserted in the abstract.
[§4] §4 (identification with spherical harmonics): the mapping of the dominant eigenspaces to harmonics of order ≤2 is obtained via the Mercer kernel of the NTK under random Gaussian or spherical hidden weights. The proof sketch should explicitly verify that the ReLU-induced kernel eigenfunctions coincide with the low-order spherical harmonics on the sphere; without this step the identification remains formal rather than constructive.

minor comments (2)

[Notation] Notation for the Fisher matrix and the NTK should be unified across sections; currently the same symbol appears to be overloaded for the finite-sample and population versions.
[Figure 2] Figure 2 (eigenvalue histogram) would benefit from an inset showing the cumulative trace fraction up to the third eigenspace for several widths to illustrate the claimed independence.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major comment below and have revised the manuscript to incorporate clarifications on the asymptotic regime and to expand the explicit verification in the harmonic identification.

Referee: [§3] §3 (main theorem on trace concentration): the claim that the 97.7% trace fraction is independent of the number of parameters is stated without an explicit limit statement. The derivation appears to rely on averaging over the random hidden weights or the infinite-width regime; a finite-width counter-example or a precise statement of the asymptotic regime is needed to support the parameter-count independence asserted in the abstract.

Authors: We appreciate the referee highlighting the need for precision here. The 97.7% trace concentration is derived exactly in the infinite-width limit m → ∞ (with input dimension d fixed), where the Fisher information reduces to the NTK and higher-degree contributions vanish by orthogonality of spherical harmonics. We have added an explicit limit statement to Theorem 1 and Section 3 in the revision, clarifying that the parameter-count independence holds asymptotically in this regime. Finite-m numerical results in the manuscript already show the fraction remains close to 97.7% for moderate widths, consistent with the limit.
Revision: yes

Referee: [§4] §4 (identification with spherical harmonics): the mapping of the dominant eigenspaces to harmonics of order ≤2 is obtained via the Mercer kernel of the NTK under random Gaussian or spherical hidden weights. The proof sketch should explicitly verify that the ReLU-induced kernel eigenfunctions coincide with the low-order spherical harmonics on the sphere; without this step the identification remains formal rather than constructive.

Authors: We agree that an explicit verification improves clarity. The NTK induced by random ReLU weights is a zonal kernel on the sphere, admitting a Mercer expansion in Legendre polynomials P_k(cos θ), whose associated eigenfunctions are the spherical harmonics of degree k. For the ReLU kernel, the coefficients for odd k ≥ 3 are identically zero and the even-degree coefficients for k ≥ 4 are small but nonzero, which is consistent with degrees ≤ 2 carrying 97.7% of the trace rather than all of it. The revised Section 4 now includes the full expansion and direct verification that the dominant eigenspaces are spanned by harmonics of degree ≤ 2.
Revision: yes
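The Legendre expansion mentioned in this response can be computed directly in the one case where Legendre polynomials are the exact zonal harmonics, namely inputs in R³. The sketch below is illustrative only and rests on assumptions that are not taken from the paper: hidden weights on the unit sphere S², and the order-1 arc-cosine kernel, normalized so that h(1) = 1, standing in for the kernel under discussion. With P_k(1) = 1, the trace share of degree k is (2k+1)/2 · ∫ h(t) P_k(t) dt, evaluated here by Simpson's rule after substituting t = cos θ.

```python
import math

def h_of_theta(theta: float) -> float:
    """Normalized order-1 arc-cosine kernel as a function of the angle."""
    return (math.sin(theta) + (math.pi - theta) * math.cos(theta)) / math.pi

def legendre(k: int, t: float) -> float:
    """Legendre polynomial P_k(t) via the three-term recurrence."""
    p_prev, p = 1.0, t
    if k == 0:
        return p_prev
    for n in range(1, k):
        p_prev, p = p, ((2 * n + 1) * t * p - n * p_prev) / (n + 1)
    return p

def degree_share(k: int, n_steps: int = 2000) -> float:
    """(2k+1)/2 * integral over [0, pi] of h * P_k(cos) * sin, by Simpson's rule."""
    step = math.pi / n_steps
    total = 0.0
    for j in range(n_steps + 1):
        theta = j * step
        weight = 1 if j in (0, n_steps) else (4 if j % 2 else 2)
        total += weight * h_of_theta(theta) * legendre(k, math.cos(theta)) * math.sin(theta)
    return (2 * k + 1) / 2.0 * total * step / 3.0

shares = [degree_share(k) for k in range(5)]
for k, s in enumerate(shares):
    print(f"degree {k}: share {s:.6f}")
print(f"degrees 0-2 combined: {sum(shares[:3]):.4f}")
```

In this d = 3 sketch the degree-3 share vanishes and the degree-4 share is small but positive. The combined low-order share differs from 97.7% because d = 3 is far from the regime the paper's figure refers to.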

Circularity Check

0 steps flagged

No circularity: derivation relies on standard Mercer decomposition of NTK under explicit random-weight assumptions

The paper states that the 97.7% trace concentration is already known for random-hidden-weight two-layer ReLU networks and then identifies the corresponding eigenspaces with spherical harmonics of order ≤2 via the Mercer decomposition of the induced neural tangent kernel. No equation is shown to be equivalent to its own input by construction, no fitted parameter is relabeled as a prediction, and no load-bearing uniqueness theorem or ansatz is imported solely via self-citation. The central identification is a direct consequence of the kernel's eigenfunction properties under the stated randomness and architecture; the result remains falsifiable by direct computation on finite networks and does not reduce to a tautology or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the assumption that hidden weights are random and on the algebraic properties of the ReLU activation; no free parameters are introduced in the abstract, but the 97.7% figure is presented as an observed constant.

axioms (2)

domain assumption: Hidden-layer weights are drawn independently from a rotationally invariant distribution (implicitly standard normal or uniform on the sphere).
  Stated in the abstract as the setting under which the eigenvalue concentration and harmonic identification hold.
domain assumption: The network is exactly two layers with ReLU activations and the output is linear in the final weights.
  The Fisher information matrix is defined for this specific architecture.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Paper passage: "the eigenvalues for the first three eigenspaces account for 97.7% of the trace... spherical harmonic functions whose orders are not greater than 2"

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "J := E[X(x)X^T(x)] ... Mercer decomposition of the neural tangent kernels"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

48 extracted references · 48 canonical work pages

[1] P. L. Bartlett, A. Montanari, and A. Rakhlin, "Deep learning: a statistical viewpoint," Acta Numerica, vol. 30, pp. 87–201, 2021.

[2] C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.

[3] W. Cao, X. Wang, Z. Ming, and J. Gao, "A review on neural networks with random weights," Neurocomputing, vol. 275, pp. 278–287, 2018.

[4] A. Jacot, F. Gabriel, and C. Hongler, "Neural tangent kernel: Convergence and generalization in neural networks," presented at the 32nd Conference on Neural Information Processing Systems, arXiv:1806.07572v3, 2018.

[5] Y. Takeishi, M. Iida, and J. Takeuchi, "Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks," arXiv:2111.15256, 2021.

[6] Y. Takeishi, M. Iida, and J. Takeuchi, "Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks," Neural Networks, vol. 164, pp. 691–706, July 2023.

[7] Y. Takeishi and J. Takeuchi, "Risk Bounds on MDL Estimators for Linear Regression Models with Application to Simple ReLU Neural Networks," arXiv:2407.03854, 2024.

[8] S. Mei and A. Montanari, "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve," Communications on Pure and Applied Mathematics, vol. 75, pp. 667–766, April 2022.
[9] A. Rahimi and B. Recht, "Random Features for Large-Scale Kernel Machines," presented at the 20th Conference on Neural Information Processing Systems, 2007.

Appendix A. Proof of Theorem 3.1 (excerpt)

Let $v^{(0)} := \left(\|W^{(1)}\|/\sqrt{d}, \ldots, \|W^{(m)}\|/\sqrt{d}\right)$, with $W^{(i)} \in \mathbb{R}^d$ for $i = 1, \ldots, m$. First note that for any positive constant $a > 0$, $\sigma(ax) = a\sigma(x)$. By rewriting and applyi...

..., where $B(\cdot,\cdot)$ is the beta function. Hence,

$$
\begin{aligned}
E\left[\sigma(x^\top \hat Z)\right]
&= E\left[\sigma(\|x\| \hat Z_1)\right]
= \|x\|\, E\left[\sigma(\hat Z_1)\right] \\
&= \|x\| \int_{-1}^{1} \sigma(u)\, \frac{(1-u^2)^{\frac{d-1}{2}-1}}{B\left(\frac{d-1}{2}, \frac{1}{2}\right)}\, du
= \|x\| \int_{0}^{1} u\, \frac{(1-u^2)^{\frac{d-1}{2}-1}}{B\left(\frac{d-1}{2}, \frac{1}{2}\right)}\, du \\
&= \|x\| \left[ \frac{-(1-u^2)^{\frac{d-1}{2}}}{(d-1)\, B\left(\frac{d-1}{2}, \frac{1}{2}\right)} \right]_{0}^{1}
= \frac{\|x\|}{(d-1)\, B\left(\frac{d-1}{2}, \frac{1}{2}\right)}
= \frac{1}{2\pi}\, B\left(\frac{d}{2}, \frac{1}{2}\right) \|x\|,
\end{aligned}
$$

where $\sigma(\|x\| \hat Z_1) = \|x\| \sigma(\hat Z_1)$ follows from the non-negativity of $\|x\|$. The second term, $E\left[\|Z\|^2/\sqrt{d}\right] = \sqrt{d}$, is straightforward since $\|Z\|^2$ is chi-squared distributed with $d$ degrees of freedom. Combining gives

$$
X^\top v^{(0)} \xrightarrow{\;p\;} \frac{\sqrt{d}}{2\pi}\, B\left(\frac{d}{2}, \frac{1}{2}\right) \|x\|.
$$

[1] [1]

Deep learning: a statistical viewpoint,

P. L. Bartlett, A. Montanari, and A. Rakhlin, “Deep learning: a statistical viewpoint,”Acta Numerica, vol. 30, pp. 87–201, 2021

work page 2021

[2] [2]

C. M. Bishop, Pattern Recognition and Machine Learning, Springer, 2006

work page 2006

[3] [3]

A review on neural networks with random weights,

Weipeng Cao, Xizhao Wang, Zhong Ming, Jinzhu Gao, “A review on neural networks with random weights,” Neurocomputing, V olume 275, 2018, Pages 278-287

work page 2018

[4] [4]

Neural tangent kernel: Convergence and generalization in neural networks.CoRR, abs/1806.07572, 2018

A. Jacot, F. Gabriel, and C. Hongler, “Neural tangent kernel: Convergence and generalization in neural net- works,” Presented at the 32nd Conference on Neural Information Processing Systems, arXiv:1806.07572v3, 2018

work page arXiv 2018

[5] [5]

Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks,

Y . Takeishi, M. Iida, and J. Takeuchi, “Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks,” arXiv:2111.15256, 2021. 8

work page arXiv 2021

[6] [6]

Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks,

Y . Takeishi, M. Iida, and J. Takeuchi, “Approximate Spectral Decomposition of Fisher Information Matrix for Simple ReLU Networks,”Neural Networks, vol. 164, pp. 691-706, July, 2023

work page 2023

[7] [7]

Risk Bounds on MDL Estimators for Linear Regression Models with Application to Simple ReLU Neural Networks,

Y . Takeishi and J. Takeuchi, “Risk Bounds on MDL Estimators for Linear Regression Models with Application to Simple ReLU Neural Networks,” arXiv:2407.03854, 2024

work page arXiv 2024

[8] [8]

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve,

S. Mei and A. Montanari, "The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve," Communications on Pure and Applied Mathematics, vol. 75, pp. 667-766, April, 2022

work page 2022

[9] [9]

Random Features for Large-Scale Kernel Machines,

A. Rahimi and B. Recht, "Random Features for Large-Scale Kernel Machines," Presented at the 20th Conference on Neural Information Processing Systems, 2007 9 Appendix A Proof of Theorem 3.1 Let v(0) := ∥W (1)∥/ √ d, ..., ∥W (m)∥/ √ d , with W (i) ∈ Rd for i = 1, ..., m. First note that for any positive constant a > 0, σ(ax) = aσ(x). By rewriting and applyi...

work page 2007

[10] [10]

Hence, E h σ(x⊤ ˆZ) i = E h σ ∥x∥ ˆZ1 i = ∥x∥E h σ( ˆZ1) i = ∥x∥ Z 1 −1 σ(u)(1 − u2) d−1 2 −1 B( d−1 2 , 1

, where B(·, ·) is the beta function. Hence, E h σ(x⊤ ˆZ) i = E h σ ∥x∥ ˆZ1 i = ∥x∥E h σ( ˆZ1) i = ∥x∥ Z 1 −1 σ(u)(1 − u2) d−1 2 −1 B( d−1 2 , 1

work page

[11] [11]

du = ∥x∥ Z 1 0 u(1 − u2) d−1 2 −1 B( d−1 2 , 1

work page

[12] [12]

σ(x⊤Z) Z 2 γ ∥Z∥ # . Writing ˆZ := Z/∥Z∥, we may apply the tower rule and then rewrite the expectation as √ d + 2E

du = ∥x∥ " −(1 − u2) d−1 2 (d − 1)B( d−1 2 , 1 2) #1 0 = ∥x∥ (d − 1)B( d−1 2 , 1 2) = 1 2π B d 2 , 1 2 ∥x∥, where σ ∥x∥ ˆZ1 = ∥x∥σ( ˆZ1) follows from the non-negativity of ∥x∥. The second term E ∥Z∥2 √ d = √ d is straightforward since ∥Z∥2 is χ-squared distributed with d degrees of freedom. Combining gives X ⊤v(0) p − → √ d 2π B d 2 , 1 2 ∥x∥. 10 B Proof ...

work page

[13] [13]

du = d √ d + 2|xγ| B( d−1 2 , 1 2) ( u2(1 − u2)(d−1)/2 −(d − 1) 1 0 + Z 1 0 2u(1 − u2)(d−1)/2 d − 1 du ) = d √ d + 2|xγ| B( d−1 2 , 1 2) −2(1 − u2)(d+1)/2 (d − 1)(d + 1) 1 0 = 2d √ d + 2|xγ| (d2 − 1)B( d−1 2 , 1 2) = d √ d + 2 (d + 1)π B( d 2 , 1 2)|xγ|. Using (4) and (6), we obtain 13 X ⊤˜v(γ) p − →1√ 2 d √ d + 2 π(d + 1)B( d 2 , 1 2)|xγ| − r d + 2 d √ d...

work page

[14] [14]

du = Kγ Z 1 0 u(1 − u2) (d−2) 2 −1 B( d−2 2 , 1

work page

[15] [15]

du = Kγ " −(1 − u2) d−2 2 (d − 2)B( d−2 2 , 1 2) #1 0 = Kγ (d − 2)B( d−2 2 , 1

work page

[16] [16]

The original expression then becomes d √ d + 2E ˆZγ h E n σ x⊤ −γ ˆZ−γ ˆZ 2 γ | ˆZγ oi = d √ d + 2∥x−γ∥ (d − 2)B( d−2 2 , 1

work page

[17] [17]

E ˆZ 2 γ q 1 − ˆZ 2γ = d √ d + 2∥x−γ∥ (d − 2)B( d−2 2 , 1 2) B( d 2 , 3 2) B( d−1 2 , 1 2) = d √ d + 2 2π(d + 1)B( d 2 , 1 2)∥x−γ∥. Finally, using (4) and (6), we obtain X ⊤˜v(γ) p − →1√ 2 d √ d + 2 2π(d + 1)B( d 2 , 1 2)∥x−γ∥ − r d + 2 d √ d 2π B( d 2 , 1 2)∥x∥ ! = √ d + 2 2π √ 2 B( d 2 , 1 2)∥x∥ d d + 1 − 1 = − √ d + 2 2π(d + 1) √ 2 B( d 2 , 1 2)∥x∥ as ...

work page

[18] [18]

If 0 ≤ Cγ < 1, then E h σ (Cγ + cos(ϕ)) + σ (−Cγ + cos(ϕ)) |∥ ˆZ−γ∥ i = Z Cγ −Cγ (Cγ + u)(1 − u2) d−4 2 B( d−2 2 , 1

Denoting Cγ := ( |xγ|∥ ˆZγ∥)/(∥x−γ∥∥ ˆZ−γ∥), we then take expectation with respect to cos(ϕ) conditioned on ∥ ˆZ−γ∥. If 0 ≤ Cγ < 1, then E h σ (Cγ + cos(ϕ)) + σ (−Cγ + cos(ϕ)) |∥ ˆZ−γ∥ i = Z Cγ −Cγ (Cγ + u)(1 − u2) d−4 2 B( d−2 2 , 1

work page

[19] [19]

du + 2 Z 1 Cγ u(1 − u2) d−4 2 B( d−2 2 , 1

work page

[20] [20]

du =Cγ Z Cγ −Cγ (1 − u2) d−4 2 B( d−2 2 , 1

work page

[21] [21]

−(1 − u2) d−2 2 (d − 2)B( d−2 2 , 1 2) #Cγ −Cγ + 2

du + " −(1 − u2) d−2 2 (d − 2)B( d−2 2 , 1 2) #Cγ −Cγ + 2 " −(1 − u2) d−2 2 (d − 2)B( d−2 2 , 1 2) #1 Cγ =Cγ Z Cγ −Cγ (1 − u2) d−4 2 B( d−2 2 , 1

work page

[22] [22]

du + 2 (1 − C 2 γ) d−2 2 (d − 2)B( d−2 2 , 1

work page

[23] [23]

ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ + Rγ 1 | ˆZγ| < ∥x−γ∥ ∥x∥ # + E

By Taylor expansion of the second term and the integrand of the first term, we have 1 B( d−2 2 , 1 2) Cγ Z Cγ −Cγ (1 − u2) d−4 2 du + 2(1 − C 2 γ) d−2 2 (d − 2) ! = 1 B( d−2 2 , 1 2) 2 d − 2 + C 2 γ + Rγ, where Rγ = R(Cγ) = O(C 4 γ) (as Cγ tends to 0) is the remainder term. It is important that Rγ is bounded over Cγ ∈ [0, 1), because both LHS and the firs...

work page

[24] [24]

ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ + Rγ 1 | ˆZγ| < ∥x−γ∥ ∥x∥ # + E

Cγ1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # =E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ + Rγ 1 | ˆZγ| < ∥x−γ∥ ∥x∥ # + E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1

work page

[25] [25]

ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 | ˆZγ| < ∥x−γ∥ ∥x∥ # =E

Cγ1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # . 15 Also, as E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 | ˆZγ| < ∥x−γ∥ ∥x∥ # =E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 − 1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # = ∥x−γ∥ 2B( d−2 2 , 1 2)B( d−1 2 , 1 2) 2 d − 2 B( d 2 , 3

work page

[26] [26]

ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # = ∥x−γ∥B( d−2 2 , 5 2) 2B( d−2 2 , 1 2)B( d−1 2 , 1 2) 2 3 + x2 γ ∥x−γ∥2 ! − E

+ x2 γ ∥x−γ∥2 B( d − 2 2 , 5 2) ! − E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # = ∥x−γ∥B( d−2 2 , 5 2) 2B( d−2 2 , 1 2)B( d−1 2 , 1 2) 2 3 + x2 γ ∥x−γ∥2 ! − E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2 2 , 1 2) 2 d − 2 + C 2 γ 1 | ˆZγ| ≥ ∥x−γ∥ ∥x∥ # = ∥x−γ∥B( d 2 , 1 2) 2π(d + 1) 1 + 3x2 γ 2∥x−γ∥2 ! − E " ˆZ 2 γ ∥x−γ∥∥ ˆZ−γ∥ 2B( d−2...

[27]

$$\cdots\,R_\gamma\,\mathbf{1}\left\{|\hat Z_\gamma| < \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\Bigr] + E\left[\frac{\hat Z_\gamma^2\,\|x_{-\gamma}\|\,\|\hat Z_{-\gamma}\|}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)}\left(-\frac{2}{d-2}+C_\gamma-C_\gamma^2\right)\mathbf{1}\left\{|\hat Z_\gamma| \ge \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\right] = O(\|x\|\,r_\gamma^2). \tag{13}$$
Here, it is useful to note that by denoting $r_\gamma := |x_\gamma|^2/\|x\|^2$, then
$$\|x_{-\gamma}\| = \|x\|(1-r_\gamma)^{1/2} = \|x\|\left(1-\frac{r_\gamma}{2}\right)+O(\|x\|\,r_\gamma^2) \quad\text{and}\quad \frac{x_\gamma^2}{\|x_{-\gamma}\|^2} = r_\gamma(1-r_\gamma)^{-1} = r_\gamma+O(r_\gamma^2). \tag{14}$$
First Term: Since $R_\gamma$ is bounded on $C_\gamma \in [0, 1)$, we get by dir...
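The two expansions in (14) are elementary, and a short numerical sketch makes the $O(r_\gamma^2)$ error terms concrete. The function names below are hypothetical, and the norm $\|x\|$ is set to 1 for illustration (an assumption that only rescales the first error term).

```python
import math


def norm_expansion_error(r):
    """(1 - r)^(1/2) - (1 - r/2): error of the first expansion in (14), with ||x|| = 1."""
    return math.sqrt(1 - r) - (1 - r / 2)


def ratio_expansion_error(r):
    """r (1 - r)^(-1) - r: error of the second expansion in (14)."""
    return r / (1 - r) - r


if __name__ == "__main__":
    for r in (0.1, 0.01, 0.001):
        # Both ratios should stay bounded as r -> 0, consistent with O(r^2).
        print(r, norm_expansion_error(r) / r ** 2, ratio_expansion_error(r) / r ** 2)
```

Both errors divided by `r**2` remain bounded as `r` decreases, which is the content of the $O(\cdot)$ terms in (14).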

[28]

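$$\cdots\,R_\gamma\,\mathbf{1}\left\{|\hat Z_\gamma| < \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\Bigr] = O(\|x\|\,r_\gamma^2)$$
Second Term:
$$E\left[\frac{\hat Z_\gamma^2\,\|x_{-\gamma}\|\,\|\hat Z_{-\gamma}\|}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)}\left(-\frac{2}{d-2}\right)\mathbf{1}\left\{|\hat Z_\gamma| \ge \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\right] = -\frac{2\|x_{-\gamma}\|}{(d-2)\,B\left(\frac{d-2}{2},\frac{1}{2}\right)}\int_{\|x_{-\gamma}\|/\|x\|}^{1} u^2\sqrt{1-u^2}\,\frac{(1-u^2)^{(d-3)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$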

[29]

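$$\cdots\,du = -\frac{\|x_{-\gamma}\|}{(d-2)\,B\left(\frac{d-2}{2},\frac{1}{2}\right)}\int_{0}^{x_\gamma^2/\|x\|^2}\sqrt{1-t}\,\frac{t^{(d-2)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$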

[30]

$\cdots\,dt = O(\|x\|\,r_\gamma^{d/2})$.
Third Term:
$$E\Bigl[\frac{\hat Z_\gamma^2\,\|x_{-\gamma}\|\,\|\hat Z_{-\gamma}\|}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)}\,\cdots$$

[31]

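$$\cdots\,C_\gamma\,\mathbf{1}\left\{|\hat Z_\gamma| \ge \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\Bigr] = \frac{|x_\gamma|}{B\left(\frac{d-2}{2},\frac{1}{2}\right)}\int_{\|x_{-\gamma}\|/\|x\|}^{1} u^2\sqrt{1-u^2}\,\frac{u}{\sqrt{1-u^2}}\,\frac{(1-u^2)^{(d-3)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$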

[32]

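$$\cdots\,du = \frac{|x_\gamma|}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)}\int_{0}^{x_\gamma^2/\|x\|^2}(1-t)\,\frac{t^{(d-3)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$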

[33]

$\cdots\,dt = O(\|x\|\,r_\gamma^{d/2})$.
Fourth Term:
$$E\Bigl[\frac{\hat Z_\gamma^2\,\|x_{-\gamma}\|\,\|\hat Z_{-\gamma}\|}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)}\,\cdots$$

[34]

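$$\cdots\,(-C_\gamma^2)\,\mathbf{1}\left\{|\hat Z_\gamma| \ge \frac{\|x_{-\gamma}\|}{\|x\|}\right\}\Bigr] = -\frac{|x_\gamma|^2}{\|x_{-\gamma}\|\,B\left(\frac{d-2}{2},\frac{1}{2}\right)}\int_{\|x_{-\gamma}\|/\|x\|}^{1} u^2\sqrt{1-u^2}\,\frac{u^2}{1-u^2}\,\frac{(1-u^2)^{(d-3)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$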

[35]

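$$\cdots\,du = -\frac{|x_\gamma|^2}{2B\left(\frac{d-2}{2},\frac{1}{2}\right)\|x_{-\gamma}\|}\int_{0}^{x_\gamma^2/\|x\|^2}(1-t)^{3/2}\,\frac{t^{(d-4)/2}}{B\left(\frac{d-1}{2},\frac{1}{2}\right)}\,\cdots$$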

[36]

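$\cdots\,dt = O(\|x\|\,r_\gamma^{d/2})$. Therefore (13) holds for $d \ge 2$. By combining all the results, we obtain
$$X^\top v^{(\gamma,\gamma)} \xrightarrow{p} \frac{d\sqrt{d+2}}{2\pi(d+1)}\,\|x_{-\gamma}\|\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\left(1+\frac{3x_\gamma^2}{2\|x_{-\gamma}\|^2}\right) + \sqrt{2}\,h_\gamma(\|x\|, r_\gamma),$$
where $h_\gamma(\|x\|, r_\gamma) = O(\|x\|\,r_\gamma^2)$. By (4) and (6),
$$X^\top \tilde v^{(\gamma)} \xrightarrow{p} \frac{1}{\sqrt{2}}\left(\frac{d\sqrt{d+2}}{2\pi(d+1)}\,\|x_{-\gamma}\|\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\left(1+\frac{3x_\gamma^2}{2\|x_{-\gamma}\|^2}\right) - \sqrt{\frac{d+2}{d}}\,\frac{\sqrt{d}}{2\pi}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\|x\|\right) + h_\gamma(\|x\|\cdots$$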

[37]

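$$\cdots\cdot\frac{2}{3}\left(x_\alpha\cos^3(\phi)+x_\beta\sin^3(\phi)\right) = \frac{d-2}{6\pi}\,B\left(\tfrac{d-2}{2},\tfrac{5}{2}\right)\left(\frac{x_\alpha x_\beta^3}{\|x\|^3}+\frac{x_\alpha^3 x_\beta}{\|x\|^3}\right) = \frac{d-2}{6\pi}\,B\left(\tfrac{d-2}{2},\tfrac{5}{2}\right)\cdots$$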

[38]

$$\cdots\,\frac{x_\alpha x_\beta}{\|x\|} = \frac{1}{2\pi(d+1)}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$

[39]

$\cdots\,\dfrac{x_\alpha x_\beta}{\|x\|}$. Therefore,
$$X^\top v^{(\alpha,\beta)} \xrightarrow{p} \frac{d\sqrt{d+2}}{2\pi(d+1)}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$

[40]

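$\cdots\,\dfrac{x_\alpha x_\beta}{\|x\|}$.
General case ($x_\alpha, x_\beta, x_{\alpha\beta} \ne 0$): Since the angle $\theta$ between $x_{\alpha\beta}$ and $\hat Z_{\alpha\beta}$ is uniformly distributed on $[-\pi, \pi)$, by considering the expectation conditioned on $\hat Z_{-\alpha\beta}$, we obtain
$$\begin{aligned}
E\left[\sigma\left(x_{\alpha\beta}^\top\hat Z_{\alpha\beta}+x_{-\alpha\beta}^\top\hat Z_{-\alpha\beta}\right)\hat Z_\alpha\hat Z_\beta \,\middle|\, \hat Z_{-\alpha\beta}\right]
&= \|x_{\alpha\beta}\|\,E\left[\sigma\left(\|\hat Z_{\alpha\beta}\|\cos(\theta)+\frac{x_{-\alpha\beta}^\top\hat Z_{-\alpha\beta}}{\|x_{\alpha\beta}\|}\right)\hat Z_\alpha\hat Z_\beta \,\middle|\, \hat Z_{-\alpha\beta}\right] \\
&= \|x_{\alpha\beta}\|\,E\Bigl[\sigma\left(\|\hat Z_{\alpha\beta}\|\cos(\theta)+\frac{x_{-\alpha\beta}^\top\hat Z_{-\alpha\beta}}{\|x_{\alpha\beta}\|}\right)\|\hat Z_{\alpha\beta}\|^2\,\|x\cdots
\end{aligned}$$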

[41]

$\cdots\,du$
When $0 \le C < 1$, via a Taylor expansion, we have
$$\int_{-C}^{C}\left(1-C^{-2}u^2\right)^{3/2}(1-u^2)^{(d-5)/2}\,du = \frac{3\pi C}{8}\left(1-\frac{(d-5)C^2}{12}+R_{\alpha\beta}\right),$$
where $R_{\alpha\beta} = R(C) = O(C^4)$. When $C \ge 1$, we instead have
$$\int_{-1}^{1}\left(1-C^{-2}u^2\right)^{3/2}(1-u^2)^{(d-5)/2}\,du := I(C).$$
Note that $0 < C < 1$ is equivalent to $\|\hat Z_{\alpha\beta}\|^2 < 1-\|x_{\alpha\beta}\|^2/\|x\|^2 := 1-r_{\alpha\beta}$. Also, $\|x_{-\alpha\beta}\| = \|x\|(1-r_{\alpha\beta})^{1/2} = \|x\|\left(1-\frac{r_{\alpha\beta}}{2}\right)+O(...
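The expansion for $0 \le C < 1$ can also be checked numerically. The sketch below is illustrative only: the names `simpson`, `integral_small_c` and `relative_remainder` are hypothetical, and the substitution $u = C\sin\varphi$ is our own choice, made to remove the endpoint singularity of $(1-C^{-2}u^2)^{3/2}$ before applying a composite Simpson rule. It treats the remainder as the relative term inside the bracket, matching how $R_{\alpha\beta}$ appears later in the proof.

```python
import math


def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b]; n must be even."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3


def integral_small_c(c, d):
    """int_{-c}^{c} (1 - u^2/c^2)^(3/2) (1 - u^2)^((d-5)/2) du for 0 < c < 1.

    With u = c sin(phi) this equals
    c * int_{-pi/2}^{pi/2} cos(phi)^4 (1 - c^2 sin(phi)^2)^((d-5)/2) dphi.
    """
    def integrand(phi):
        s = math.sin(phi)
        return math.cos(phi) ** 4 * (1 - c * c * s * s) ** ((d - 5) / 2)
    return c * simpson(integrand, -math.pi / 2, math.pi / 2)


def relative_remainder(c, d):
    """R(c) defined by integral = (3 pi c / 8) (1 - (d-5) c^2 / 12 + R(c))."""
    return integral_small_c(c, d) / (3 * math.pi * c / 8) - (1 - (d - 5) * c * c / 12)


if __name__ == "__main__":
    d = 9  # illustrative dimension (assumption)
    for c in (0.1, 0.2, 0.4):
        # The ratio should stay bounded as c -> 0, consistent with R = O(C^4).
        print(c, relative_remainder(c, d) / c ** 4)
```

For `d = 5` the second factor of the integrand is identically 1, so the remainder vanishes and the integral reduces to $3\pi C/8$, which gives a quick sanity check of the leading constant.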

[42]

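$$E\left[\frac{\|x_{\alpha\beta}\|}{\|x_{-\alpha\beta}\|}\,\frac{\|\hat Z_{\alpha\beta}\|^4}{\sqrt{1-\|\hat Z_{\alpha\beta}\|^2}}\left(1-\frac{d-5}{12}\,\frac{\|x_{\alpha\beta}\|^2}{\|x_{-\alpha\beta}\|^2}\,\frac{\|\hat Z_{\alpha\beta}\|^2}{1-\|\hat Z_{\alpha\beta}\|^2}\right)\right] - \frac{x_\alpha x_\beta}{3\pi\,\|x_{\alpha\beta}\|\,B\left(\frac{d-3}{2},\frac{1}{2}\right)}\,\cdots$$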

[43]

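$$\begin{aligned}
&E\left[\frac{3\pi}{8}\,\|\hat Z_{\alpha\beta}\|^3\,C\left(1-\frac{(d-5)C^2}{12}\right)\mathbf{1}\left\{\|\hat Z_{\alpha\beta}\|^2 \ge 1-r_{\alpha\beta}\right\}\right] + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right) \\
&\quad= \frac{x_\alpha x_\beta}{8\,\|x_{-\alpha\beta}\|\,B\left(\frac{d-3}{2},\frac{1}{2}\right)}\left(\frac{d-2}{2}\,B\left(\tfrac{d-3}{2},3\right)-\frac{d-5}{12}\,B\left(\tfrac{d-5}{2},4\right)r_{\alpha\beta}\right) + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right) \\
&\quad= \frac{d-2}{16\,B\left(\frac{d-3}{2},\frac{1}{2}\right)}\,\cdots
\end{aligned}$$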

[44]

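$$\cdots\,B\left(\tfrac{d-3}{2},3\right)\frac{x_\alpha x_\beta}{\|x_{-\alpha\beta}\|}\left(1-\frac{1}{2}r_{\alpha\beta}\right) + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right) = \frac{1}{2(d+1)\pi}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$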

[45]

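$$\cdots\,\frac{x_\alpha x_\beta}{\|x_{-\alpha\beta}\|}\left(1-\frac{1}{2}r_{\alpha\beta}\right) + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right) = \frac{1}{2(d+1)\pi}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$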

[46]

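$$\cdots\,\frac{x_\alpha x_\beta}{\|x\|}\left(1-\frac{1}{2}r_{\alpha\beta}\right)\left(1+\frac{1}{2}r_{\alpha\beta}\right) + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right) = \frac{1}{2(d+1)\pi}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$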

[47]

$$\cdots\,\frac{x_\alpha x_\beta}{\|x\|} + O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right),$$
where
$$E\left[\frac{3\pi}{8}\,\|\hat Z_{\alpha\beta}\|^3\,C\left(1-\frac{(d-5)C^2}{12}+R_{\alpha\beta}\right)\mathbf{1}\left\{\|\hat Z_{\alpha\beta}\|^2 \ge 1-r_{\alpha\beta}\right\}\right] = O(r_{\alpha\beta}^2)$$
follows from directly integrating, similar to the proof of Theorem 3.3 for $d \ge 6$. Finally, combining these results gives
$$X^\top v^{(\alpha,\beta)} \xrightarrow{p} \frac{d\sqrt{d+2}}{2(d+1)\pi}\,B\left(\tfrac{d}{2},\tfrac{1}{2}\right)\cdots$$

[48]

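$$\cdots\,\frac{x_\alpha x_\beta}{\|x\|} + h_{\alpha\beta}\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|},\,r_{\alpha\beta}\right),$$
where $h_{\alpha\beta}\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|},\,r_{\alpha\beta}\right) = O\left(\frac{x_\alpha x_\beta}{\|x_{\alpha\beta}\|}\,r_{\alpha\beta}^2\right)$.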
