Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models

Michael W. Mahoney; Zhenyu Liao

arxiv: 2506.13139 · v2 · submitted 2025-06-16 · 📊 stat.ML · cs.LG

Random Matrix Theory for Deep Learning: Beyond Eigenvalues of Linear Models

Zhenyu Liao , Michael W. Mahoney This is my paper

Pith reviewed 2026-05-19 09:53 UTC · model grok-4.3

classification 📊 stat.ML cs.LG

keywords random matrix theoryhigh-dimensional asymptoticsdeep neural networksgeneralizationdouble descentproportional regimenonlinear models

0 comments

The pith

A High-dimensional Equivalent unifies random matrix tools to characterize training and generalization in nonlinear deep networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a High-dimensional Equivalent to extend random matrix theory past eigenvalue analysis of linear models. It aims to handle high-dimensional data, nonlinear activations, and arbitrary spectral functionals when dimensions, samples, and parameters are all large and comparable. A sympathetic reader would care because this regime produces scaling laws, double descent, and other behaviors that break classical low-dimensional intuition, yet lack precise theory for deep networks. If the framework works, it supplies explicit formulas for how training loss and test error behave in shallow and deep nonlinear models. The authors demonstrate this on linear models first, then extend to nonlinear shallow networks and finally to deep networks.

Core claim

The paper claims that the High-dimensional Equivalent, which generalizes both the deterministic equivalent and the linear equivalent, allows precise asymptotic characterizations of training and generalization performance for linear models, nonlinear shallow networks, and deep networks in the proportional high-dimensional regime. These characterizations capture phenomena such as scaling laws, double descent, and nonlinear learning dynamics through analysis of generic eigenspectral functionals.

What carries the argument

The High-dimensional Equivalent, a unifying object that systematically addresses high dimensionality, nonlinearity, and generic eigenspectral functionals for networks in the proportional regime.

If this is right

Precise asymptotic formulas for training loss and generalization error become available for both shallow and deep nonlinear networks.
Scaling laws and double-descent phenomena receive explicit dependence on width, depth, and activation type.
Nonlinear learning dynamics can be tracked through the evolution of the relevant spectral functionals rather than through direct simulation.
A unified perspective replaces separate analyses for linear and nonlinear models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same machinery could be applied to study how depth interacts with data geometry beyond the proportional regime.
One could test whether the framework recovers known results for kernel methods when the network becomes infinitely wide.
If the equivalent holds, it might simplify analysis of optimization trajectories in overparameterized models.

Load-bearing premise

That a single equivalent can be derived which continues to control the relevant spectral behavior once nonlinear activations are introduced in the proportional regime.

What would settle it

Empirical training curves or test-error plots for a two-layer ReLU network in the proportional regime that deviate systematically from the formulas derived via the High-dimensional Equivalent.

Figures

Figures reproduced from arXiv: 2506.13139 by Michael W. Mahoney, Zhenyu Liao.

**Figure 2.** Figure 2: Visualization of “concentration” versus “non-concentration” around the mean for [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Different behavior of nonlinear ϕ(fLLN(x)) and ϕ(fCLT(x)) for ϕ(t) = tanh(t) (in blue) in the LLN and CLT regime, with n = 500. We have ϕ(fLLN(x)) ≃ ψLLN(fLLN(x)) in the LLN regime (as a consequence of ϕ(0) = ψLLN(0) = 0) and E[ϕ(fCLT(x))] = E[ψCLT(fCLT(x))] in the CLT regime (as a consequence of aϕ;0 = aψCLT;0 = 0), with different quadratic functions ψLLN(t) = t 2/4 and ψCLT(t) = t 2 − 1 = √ 2He2(t) in re… view at source ↗

**Figure 4.** Figure 4: Mar˘cenko-Pastur law in Equation (22) for different values of [PITH_FULL_IMAGE:figures/full_fig_p012_4.png] view at source ↗

**Figure 5.** Figure 5: Empirical and theoretical risks of the linear regressor [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Empirical and theoretical training and test MSEs of single-hidden-layer NN model [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

read the original abstract

Modern Machine Learning (ML) and Deep Neural Networks (DNNs) often operate on high-dimensional data and rely on overparameterized models, where classical low-dimensional intuitions break down. In particular, the proportional regime where the data dimension, sample size, and number of model parameters are all large and comparable, gives rise to novel and sometimes counterintuitive behaviors. This paper extends traditional Random Matrix Theory (RMT) beyond eigenvalue-based analysis of linear models to address the challenges posed by nonlinear ML models such as DNNs in this regime. We introduce the concept of High-dimensional Equivalent, which unifies and generalizes both Deterministic Equivalent and Linear Equivalent, to systematically address three technical challenges: high dimensionality, nonlinearity, and the need to analyze generic eigenspectral functionals. Leveraging this framework, we provide precise characterizations of the training and generalization performance of linear models, nonlinear shallow networks, and deep networks. Our results capture rich phenomena, including scaling laws, double descent, and nonlinear learning dynamics, offering a unified perspective on the theoretical understanding of deep learning in high dimensions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper introduces a High-dimensional Equivalent to unify RMT analysis for nonlinear deep nets and claims precise characterizations of training and generalization including scaling laws, but the deep network part needs checking for layer correlation issues.

read the letter

The main thing to know is that this paper defines a High-dimensional Equivalent to push random matrix theory past linear eigenvalue analysis and into nonlinear models like deep networks in the proportional regime. It uses this to characterize training and generalization performance for linear models, shallow networks, and deep networks, with the goal of covering scaling laws, double descent, and nonlinear dynamics in one framework. The abstract positions the new concept as a generalization of both deterministic and linear equivalents that can handle high dimensionality, nonlinearity, and generic eigenspectral functionals. That unification is the clearest new element. The paper does a reasonable job framing the technical challenges and showing how the framework applies across model types, which gives a broader view than many prior RMT papers that stay with linear cases. The citations track established work in high-dimensional ML theory without obvious gaps. The soft spot is in the deep network extension. If the derivation works by recursive application of single-layer results, it risks missing cross-layer correlations that come from backpropagation or shared data, and the abstract does not detail how generic functionals are handled when nonlinearity stacks over multiple layers. That is worth a close look in the full derivations, though it is not obviously fatal from what is visible. The work is aimed at theorists who study high-dimensional statistics and scaling in overparameterized models. Someone already working with random matrix methods or double descent would find the most direct value. It is coherent enough on its own terms to deserve peer review rather than a desk reject, even if the deep net claims end up needing substantial revision.

Referee Report

2 major / 2 minor

Summary. The manuscript extends Random Matrix Theory beyond eigenvalue analysis of linear models by introducing the 'High-dimensional Equivalent,' which unifies Deterministic Equivalents and Linear Equivalents. This framework is claimed to systematically handle high dimensionality, nonlinearity, and generic eigenspectral functionals in the proportional regime, enabling precise characterizations of training and generalization performance for linear models, nonlinear shallow networks, and deep networks, while capturing phenomena such as scaling laws, double descent, and nonlinear learning dynamics.

Significance. If the derivations hold without hidden restrictions on layer interactions or activation functions, the work would offer a meaningful unification for analyzing overparameterized nonlinear models, extending RMT tools to deep networks in a way that could support falsifiable predictions and reproducible analyses of generalization in high dimensions.

major comments (2)

[Framework introduction] Framework introduction (abstract and corresponding section): The central claim that the High-dimensional Equivalent systematically derives characterizations for deep networks requires explicit steps showing how cross-layer correlations induced by backpropagation and shared data are captured; if the derivation relies on recursive single-layer application, it risks failing to account for compounded nonlinearity effects on arbitrary eigenspectral functionals, which is load-bearing for the claimed precise performance characterizations.
[Deep network results] Deep network results section: The precise characterizations of training and generalization for deep networks are presented without visible error bounds or validation against the proportional regime assumptions; this undermines the unification claim if the High-dimensional Equivalent reduces to fitted quantities under generic nonlinear settings.

minor comments (2)

[Abstract] Abstract: The phrasing 'precise characterizations' would benefit from a brief note on the scope of functionals treated (e.g., beyond traces or resolvents) to improve clarity for readers.
[Notation and definitions] Notation: Ensure consistent definition of the High-dimensional Equivalent across sections to avoid potential self-referential issues when extending from linear to nonlinear cases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed report. The comments highlight important aspects of our framework's presentation and the need for additional rigor in the deep network analysis. We address each major comment below and indicate the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: Framework introduction (abstract and corresponding section): The central claim that the High-dimensional Equivalent systematically derives characterizations for deep networks requires explicit steps showing how cross-layer correlations induced by backpropagation and shared data are captured; if the derivation relies on recursive single-layer application, it risks failing to account for compounded nonlinearity effects on arbitrary eigenspectral functionals, which is load-bearing for the claimed precise performance characterizations.

Authors: We agree that the handling of cross-layer correlations merits more explicit exposition. The High-dimensional Equivalent is defined to propagate the joint statistics across layers by constructing an equivalent model that preserves the necessary second-order moments and spectral functionals under the proportional regime, thereby incorporating the effects of shared data and backpropagation-induced dependencies. To address the concern directly, we will add a dedicated subsection in the framework introduction that walks through the iterative construction step by step, showing how compounded nonlinearity is captured without reducing to independent single-layer applications. This revision will make the load-bearing assumptions transparent. revision: yes
Referee: Deep network results section: The precise characterizations of training and generalization for deep networks are presented without visible error bounds or validation against the proportional regime assumptions; this undermines the unification claim if the High-dimensional Equivalent reduces to fitted quantities under generic nonlinear settings.

Authors: The characterizations are asymptotic results in the proportional high-dimensional limit. We acknowledge that the current version does not include explicit non-asymptotic error bounds or extensive numerical validation for the deep network case. In the revision we will add a discussion of the convergence rates drawing on existing RMT concentration results, together with a new set of numerical experiments that compare the theoretical predictions against finite-dimensional simulations for both training and generalization error in deep networks. This will clarify that the equivalents are not merely fitted but derived from the high-dimensional analysis. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The abstract and framework description introduce the High-dimensional Equivalent as a generalization of prior Deterministic and Linear Equivalents to handle nonlinearity and generic functionals for deep networks. No equations, self-citations, or derivations are quoted that reduce any claimed prediction or characterization to a fitted input, self-definition, or load-bearing self-citation chain. The performance characterizations for linear models, shallow networks, and deep networks are presented as derived outputs from the framework rather than inputs renamed as results. This qualifies as a self-contained theoretical extension with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The framework rests on extending classical RMT assumptions to nonlinear settings and introducing a new conceptual object whose existence and computability are asserted rather than derived from first principles in the visible text.

axioms (1)

standard math Standard high-dimensional proportional regime assumptions from random matrix theory
Paper builds directly on existing RMT limits where dimension, samples, and parameters are large and comparable.

invented entities (1)

High-dimensional Equivalent no independent evidence
purpose: Unify deterministic and linear equivalents to handle nonlinearity and generic spectral functionals
New concept introduced to address the three stated technical challenges for DNN analysis.

pith-pipeline@v0.9.0 · 5716 in / 1215 out tokens · 33109 ms · 2026-05-19T09:53:52.016840+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Definition 1 (High-dimensional Equivalent) … f(Mϕ(X)) − f(˜Mϕ(X)) → 0 … Mϕ(X) ↔ ˜Mϕ(X)
IndisputableMonolith/Foundation/BranchSelection.lean branch_selection echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Theorem 7 (High-dimensional linearization of kernel matrix) … ˜Kϕ = a²ϕ;0 1n1⊤n + a²ϕ;1 X⊤X + …
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_add / embed_eq_pow echoes

?

echoes
ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

Theorem 8 … CK matrix Kℓ … ˜Kϕ,ℓ = α²ℓ,1 X⊤X + … recursion αℓ,1 = aϕℓ;1 · αℓ−1,1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages

[1]

Potters and J.-P

M. Potters and J.-P. Bouchaud, A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists . Cambridge University Press, 2020

work page 2020
[2]

G. W. Anderson, A. Guionnet, and O. Zeitouni, An Introduction to Random Matrices, ser. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2010, vol. 118

work page 2010
[3]

Bai and J

Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices , 2nd ed., ser. Springer Series in Statistics. Springer-Verlag New York, 2010, vol. 20

work page 2010
[4]

Cleaning large correlation matrices: Tools from Random Matrix Theory,

J. Bun, J.-P. Bouchaud, and M. Potters, “Cleaning large correlation matrices: Tools from Random Matrix Theory,” Physics Reports, vol. 666, pp. 1–109, 2017

work page 2017
[5]

Random matrix theory and wireless communications,

A. M. Tulino and S. Verd´ u, “Random matrix theory and wireless communications,”Foun- dations and Trends® in Communications and Information Theory, vol. 1, no. 1, pp. 1–182, 2004

work page 2004
[6]

Couillet and M

R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cam- bridge University Press, 2011

work page 2011
[7]

Surprises in high-dimensional ridgeless least squares interpolation,

T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” The Annals of Statistics , vol. 50, no. 2, pp. 949–986, Apr. 2022

work page 2022
[8]

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve,

S. Mei and A. Montanari, “The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve,”Communications on Pure and Applied Mathematics, 2021

work page 2021
[9]

Deep Neural Networks as Gaussian Processes,

J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri, “Deep Neural Networks as Gaussian Processes,” in International Conference on Learning Repre- sentations, 2018

work page 2018
[10]

Neural Tangent Kernel: Convergence and Gen- eralization in Neural Networks,

A. Jacot, F. Gabriel, and C. Hongler, “Neural Tangent Kernel: Convergence and Gen- eralization in Neural Networks,” in Advances in Neural Information Processing Systems , vol. 31. Curran Associates, Inc., 2018, pp. 8571–8580

work page 2018
[11]

Benign overfitting in linear regres- sion,

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, “Benign overfitting in linear regres- sion,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 063–30 070, 2020. 21

work page 2020
[12]

Introduction to the Non-Asymptotic Analysis of Random Matrices,

R. Vershynin, “Introduction to the Non-Asymptotic Analysis of Random Matrices,” in Compressed Sensing: Theory and Applications , Y. C. Eldar and G. Kutyniok, Eds. Cam- bridge University Press, 2012, pp. 210–268

work page 2012
[13]

Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,

A. Lytova and L. Pastur, “Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,” The Annals of Probability , vol. 37, no. 5, pp. 1778– 1840, 2009

work page 2009
[14]

CLT for linear spectral statistics of large-dimensional sample covariance matrices,

Z. Bai and J. W. Silverstein, “CLT for linear spectral statistics of large-dimensional sample covariance matrices,” The Annals of Probability , vol. 32, no. 1A, pp. 553–605, 2004

work page 2004
[15]

Matrices al´ eatoires: Statistique asymptotique des valeurs pro- pres,

L. Pastur and A. Lejay, “Matrices al´ eatoires: Statistique asymptotique des valeurs pro- pres,” in S´ eminaire de Probabilit´ es XXXVI, J. Az´ ema, M.´Emery, M. Ledoux, and M. Yor, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, vol. 1801, pp. 135–164

work page 2003
[16]

Distribution of eigenvalues for some sets of random matrices,

V. A. Marcenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSR-Sbornik , vol. 1, no. 4, p. 457, 1967

work page 1967
[17]

Random matrices: Universality of ESDs and the circular law,

T. Tao, V. Vu, and M. Krishnapur, “Random matrices: Universality of ESDs and the circular law,” The Annals of Probability , vol. 38, no. 5, pp. 2023–2065, 2010

work page 2023
[18]

L. A. Pastur and M. Shcherbina, Eigenvalue Distribution of Large Random Matrices , ser. Mathematical Surveys and Monographs. American Mathematical Society, 2011, vol. 171

work page 2011
[19]

The gaussian equivalence of generative models for learning with shallow neural networks,

S. Goldt, B. Loureiro, G. Reeves, F. Krzakala, M. M´ ezard, and L. Zdeborov´ a, “The gaussian equivalence of generative models for learning with shallow neural networks,” in Mathemat- ical and Scientific Machine Learning . PMLR, 2022, pp. 426–471

work page 2022
[20]

Universality laws for high-dimensional learning with random fea- tures,

H. Hu and Y. M. Lu, “Universality laws for high-dimensional learning with random fea- tures,” IEEE Transactions on Information Theory , vol. 69, no. 3, pp. 1932–1964, 2022

work page 1932
[21]

Universality of empirical risk minimization,

A. Montanari and B. N. Saeed, “Universality of empirical risk minimization,” in Conference on Learning Theory. PMLR, 2022, pp. 4310–4312

work page 2022
[22]

The breakdown of gaussian universality in classification of high- dimensional linear factor mixtures,

X. MAI and Z. Liao, “The breakdown of gaussian universality in classification of high- dimensional linear factor mixtures,” in The Thirteenth International Conference on Learn- ing Representations, 2025

work page 2025
[23]

High-dimensional asymptotics of prediction: Ridge regression and classification,

E. Dobriban and S. Wager, “High-dimensional asymptotics of prediction: Ridge regression and classification,” The Annals of Statistics , vol. 46, no. 1, pp. 247–279, 2018

work page 2018
[24]

Reconciling modern machine-learning practice and the classical bias–variance trade-off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019

work page 2019
[25]

A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent,

Z. Liao, R. Couillet, and M. W. Mahoney, “A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent,” in Advances in Neural Information Processing Systems , vol. 33. Curran Associates, Inc., 2020, pp. 13 939–13 950

work page 2020
[26]

A random matrix approach to neural networks,

C. Louart, Z. Liao, and R. Couillet, “A random matrix approach to neural networks,” Annals of Applied Probability, vol. 28, no. 2, pp. 1190–1248, 2018

work page 2018
[27]

C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning , Nov. 2005. 22

work page 2005
[28]

On the similarity between the laplace and Neural Tangent Kernels,

A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri, “On the similarity between the laplace and Neural Tangent Kernels,” in Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2020, pp. 1451–1461

work page 2020
[29]

The Dynamics of Learning: A Random Matrix Approach,

Z. Liao and R. Couillet, “The Dynamics of Learning: A Random Matrix Approach,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 3072–3081

work page 2018
[30]

High-dimensional dynamics of general- ization error in neural networks,

M. S. Advani, A. M. Saxe, and H. Sompolinsky, “High-dimensional dynamics of general- ization error in neural networks,” Neural Networks, vol. 132, pp. 428–446, 2020

work page 2020
[31]

Halting Time is Pre- dictable for Large Models: A Universality Property and Average-Case Analysis,

C. Paquette, B. van Merri¨ enboer, E. Paquette, and F. Pedregosa, “Halting Time is Pre- dictable for Large Models: A Universality Property and Average-Case Analysis,” Founda- tions of Computational Mathematics , vol. 23, no. 2, pp. 597–673, 2023

work page 2023
[32]

“Lossless

L. Gu, Y. Du, Y. Zhang, D. Xie, S. Pu, R. Qiu, and Z. Liao, ““Lossless” Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach,” in Ad- vances in Neural Information Processing Systems , vol. 35. Curran Associates, Inc., 2022, pp. 3774–3787

work page 2022
[33]

Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks,

Z. Fan and Z. Wang, “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks,” in Advances in Neural Information Processing Systems , vol. 33. Curran Associates, Inc., 2020, pp. 7710–7721

work page 2020
[34]

Wide neural networks of any depth evolve as linear models under gradient descent,

J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” Journal of Statistical Mechanics: Theory and Experiment , vol. 2020, no. 12, p. 124002, 2020

work page 2020
[35]

Hanson-Wright inequality and sub-gaussian concentra- tion,

M. Rudelson and R. Vershynin, “Hanson-Wright inequality and sub-gaussian concentra- tion,” Electronic Communications in Probability, vol. 18, no. none, 2013

work page 2013
[36]

Couillet and Z

R. Couillet and Z. Liao, Random Matrix Methods for Machine Learning . Cambridge University Press, 2022

work page 2022
[37]

Kernel spectral clustering of large dimensional data,

R. Couillet and F. Benaych-Georges, “Kernel spectral clustering of large dimensional data,” Electronic Journal of Statistics , vol. 10, no. 1, pp. 1393–1454, 2016

work page 2016
[38]

Covariance discriminative power of kernel clustering meth- ods,

A. Kammoun and R. Couillet, “Covariance discriminative power of kernel clustering meth- ods,” Electronic Journal of Statistics , vol. 17, no. 1, pp. 291–390, Jan. 2023. 23 A Derivation of Theorem 5 We present here the following (heuristic, and thus more accessible) derivation of Theorem 5. To simplify the derivation, we outline the proof of Theorem 5 by d...

work page 2023
[39]

For the scalar functionals of interest f(Q) for Q ∈ Rn×n, the first (and often most natural) deterministic quantity to describe its behavior is the expectation E[f(Q)]

Computing or approximating the expectation of the random matrix Q. For the scalar functionals of interest f(Q) for Q ∈ Rn×n, the first (and often most natural) deterministic quantity to describe its behavior is the expectation E[f(Q)]. In the case of a linear or bilinear functional f(·) as in Definition 2 and Remark 1, this is equal to f(E[Q]). In the cas...

work page
[40]

1 nx⊤ i Q−ixi 1 + 1 nx⊤ i Q−ixi # ≃ 1 − 1 1 + δ(z) , and for xj = C 1 2zj, j ̸= i, that 1 n E[x⊤ i Qxj] = E

Establishing the LLN-type concentration of f(Q) around f( ˜Q). This step often involves concentration inequalities of the form Pr(|f(Q) − f( ˜Q)| ≥ t) ≤ δ(n, t), (45) for some function δ(n, t) that decreases to zero as n → ∞. This can be achieved, e.g., by bounding sequentially the differences f(Q)−f(E[Q]) and f(E[Q])−f( ˜Q). (The latter uses that the two...

work page
[41]

Exponential eigendecay (this is the case for, e.g., RBF kernel [27]) for which we have µK(dt) = αt for some α ∈ (0, 1). In this case, we have θ = 1 α log( n d ) n d + Cα d n , C α = 1 α log π sin(πα) , (64) which is slower than the n−1 rate for linear models in Remark 6, where we use thatR ∞ 0 xαx a+x dx = π sin(πα) · e−aα for α ∈ (0, 1)

work page
[42]

ϕ(ξi)ϕ εijξi + 1 − ε2 ij 2 + O(ε4 ij) ! ξj !# = E[ϕ(ξi)ϕ(ξj)] + E

Polynomial eigendecay (this is the case for, e.g., Mat´ ern kernel [27]) for which we have µK(dt) = t−β for some β > 1. In this case, we have θ = Cβ d n 1+ 1 2−β , C β = 2−β r sin(πβ) π , (65) which is faster than the n−1 rate for linear models in Remark 6, where we use thatR ∞ 0 x1−β 1+x dx = π sin(πβ) for β ∈ (0, 2). In particular, in the case of Harmon...

work page

[1] [1]

Potters and J.-P

M. Potters and J.-P. Bouchaud, A First Course in Random Matrix Theory: for Physicists, Engineers and Data Scientists . Cambridge University Press, 2020

work page 2020

[2] [2]

G. W. Anderson, A. Guionnet, and O. Zeitouni, An Introduction to Random Matrices, ser. Cambridge Studies in Advanced Mathematics. Cambridge University Press, 2010, vol. 118

work page 2010

[3] [3]

Bai and J

Z. Bai and J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices , 2nd ed., ser. Springer Series in Statistics. Springer-Verlag New York, 2010, vol. 20

work page 2010

[4] [4]

Cleaning large correlation matrices: Tools from Random Matrix Theory,

J. Bun, J.-P. Bouchaud, and M. Potters, “Cleaning large correlation matrices: Tools from Random Matrix Theory,” Physics Reports, vol. 666, pp. 1–109, 2017

work page 2017

[5] [5]

Random matrix theory and wireless communications,

A. M. Tulino and S. Verd´ u, “Random matrix theory and wireless communications,”Foun- dations and Trends® in Communications and Information Theory, vol. 1, no. 1, pp. 1–182, 2004

work page 2004

[6] [6]

Couillet and M

R. Couillet and M. Debbah, Random Matrix Methods for Wireless Communications. Cam- bridge University Press, 2011

work page 2011

[7] [7]

Surprises in high-dimensional ridgeless least squares interpolation,

T. Hastie, A. Montanari, S. Rosset, and R. J. Tibshirani, “Surprises in high-dimensional ridgeless least squares interpolation,” The Annals of Statistics , vol. 50, no. 2, pp. 949–986, Apr. 2022

work page 2022

[8] [8]

The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve,

S. Mei and A. Montanari, “The Generalization Error of Random Features Regression: Precise Asymptotics and the Double Descent Curve,”Communications on Pure and Applied Mathematics, 2021

work page 2021

[9] [9]

Deep Neural Networks as Gaussian Processes,

J. Lee, J. Sohl-dickstein, J. Pennington, R. Novak, S. Schoenholz, and Y. Bahri, “Deep Neural Networks as Gaussian Processes,” in International Conference on Learning Repre- sentations, 2018

work page 2018

[10] [10]

Neural Tangent Kernel: Convergence and Gen- eralization in Neural Networks,

A. Jacot, F. Gabriel, and C. Hongler, “Neural Tangent Kernel: Convergence and Gen- eralization in Neural Networks,” in Advances in Neural Information Processing Systems , vol. 31. Curran Associates, Inc., 2018, pp. 8571–8580

work page 2018

[11] [11]

Benign overfitting in linear regres- sion,

P. L. Bartlett, P. M. Long, G. Lugosi, and A. Tsigler, “Benign overfitting in linear regres- sion,” Proceedings of the National Academy of Sciences, vol. 117, no. 48, pp. 30 063–30 070, 2020. 21

work page 2020

[12] [12]

Introduction to the Non-Asymptotic Analysis of Random Matrices,

R. Vershynin, “Introduction to the Non-Asymptotic Analysis of Random Matrices,” in Compressed Sensing: Theory and Applications , Y. C. Eldar and G. Kutyniok, Eds. Cam- bridge University Press, 2012, pp. 210–268

work page 2012

[13] [13]

Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,

A. Lytova and L. Pastur, “Central limit theorem for linear eigenvalue statistics of random matrices with independent entries,” The Annals of Probability , vol. 37, no. 5, pp. 1778– 1840, 2009

work page 2009

[14] [14]

CLT for linear spectral statistics of large-dimensional sample covariance matrices,

Z. Bai and J. W. Silverstein, “CLT for linear spectral statistics of large-dimensional sample covariance matrices,” The Annals of Probability , vol. 32, no. 1A, pp. 553–605, 2004

work page 2004

[15] [15]

Matrices al´ eatoires: Statistique asymptotique des valeurs pro- pres,

L. Pastur and A. Lejay, “Matrices al´ eatoires: Statistique asymptotique des valeurs pro- pres,” in S´ eminaire de Probabilit´ es XXXVI, J. Az´ ema, M.´Emery, M. Ledoux, and M. Yor, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, vol. 1801, pp. 135–164

work page 2003

[16] [16]

Distribution of eigenvalues for some sets of random matrices,

V. A. Marcenko and L. A. Pastur, “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSR-Sbornik , vol. 1, no. 4, p. 457, 1967

work page 1967

[17] [17]

Random matrices: Universality of ESDs and the circular law,

T. Tao, V. Vu, and M. Krishnapur, “Random matrices: Universality of ESDs and the circular law,” The Annals of Probability , vol. 38, no. 5, pp. 2023–2065, 2010

work page 2023

[18] [18]

L. A. Pastur and M. Shcherbina, Eigenvalue Distribution of Large Random Matrices , ser. Mathematical Surveys and Monographs. American Mathematical Society, 2011, vol. 171

work page 2011

[19] [19]

The gaussian equivalence of generative models for learning with shallow neural networks,

S. Goldt, B. Loureiro, G. Reeves, F. Krzakala, M. M´ ezard, and L. Zdeborov´ a, “The gaussian equivalence of generative models for learning with shallow neural networks,” in Mathemat- ical and Scientific Machine Learning . PMLR, 2022, pp. 426–471

work page 2022

[20] [20]

Universality laws for high-dimensional learning with random fea- tures,

H. Hu and Y. M. Lu, “Universality laws for high-dimensional learning with random fea- tures,” IEEE Transactions on Information Theory , vol. 69, no. 3, pp. 1932–1964, 2022

work page 1932

[21] [21]

Universality of empirical risk minimization,

A. Montanari and B. N. Saeed, “Universality of empirical risk minimization,” in Conference on Learning Theory. PMLR, 2022, pp. 4310–4312

work page 2022

[22] [22]

The breakdown of gaussian universality in classification of high- dimensional linear factor mixtures,

X. MAI and Z. Liao, “The breakdown of gaussian universality in classification of high- dimensional linear factor mixtures,” in The Thirteenth International Conference on Learn- ing Representations, 2025

work page 2025

[23] [23]

High-dimensional asymptotics of prediction: Ridge regression and classification,

E. Dobriban and S. Wager, “High-dimensional asymptotics of prediction: Ridge regression and classification,” The Annals of Statistics , vol. 46, no. 1, pp. 247–279, 2018

work page 2018

[24] [24]

Reconciling modern machine-learning practice and the classical bias–variance trade-off,

M. Belkin, D. Hsu, S. Ma, and S. Mandal, “Reconciling modern machine-learning practice and the classical bias–variance trade-off,” Proceedings of the National Academy of Sciences, vol. 116, no. 32, pp. 15 849–15 854, 2019

work page 2019

[25] [25]

A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent,

Z. Liao, R. Couillet, and M. W. Mahoney, “A random matrix analysis of random fourier features: beyond the gaussian kernel, a precise phase transition, and the corresponding double descent,” in Advances in Neural Information Processing Systems , vol. 33. Curran Associates, Inc., 2020, pp. 13 939–13 950

work page 2020

[26] [26]

A random matrix approach to neural networks,

C. Louart, Z. Liao, and R. Couillet, “A random matrix approach to neural networks,” Annals of Applied Probability, vol. 28, no. 2, pp. 1190–1248, 2018

work page 2018

[27] [27]

C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning , Nov. 2005. 22

work page 2005

[28] [28]

On the similarity between the laplace and Neural Tangent Kernels,

A. Geifman, A. Yadav, Y. Kasten, M. Galun, D. Jacobs, and R. Basri, “On the similarity between the laplace and Neural Tangent Kernels,” in Proceedings of the 34th International Conference on Neural Information Processing Systems. Curran Associates Inc., 2020, pp. 1451–1461

work page 2020

[29] [29]

The Dynamics of Learning: A Random Matrix Approach,

Z. Liao and R. Couillet, “The Dynamics of Learning: A Random Matrix Approach,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 2018, pp. 3072–3081

work page 2018

[30] [30]

High-dimensional dynamics of general- ization error in neural networks,

M. S. Advani, A. M. Saxe, and H. Sompolinsky, “High-dimensional dynamics of general- ization error in neural networks,” Neural Networks, vol. 132, pp. 428–446, 2020

work page 2020

[31] [31]

Halting Time is Pre- dictable for Large Models: A Universality Property and Average-Case Analysis,

C. Paquette, B. van Merri¨ enboer, E. Paquette, and F. Pedregosa, “Halting Time is Pre- dictable for Large Models: A Universality Property and Average-Case Analysis,” Founda- tions of Computational Mathematics , vol. 23, no. 2, pp. 597–673, 2023

work page 2023

[32] [32]

“Lossless

L. Gu, Y. Du, Y. Zhang, D. Xie, S. Pu, R. Qiu, and Z. Liao, ““Lossless” Compression of Deep Neural Networks: A High-dimensional Neural Tangent Kernel Approach,” in Ad- vances in Neural Information Processing Systems , vol. 35. Curran Associates, Inc., 2022, pp. 3774–3787

work page 2022

[33] [33]

Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks,

Z. Fan and Z. Wang, “Spectra of the Conjugate Kernel and Neural Tangent Kernel for linear-width neural networks,” in Advances in Neural Information Processing Systems , vol. 33. Curran Associates, Inc., 2020, pp. 7710–7721

work page 2020

[34] [34]

Wide neural networks of any depth evolve as linear models under gradient descent,

J. Lee, L. Xiao, S. S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington, “Wide neural networks of any depth evolve as linear models under gradient descent,” Journal of Statistical Mechanics: Theory and Experiment , vol. 2020, no. 12, p. 124002, 2020

work page 2020

[35] [35]

Hanson-Wright inequality and sub-gaussian concentra- tion,

M. Rudelson and R. Vershynin, “Hanson-Wright inequality and sub-gaussian concentra- tion,” Electronic Communications in Probability, vol. 18, no. none, 2013

work page 2013

[36] [36]

Couillet and Z

R. Couillet and Z. Liao, Random Matrix Methods for Machine Learning . Cambridge University Press, 2022

work page 2022

[37] [37]

Kernel spectral clustering of large dimensional data,

R. Couillet and F. Benaych-Georges, “Kernel spectral clustering of large dimensional data,” Electronic Journal of Statistics , vol. 10, no. 1, pp. 1393–1454, 2016

work page 2016

[38] [38]

Covariance discriminative power of kernel clustering meth- ods,

A. Kammoun and R. Couillet, “Covariance discriminative power of kernel clustering meth- ods,” Electronic Journal of Statistics , vol. 17, no. 1, pp. 291–390, Jan. 2023. 23 A Derivation of Theorem 5 We present here the following (heuristic, and thus more accessible) derivation of Theorem 5. To simplify the derivation, we outline the proof of Theorem 5 by d...

work page 2023

[39] [39]

For the scalar functionals of interest f(Q) for Q ∈ Rn×n, the first (and often most natural) deterministic quantity to describe its behavior is the expectation E[f(Q)]

Computing or approximating the expectation of the random matrix Q. For the scalar functionals of interest f(Q) for Q ∈ Rn×n, the first (and often most natural) deterministic quantity to describe its behavior is the expectation E[f(Q)]. In the case of a linear or bilinear functional f(·) as in Definition 2 and Remark 1, this is equal to f(E[Q]). In the cas...

work page

[40] [40]

1 nx⊤ i Q−ixi 1 + 1 nx⊤ i Q−ixi # ≃ 1 − 1 1 + δ(z) , and for xj = C 1 2zj, j ̸= i, that 1 n E[x⊤ i Qxj] = E

Establishing the LLN-type concentration of f(Q) around f( ˜Q). This step often involves concentration inequalities of the form Pr(|f(Q) − f( ˜Q)| ≥ t) ≤ δ(n, t), (45) for some function δ(n, t) that decreases to zero as n → ∞. This can be achieved, e.g., by bounding sequentially the differences f(Q)−f(E[Q]) and f(E[Q])−f( ˜Q). (The latter uses that the two...

work page

[41] [41]

Exponential eigendecay (this is the case for, e.g., RBF kernel [27]) for which we have µK(dt) = αt for some α ∈ (0, 1). In this case, we have θ = 1 α log( n d ) n d + Cα d n , C α = 1 α log π sin(πα) , (64) which is slower than the n−1 rate for linear models in Remark 6, where we use thatR ∞ 0 xαx a+x dx = π sin(πα) · e−aα for α ∈ (0, 1)

work page

[42] [42]

ϕ(ξi)ϕ εijξi + 1 − ε2 ij 2 + O(ε4 ij) ! ξj !# = E[ϕ(ξi)ϕ(ξj)] + E

Polynomial eigendecay (this is the case for, e.g., Mat´ ern kernel [27]) for which we have µK(dt) = t−β for some β > 1. In this case, we have θ = Cβ d n 1+ 1 2−β , C β = 2−β r sin(πβ) π , (65) which is faster than the n−1 rate for linear models in Remark 6, where we use thatR ∞ 0 x1−β 1+x dx = π sin(πβ) for β ∈ (0, 2). In particular, in the case of Harmon...

work page