Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

Behrad Moniri; Hamed Hassani

arxiv: 2605.17767 · v1 · pith:JGKG3E4Cnew · submitted 2026-05-18 · 📊 stat.ML · cs.LG

Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3

keywords: feature learning · two-layer networks · gradient descent · linear-width regime · spiked random matrix · batch reuse · information exponent

The pith

In the linear-width regime, the second gradient step on two-layer networks produces weights that act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper analyzes feature learning in two-layer networks when the number of hidden units, the sample size, and the input dimension all scale proportionally. A single gradient step yields an approximately rank-one update that captures one direction and requires the target to have information exponent one. With two steps whose sizes scale as eta1 ~ N^alpha1 and eta2 ~ N^alpha2, the weight matrix acquires a spiked spectrum whose outlier count equals floor(alpha2 / (1/2 - alpha1)). Reusing the same batch across steps lets the second update align with directions whose information exponent exceeds one, provided alpha1 and alpha2 are chosen suitably, whereas independent batches restrict learning to information exponent one.

Core claim

The weights after the second gradient step behave as a spiked random matrix with multiple outliers, the number of which is given by floor(alpha2 / (1/2 - alpha1)), and batch reuse enables the second update to capture directions with information exponent exceeding one when alpha1 and alpha2 are chosen appropriately.
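The outlier-count formula in the claim above is simple enough to evaluate directly. The following is a minimal sketch: the function name `outlier_count` is hypothetical, and exact rational arithmetic is used only to avoid floating-point rounding at the floor boundaries.

```python
from fractions import Fraction
from math import floor

def outlier_count(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)) with exact rational arithmetic.

    The name and interface are illustrative, not taken from the paper.
    """
    a1, a2 = Fraction(str(alpha1)), Fraction(str(alpha2))
    half = Fraction(1, 2)
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("alpha1 and alpha2 must lie in [0, 0.5)")
    return floor(a2 / (half - a1))

print(outlier_count(0.3, 0.4))   # 2
print(outlier_count(0.0, 0.4))   # 0
print(outlier_count(0.45, 0.4))  # 8
```

The first call uses the exponents quoted in the Figure 3 caption (alpha1 = 0.3, alpha2 = 0.4), for which the formula gives two outliers.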

What carries the argument

The spectral characterization of the updated weights as a spiked random matrix, with outlier count controlled by the size of alpha2 relative to 1/2 - alpha1.

If this is right

The number of learned directions grows with the ratio of the second step-size exponent alpha2 to the gap 1/2 - alpha1 left by the first step-size exponent.
Batch reuse produces a qualitative improvement over independent batches by unlocking directions with information exponent larger than one.
Early training dynamics in overparameterized networks admit a precise spectral description once the linear-width scaling and step-size powers are fixed.
The same scaling regime supplies a tractable limit for studying how optimization moves from random initialization to feature alignment.
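The notion of alignment behind the batch-reuse claim above can be made concrete with a small helper. This is a sketch under stated assumptions: `top_right_alignment` is a hypothetical name, alignment is taken to be the absolute cosine between a right singular vector and the target direction, and the planted rank-one matrix is a generic toy model rather than the paper's trained weights.

```python
import numpy as np

def top_right_alignment(W, beta, k=1):
    """Absolute cosines between the top-k right singular vectors of W and beta."""
    beta = np.asarray(beta, dtype=float)
    beta = beta / np.linalg.norm(beta)
    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(W, full_matrices=False)
    return np.abs(Vt[:k] @ beta)

rng = np.random.default_rng(1)
N, d = 150, 80
beta = rng.standard_normal(d)
beta /= np.linalg.norm(beta)
u = rng.standard_normal(N)
u /= np.linalg.norm(u)

noise = rng.standard_normal((N, d)) / np.sqrt(d)   # pure-noise matrix
W_planted = noise + 10.0 * np.outer(u, beta)       # noise plus a strong planted spike

print(top_right_alignment(W_planted, beta))  # expected to be close to 1
print(top_right_alignment(noise, beta))      # expected to be small
```

Applied to weights after one or two gradient steps, the same helper would give one way to compare reused and independent batches.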

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Extending the same analysis to three or more steps would predict how many additional outliers appear under continued power-law step sizes.
The batch-reuse distinction may generalize to deeper architectures if analogous step-size scalings are applied layer-wise.
Finite-width simulations with moderate proportionality constants could directly count outliers and test whether the floor formula remains predictive before the asymptotic limit.
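Counting outliers in a finite-size spectrum, as the last point suggests, needs a threshold that separates spikes from the bulk. The sketch below is one hedged way to do it: the helper names, the 10% relative margin, and the use of the pure-Gaussian bulk edge 1 + sqrt(N/d) are all assumptions made for illustration, and the test matrix is a generic spiked Gaussian model rather than trained network weights.

```python
import numpy as np

def count_outliers(singular_values, bulk_edge, rel_margin=0.1):
    """Count singular values exceeding bulk_edge by more than a relative margin."""
    s = np.asarray(singular_values, dtype=float)
    return int(np.sum(s > (1.0 + rel_margin) * bulk_edge))

def spiked_gaussian(N, d, spikes, rng):
    """Gaussian noise (entry variance 1/d) plus rank-one terms theta * u v^T."""
    W = rng.standard_normal((N, d)) / np.sqrt(d)
    for theta in spikes:
        u = rng.standard_normal(N)
        u /= np.linalg.norm(u)
        v = rng.standard_normal(d)
        v /= np.linalg.norm(v)
        W += theta * np.outer(u, v)
    return W

rng = np.random.default_rng(0)
N, d = 200, 100
# Asymptotic largest singular value of the pure-noise part for this normalization.
edge = 1.0 + np.sqrt(N / d)

W = spiked_gaussian(N, d, spikes=[10.0, 8.0], rng=rng)
s = np.linalg.svd(W, compute_uv=False)
print(count_outliers(s, edge))
```

For trained weights the bulk is generally not the Gaussian one, so the bulk edge would have to be estimated from the spectrum or taken from the paper's theory.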

Load-bearing premise

The derivation assumes the linear-width regime in which hidden neurons, sample size, and input dimension scale proportionally, together with power-law step-size scalings for the two updates.

What would settle it

Compute the eigenvalues of the weight matrix after exactly two scaled gradient steps on synthetic data with proportional dimensions and verify whether the number of large outliers matches the floor formula while their alignment with the target differs between reused and independent batches.
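A rough shape for such an experiment is sketched below. Everything beyond the abstract and the Figure 3 caption is an assumption: the squared loss, the 1/sqrt(N) output scaling, the Gaussian initialization, the fixed second layer `a`, the unnormalized Hermite polynomial z^3 - 3z standing in for H3, the plain step sizes N^alpha without extra constants, and the small dimensions. It is not the paper's exact parameterization, so it should not be expected to reproduce the predicted outlier count as written.

```python
import numpy as np

def hermite3(z):
    # Probabilists' Hermite polynomial He_3(z) = z^3 - 3z (assumed convention).
    return z**3 - 3.0 * z

def make_batch(n, d, beta, rng):
    X = rng.standard_normal((n, d))
    return X, hermite3(X @ beta)

def grad_W(W, a, X, y):
    """Gradient of (1/2n) * sum_i (y_i - f(x_i))^2 with f(x) = a^T tanh(W x) / sqrt(N)."""
    N, n = W.shape[0], X.shape[0]
    pre = W @ X.T                        # (N, n) pre-activations
    f = a @ np.tanh(pre) / np.sqrt(N)    # (n,) network outputs
    resid = y - f
    dact = 1.0 - np.tanh(pre) ** 2       # derivative of tanh
    return -((a[:, None] * dact) * resid[None, :]) @ X / (n * np.sqrt(N))

def two_step_weights(n, d, N, alpha1, alpha2, reuse_batch, seed=0):
    rng = np.random.default_rng(seed)
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)
    W0 = rng.standard_normal((N, d)) / np.sqrt(d)
    a = rng.standard_normal(N)
    X1, y1 = make_batch(n, d, beta, rng)
    # Reused batch: the second step sees the same data; otherwise draw fresh data.
    X2, y2 = (X1, y1) if reuse_batch else make_batch(n, d, beta, rng)
    W1 = W0 - N**alpha1 * grad_W(W0, a, X1, y1)
    W2 = W1 - N**alpha2 * grad_W(W1, a, X2, y2)
    return W1, W2, beta

W1, W2, beta = two_step_weights(n=400, d=200, N=300,
                                alpha1=0.3, alpha2=0.4, reuse_batch=True)
s2 = np.linalg.svd(W2, compute_uv=False)
print(s2[:5])  # leading singular values of the twice-updated weights
```

Switching `reuse_batch` to `False` gives the independent-batch comparison; the singular values and the singular vectors' overlap with `beta` are the quantities the test above would inspect.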

Figures

Figures reproduced from arXiv: 2605.17767 by Behrad Moniri, Hamed Hassani.

**Figure 2.** The number of different directions $\Lambda(\alpha_1, \alpha_2)$ learned by the second step of gradient descent, as a function of $\alpha_1, \alpha_2 \in [0, 0.5)$. This theorem shows that after the second gradient descent update, the weights are in the Bulk+Spikes phase of training from [MM21], and that the dynamics closely match the early-phase training dynamics empirically characterized in prior work (see also [MPM21, WES+23]). Us…

**Figure 3.** The histogram of the singular values of $W_1$ and $W_2$ with $\alpha_2 = 0.4$, $\alpha_1 = 0.3$. In this setup, we consider the reused batch setting and set $M = 1$ with $g_1(z) = H_3(z)$, $\sigma(z) = \tanh(z)$, $n = 40 \times 10^3$, $d_X = 14 \times 10^3$ and $N = 20 \times 10^3$. The vertical dashed lines correspond to the outliers and the vertical blue line is the top non-outlying singular value. As predicted by Theorem 4, the spectrum of $W_2$ contains two outly…

**Figure 4.** The alignment of right singular vectors of the weight matrix with the target direction in …

**Figure 5.** Inner product alignment $\beta_\star^\top \hat{\beta}_{2;2}$ as a function of dX/ξ2n. Simulation 2. In this setting, we consider $n = 6 \times 10^3$ and vary $d_X$ between $600$ and $12 \times 10^3$. We set $M = 1$ with $y = H_3(X\beta_\star)$. We compute $\beta_\star^\top \hat{\beta}_{2;2}$ and average it for each value of $d_X$ over 100 trials. We compare the simulation results with the theoretical predictions of Theorem 6.

Abstract

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0,0.5)$. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of these outliers is determined by the scaling parameters $\alpha_1$ and $\alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between these learned directions and the target function, we identify a qualitative gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, under the condition that $\alpha_1, \alpha_2$ are chosen properly. This confirms that the benefits of batch reuse, previously observed in finite-width regimes, persist in the high-dimensional linear-width limit. By characterizing these early-phase spectral transitions, our work establishes a tractable mathematical framework for studying optimization and feature learning phenomenology in modern overparameterized networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Two GD steps yield a precise outlier count floor(alpha2/(1/2-alpha1)) and let batch reuse reach info exponents above one in the linear-width limit.

The paper's main contribution is a clean extension of one-step feature learning results to two gradient steps. It shows that the updated weights act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)), and that reusing the same batch lets the second step align with directions whose information exponent exceeds one, provided alpha1 and alpha2 are chosen suitably. With independent batches, learning stays limited to directions with information exponent one. These distinctions hold in the linear-width regime, where width, sample size, and input dimension grow proportionally and the step sizes scale as eta1 ~ N^alpha1 and eta2 ~ N^alpha2 with alpha1, alpha2 in [0, 0.5).

Referee Report

1 major / 2 minor

Summary. The paper examines feature learning in two-layer neural networks in the linear-width regime, where hidden neurons, sample size, and input dimension scale proportionally. It characterizes the weights after a second gradient descent step with step sizes scaling as N^alpha1 and N^alpha2 (alpha1, alpha2 in [0, 0.5)), showing that the updated weights act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)). It further identifies a qualitative difference in alignment with the target function depending on whether batches are independent or reused, with reuse enabling capture of directions having information exponent greater than one under suitable choices of alpha1 and alpha2.

Significance. If the asymptotic spectral results hold, the work supplies a precise mathematical framework for early-phase feature learning beyond single-step updates, with an explicit dependence of the number of learned directions on the step-size exponents and a clear distinction for batch reuse. This extends prior one-step analyses in a controlled high-dimensional limit and could inform understanding of optimization phenomenology in overparameterized networks.

major comments (1)

The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.

minor comments (2)

Clarify the precise definition of the information exponent early in the introduction, as its usage in the batch-reuse comparison is central to the qualitative gap claimed.
The step-size regime alpha1, alpha2 in [0, 0.5) is stated without discussion of boundary behavior at 0.5; a brief remark on why the upper limit is strict would aid readability.
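
As a worked aside on the second minor comment, part of the boundary behavior can be read off the floor formula itself. The block below is an editorial illustration, not the paper's derivation, and the shorthand K is introduced here only for convenience.

```latex
% Shorthand K is introduced for illustration; it is not the paper's notation.
\[
  K(\alpha_1,\alpha_2) = \left\lfloor \frac{\alpha_2}{1/2-\alpha_1} \right\rfloor,
  \qquad \alpha_1,\alpha_2 \in [0, 1/2).
\]
% At least one outlier is predicted exactly when the exponents sum to at least 1/2:
\[
  K(\alpha_1,\alpha_2) \ge 1
  \iff \alpha_2 \ge \tfrac12 - \alpha_1
  \iff \alpha_1 + \alpha_2 \ge \tfrac12 .
\]
% For fixed \alpha_2 > 0 the denominator vanishes at the excluded endpoint,
% so the count is unbounded as \alpha_1 approaches 1/2 from below:
\[
  \lim_{\alpha_1 \uparrow 1/2} K(\alpha_1,\alpha_2) = \infty .
\]
```

The formula alone therefore already degenerates at alpha1 = 1/2; whether other parts of the analysis also require the strict upper limit is a question for the authors.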

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive evaluation and constructive feedback. We address the single major comment below.

Referee: The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.

Authors: We agree that the perturbation analysis underlying the outlier count is central and that its current presentation could be made more self-contained for verifiability. In the revised manuscript we will add a dedicated subsection (placed after the statement of the main spectral result) that derives the floor(alpha2 / (1/2 - alpha1)) count explicitly from the linear-width scaling, the two-step gradient update, and the associated spiked random-matrix perturbation. The derivation will collect the key intermediate lemmas on the covariance structure and eigenvalue perturbation that are currently distributed across the proofs.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained asymptotic analysis

The paper derives the spiked random matrix behavior and outlier count floor(alpha2 / (1/2 - alpha1)) via perturbation analysis and high-dimensional limits in the linear-width regime, with step-size scalings eta1 ~ N^alpha1 and eta2 ~ N^alpha2. This is a mathematical characterization from random matrix theory applied to the two-step gradient updates, not a fit to data or a quantity defined circularly from the outputs. The batch-reuse vs. independent-batch distinction follows from analyzing alignments with the target function under the stated scalings, yielding independent content on information exponents >1. No load-bearing self-citations, self-definitional steps, or renamed known results appear in the central claims; the analysis is externally grounded in asymptotic techniques rather than reducing to the paper's own inputs by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on the linear-width scaling assumption and the specific power-law step-size regimes; these are domain assumptions standard in high-dimensional neural-network theory rather than new entities or fitted constants.

free parameters (1)

alpha1 and alpha2
Exponents in [0, 0.5) that set the scaling of the two step sizes; they directly determine the number of outliers via the floor formula.

axioms (1)

Domain assumption. Linear-width regime: hidden neurons, sample size, and input dimension all scale proportionally with N.
Invoked throughout the abstract as the setting in which the spectral characterization holds.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear

Paper passage: number of these outliers is determined by the scaling parameters alpha1 and alpha2 through floor(alpha2/(1/2-alpha1))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page 2013

[18] [18]

, author=

WONDER: Weighted One-shot Distributed Ridge Regression in High Dimensions. , author=. Journal of Machine Learning Research , volume=

work page

[19] [19]

2001 , publisher=

Statistical mechanics of learning , author=. 2001 , publisher=

work page 2001

[20] [20]

Neural Networks and Spin Glasses , pages=

Statistical theory of learning a rule , author=. Neural Networks and Spin Glasses , pages=. 1990 , publisher=

work page 1990

[21] [21]

Journal of Physics A: Mathematical and General , volume=

Phase transitions in simple learning , author=. Journal of Physics A: Mathematical and General , volume=. 1989 , publisher=

work page 1989

[22] [22]

Journal of Physics A: Mathematical and General , volume=

Finite-size effects and optimal test set size in linear perceptrons , author=. Journal of Physics A: Mathematical and General , volume=. 1995 , publisher=

work page 1995

[23] [23]

Neural Networks , volume=

Stochastic linear learning: Exact test and training error averages , author=. Neural Networks , volume=. 1993 , publisher=

work page 1993

[24] [24]

Journal of Physics A: Mathematical and General , volume=

On the ability of the optimal perceptron to generalise , author=. Journal of Physics A: Mathematical and General , volume=. 1990 , publisher=

work page 1990

[25] [25]

The Handbook of Brain Theory and Neural Networks, , pages=

Statistical mechanics of learning: Generalization , author=. The Handbook of Brain Theory and Neural Networks, , pages=

work page

[26] [26]

Pattern recognition letters , volume=

Expected classification error of the Fisher linear classifier with pseudo-inverse covariance matrix , author=. Pattern recognition letters , volume=. 1998 , publisher=

work page 1998

[27] [27]

Proceedings of the Scandinavian Conference on Image Analysis , volume=

Small sample size generalization , author=. Proceedings of the Scandinavian Conference on Image Analysis , volume=

work page

[28] [28]

Models of Neural Networks III , pages=

Statistical mechanics of generalization , author=. Models of Neural Networks III , pages=. 1996 , publisher=

work page 1996

[29] [29]

Proceedings of the National Academy of Sciences , volume=

A brief prehistory of double descent , author=. Proceedings of the National Academy of Sciences , volume=. 2020 , publisher=

work page 2020

[30] [30]

IEEE Transactions on Pattern Analysis and Machine Intelligence , number=

A problem of dimensionality: A simple example , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , number=. 1979 , publisher=

work page 1979

[31] [31]

On the Peaking Phenomenon of the Lasso in Model Selection

On the peaking phenomenon of the lasso in model selection , author=. arXiv preprint arXiv:0904.4416 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

Frontiers of Life , volume=

Learning to generalize , author=. Frontiers of Life , volume=

work page

[33] [33]

Physical Review E , volume=

Jamming transition as a paradigm to understand the loss landscape of deep neural networks , author=. Physical Review E , volume=. 2019 , publisher=

work page 2019

[34] [34]

Neural Networks , volume=

High-dimensional dynamics of generalization error in neural networks , author=. Neural Networks , volume=. 2020 , publisher=

work page 2020

[35] [35]

Results in statistical discriminant analysis: A review of the former

Raudys,. Results in statistical discriminant analysis: A review of the former. Journal of Multivariate Analysis , volume=. 2004 , publisher=

work page 2004

[36] [36]

Multiparametric

Serdobolskii, Vadim Ivanovich , year=. Multiparametric

work page

[37] [37]

2022 , publisher=

Random Matrix Methods for Machine Learning , author=. 2022 , publisher=

work page 2022

[38] [38]

Random matrix theory and wireless communications , Volume =

Tulino, Antonio M and Verd. Random matrix theory and wireless communications , Volume =. Communications and Information Theory , Number =

work page

[39] [39]

Couillet, Romain and Debbah, Merouane , Publisher =. Random

work page

[40] [40]

Journal of Statistical Planning and Inference , volume=

Random matrix theory in statistics: A review , author=. Journal of Statistical Planning and Inference , volume=. 2014 , publisher=

work page 2014

[41] [41]

Large Sample Covariance Matrices and High-Dimensional Data Analysis , Year =

Yao, Jianfeng and Bai, Zhidong and Zheng, Shurong , Date-Added =. Large Sample Covariance Matrices and High-Dimensional Data Analysis , Year =

work page

[42] [42]

Technical Cybernetics (in Russian) , pages=

On the amount of a priori information in designing the classification algorithm , author=. Technical Cybernetics (in Russian) , pages=. 1972 , volume=

work page 1972

[43] [43]

Representation of statistics of discriminant analysis and asymptotic expansion when space dimensions are comparable with sample size , author=. Sov. Math. Dokl. , volume=

work page

[44] [44]

The Annals of Applied Probability , volume=

Deterministic equivalents for certain functionals of large random matrices , author=. The Annals of Applied Probability , volume=. 2007 , publisher=

work page 2007

[45] [45]

Computing Systems (in Russian) , volume=

On determining training sample size of linear classifier , author=. Computing Systems (in Russian) , volume=

work page

[46] [46]

2012 , publisher=

Statistical and Neural Classifiers: An integrated approach to design , author=. 2012 , publisher=

work page 2012

[47] [47]

New Trends in Probability and Statistics , volume=

Small sample properties of ridge estimate of the covariance matrix in statistical and neural net classification , author=. New Trends in Probability and Statistics , volume=

work page

[48] [48]

1998 , publisher=

Combinatorial theory of the free product with amalgamation and operator-valued free probability theory , author=. 1998 , publisher=

work page 1998

[49] [49]

The Annals of Statistics , volume=

High-dimensional asymptotics of prediction: Ridge regression and classification , author=. The Annals of Statistics , volume=. 2018 , publisher=

work page 2018

[50] [50]

What Causes the Test Error? Going Beyond Bias-Variance via

Lin, Licong and Dobriban, Edgar , journal=. What Causes the Test Error? Going Beyond Bias-Variance via

work page

[51] [51]

Communications on Pure and Applied Mathematics , volume=

The generalization error of random features regression: Precise asymptotics and the double descent curve , author=. Communications on Pure and Applied Mathematics , volume=. 2022 , publisher=

work page 2022

[52] [52]

Advances in Neural Information Processing Systems , year=

Overparameterization improves robustness to covariate shift in high dimensions , author=. Advances in Neural Information Processing Systems , year=

work page

[53] [53]

Spectra of large block matrices

Spectra of large block matrices , author=. arXiv preprint cs/0610045 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

Journal of Functional Analysis , volume=

Applications of realizations (aka linearizations) to free probability , author=. Journal of Functional Analysis , volume=. 2018 , publisher=

work page 2018

[55] [55]

Recht, Benjamin and Roelofs, Rebecca and Schmidt, Ludwig and Shankar, Vaishaal , booktitle=. Do

work page

[56] [56]

Advances in Neural Information Processing Systems , year=

Measuring robustness to natural distribution shifts in image classification , author=. Advances in Neural Information Processing Systems , year=

work page

[57] [57]

Assessing Generalization of

Jiang, Yiding and Nagarajan, Vaishnavh and Baek, Christina and Kolter, J Zico , booktitle=. Assessing Generalization of

work page

[58] [58]

2022 , booktitle=

Agreement-on-the-line: Predicting the Performance of Neural Networks under Distribution Shift , author=. 2022 , booktitle=

work page 2022

[59] [59]

International Conference on Machine Learning , year=

The neural tangent kernel in high dimensions: Triple descent and a multi-scale theory of generalization , author=. International Conference on Machine Learning , year=

work page

[60] [60]

Advances in Neural Information Processing Systems , year=

Understanding double descent requires a fine-grained bias-variance decomposition , author=. Advances in Neural Information Processing Systems , year=

work page

[61] [61]

International Conference on Artificial Intelligence and Statistics , year=

A Random Matrix Perspective on Mixtures of Nonlinearities in High Dimensions , author=. International Conference on Artificial Intelligence and Statistics , year=

work page

[62] [62]

Stochastic Processes and their Applications , volume=

On the limiting spectral distribution for a large class of symmetric random matrices with correlated entries , author=. Stochastic Processes and their Applications , volume=. 2015 , publisher=

work page 2015

[63] [63]

Banna, Marwa and Najim, Jamal and Yao, Jianfeng , journal=. A. 2020 , publisher=

work page 2020

[64] [64]

The matrix Dyson equation and its applications for random matrices

Erd. The matrix. arXiv preprint arXiv:1903.10060 , year=

work page internal anchor Pith review Pith/arXiv arXiv 1903

[65] [65]

IEEE Transactions on Information Theory , year=

Universality laws for high-dimensional learning with random features , author=. IEEE Transactions on Information Theory , year=

work page

[66] [66]

2017 , publisher=

Free probability and random matrices , author=. 2017 , publisher=

work page 2017

[67] [67]

International Mathematics Research Notices , volume=

Operator-valued semicircular elements: solving a quadratic matrix equation with positivity constraints , author=. International Mathematics Research Notices , volume=. 2007 , publisher=

work page 2007

[68] [68]

Advances in Neural Information Processing Systems , year=

High-dimensional Asymptotics of Feature Learning: How One Gradient Step Improves the Representation , author=. Advances in Neural Information Processing Systems , year=

work page

[69] [69]

International Conference on Learning Representations , year=

Anisotropic Random Feature Regression in High Dimensions , author=. International Conference on Learning Representations , year=

work page

[70] [70]

The Annals of Statistics , volume=

Linearized two-layers neural networks in high dimension , author=. The Annals of Statistics , volume=. 2021 , publisher=

work page 2021

[71] [71]

The Annals of Statistics , volume=

Distributed linear regression by averaging , author=. The Annals of Statistics , volume=. 2021 , publisher=

work page 2021

[72] [72]

The Annals of Statistics , volume=

The spectrum of kernel random matrices , author=. The Annals of Statistics , volume=. 2010 , publisher=

work page 2010

[73] [73]

Random Features for Large-Scale Kernel Machines , year =

Rahimi, Ali and Recht, Benjamin , booktitle =. Random Features for Large-Scale Kernel Machines , year =

work page

[74] [74]

Proceedings of the IEEE/CVF International Conference on Computer Vision , year=

The many faces of robustness: A critical analysis of out-of-distribution generalization , author=. Proceedings of the IEEE/CVF International Conference on Computer Vision , year=

work page

[75] [75]

Koh, Pang Wei and Sagawa, Shiori and Marklund, Henrik and Xie, Sang Michael and Zhang, Marvin and Balsubramani, Akshay and Hu, Weihua and Yasunaga, Michihiro and Phillips, Richard Lanas and Gao, Irena and others , booktitle=

work page

[76] [76]

International Conference on Machine Learning , year=

Accuracy on the line: on the strong correlation between out-of-distribution and in-distribution generalization , author=. International Conference on Machine Learning , year=

work page

[77] [77]

Advances in Neural Information Processing Systems , year=

On the Optimal Weighted _2 Regularization in Overparameterized Linear Regression , author=. Advances in Neural Information Processing Systems , year=

work page

[78] [78]

The Annals of Statistics , volume=

Surprises in high-dimensional ridgeless least squares interpolation , author=. The Annals of Statistics , volume=. 2022 , publisher=

work page 2022

[79] [79]

arXiv preprint arXiv:2208.02753 , year=

Spectral universality of regularized linear regression with nearly deterministic sensing matrices , author=. arXiv preprint arXiv:2208.02753 , year=

work page arXiv

[80] [80]

Conference on Learning Theory , year=

Universality of empirical risk minimization , author=. Conference on Learning Theory , year=

work page