Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent
Pith reviewed 2026-05-20 01:25 UTC · model grok-4.3
The pith
In the linear-width regime, the second gradient step on two-layer networks produces weights that act as a spiked random matrix whose number of outliers is set by floor(alpha2 / (1/2 - alpha1)).
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The weights after the second gradient step behave as a spiked random matrix with multiple outliers, the number of which is given by floor(alpha2 / (1/2 - alpha1)), and batch reuse enables the second update to capture directions with information exponent exceeding one when alpha1 and alpha2 are chosen appropriately.
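To make the count concrete, the floor formula above can be evaluated directly. The following is a minimal sketch, not code from the paper: the function name `outlier_count` is hypothetical, and exact rational arithmetic is used only so that the floor is not disturbed by floating-point rounding when the ratio lands on an integer.

```python
from fractions import Fraction
from math import floor


def outlier_count(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)) for exponents in [0, 1/2).

    Hypothetical helper: accepts ints, Fractions, or decimal strings such
    as "0.4". Floats also work, but they are converted to their exact
    binary value, which can shift the result at integer boundaries.
    """
    a1, a2 = Fraction(alpha1), Fraction(alpha2)
    half = Fraction(1, 2)
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("alpha1 and alpha2 must lie in [0, 1/2)")
    return floor(a2 / (half - a1))


# A few exponent pairs and the count the stated formula assigns to them.
for a1, a2 in [("0", "0.4"), ("0.25", "0.25"), ("0.4", "0.4")]:
    print(a1, a2, outlier_count(a1, a2))
```

Under this formula, a first step with alpha1 = 0 never yields an outlier, since alpha2 / (1/2) stays below one for every admissible alpha2.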
What carries the argument
The spectral characterization of the updated weights as a spiked random matrix, with outlier count controlled by the ratio of the two step-size exponents.
If this is right
- The number of learned directions grows with the ratio of the second step-size exponent to the remaining capacity after the first step.
- Batch reuse produces a qualitative improvement over independent batches by unlocking directions with information exponent larger than one.
- Early training dynamics in overparameterized networks admit a precise spectral description once the linear-width scaling and step-size powers are fixed.
- The same scaling regime supplies a tractable limit for studying how optimization moves from random initialization to feature alignment.
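As a worked reading of the first bullet, the stated floor formula can be unpacked into threshold conditions on the two exponents. This is arithmetic on the formula as quoted, with K introduced here as a label for the outlier count; it is not taken from the paper's proofs.

```latex
K = \left\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \right\rfloor \ge k
\quad\Longleftrightarrow\quad
\alpha_2 \ge k\left(\tfrac{1}{2} - \alpha_1\right),
\qquad k = 1, 2, \dots

% Case k = 1: at least one outlier requires \alpha_1 + \alpha_2 \ge 1/2.
% Case k = 2: at least two outliers require 2\alpha_1 + \alpha_2 \ge 1.
```

Because alpha2 is bounded by 1/2, the count can only grow without bound as alpha1 approaches 1/2, where the denominator vanishes.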
Where Pith is reading between the lines
- Extending the same analysis to three or more steps would predict how many additional outliers appear under continued power-law step sizes.
- The batch-reuse distinction may generalize to deeper architectures if analogous step-size scalings are applied layer-wise.
- Finite-width simulations with moderate proportionality constants could directly count outliers and test whether the floor formula remains predictive before the asymptotic limit.
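The outlier counting mentioned in the last bullet could be sketched as follows. This is an illustration on a generic planted model (Gaussian noise plus orthogonal rank-one spikes), not the paper's two-step weight matrix. The function names, the tolerance `tol`, and the use of the Marchenko-Pastur edge 1 + sqrt(N/d) as the bulk boundary are assumptions made here, and NumPy is assumed to be available.

```python
import numpy as np


def count_outliers(W, tol=0.1):
    """Count singular values of W above an assumed bulk edge.

    Assumes W is N x d with noise entries of variance 1/d, so the bulk of
    the singular values ends near 1 + sqrt(N/d). `tol` is a hypothetical
    finite-size safety margin, not a quantity from the paper.
    """
    n_rows, n_cols = W.shape
    edge = 1.0 + np.sqrt(n_rows / n_cols)
    s = np.linalg.svd(W, compute_uv=False)
    return int(np.sum(s > (1.0 + tol) * edge))


def planted_spiked_matrix(n_rows, n_cols, strengths, seed=0):
    """Gaussian noise plus len(strengths) orthogonal rank-one spikes."""
    rng = np.random.default_rng(seed)
    k = len(strengths)
    noise = rng.standard_normal((n_rows, n_cols)) / np.sqrt(n_cols)
    if k == 0:
        return noise
    # Orthonormal spike directions from reduced QR factorizations.
    U, _ = np.linalg.qr(rng.standard_normal((n_rows, k)))
    V, _ = np.linalg.qr(rng.standard_normal((n_cols, k)))
    return noise + (U * np.asarray(strengths)) @ V.T


W = planted_spiked_matrix(300, 600, strengths=[4.0, 4.0, 4.0])
print("outliers detected:", count_outliers(W))
```

A fixed multiplicative margin is the simplest choice; in a real test of the floor formula the margin would need to be tuned to the sizes used, since weak spikes near the edge are hard to separate from finite-size fluctuations.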
Load-bearing premise
The derivation assumes the linear-width regime in which hidden neurons, sample size, and input dimension scale proportionally, together with power-law step-size scalings for the two updates.
What would settle it
Compute the eigenvalues of the weight matrix after exactly two scaled gradient steps on synthetic data with proportional dimensions and verify whether the number of large outliers matches the floor formula while their alignment with the target differs between reused and independent batches.
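A rough sketch of such an experiment is given below. Everything specific in it is an assumption made for illustration and may not match the paper's conventions: the parametrization f(x) = a^T tanh(W x) with a fixed second layer, the squared loss averaged over the batch, the single-index target with Hermite components of degree one and two, and the function and argument names. Outlier counts from this sketch should therefore not be expected to reproduce the floor formula unless the scalings are matched to the paper's.

```python
import numpy as np


def two_step_spectrum(d=200, ratio_n=2.0, ratio_N=1.0,
                      alpha1=0.25, alpha2=0.4,
                      reuse_batch=True, seed=0):
    """Two full-batch gradient steps on first-layer weights (assumed setup).

    Returns the singular values of the updated weights and the absolute
    overlap of each right singular vector with the target direction.
    """
    rng = np.random.default_rng(seed)
    n, N = int(ratio_n * d), int(ratio_N * d)  # proportional sizes
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)                 # unit target direction
    W = rng.standard_normal((N, d)) / np.sqrt(d)  # first-layer weights
    a = rng.standard_normal(N) / np.sqrt(N)       # fixed second layer

    def sample_batch():
        X = rng.standard_normal((n, d))
        z = X @ beta
        y = z + (z ** 2 - 1.0)                   # assumed target function
        return X, y

    def grad(W_cur, X, y):
        # Gradient of (1 / 2n) * sum_i (a^T tanh(W x_i) - y_i)^2 w.r.t. W.
        H = np.tanh(X @ W_cur.T)                 # (n, N) activations
        resid = H @ a - y                        # (n,) prediction errors
        back = resid[:, None] * (1.0 - H ** 2) * a[None, :]
        return back.T @ X / n                    # (N, d)

    X1, y1 = sample_batch()
    W = W - N ** alpha1 * grad(W, X1, y1)        # step size N^alpha1
    X2, y2 = (X1, y1) if reuse_batch else sample_batch()
    W = W - N ** alpha2 * grad(W, X2, y2)        # step size N^alpha2

    _, s, Vt = np.linalg.svd(W, full_matrices=False)
    return s, np.abs(Vt @ beta)


for reuse in (True, False):
    s, overlap = two_step_spectrum(reuse_batch=reuse)
    print("reuse" if reuse else "fresh", s[:4], overlap[:4])
```

Comparing the leading singular values and overlaps between the two settings, across several seeds and sizes, is the kind of check the paragraph above describes.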
Original abstract
We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent, such updates are fundamentally limited: they are approximately rank-one, capturing only a single direction, and require the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $\eta_1 \asymp N^{\alpha_1}$ and $\eta_2 \asymp N^{\alpha_2}$ for $\alpha_1, \alpha_2 \in [0,0.5)$. We derive a sharp spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of these outliers is determined by the scaling parameters $\alpha_1$ and $\alpha_2$ through $\lfloor \frac{\alpha_2}{1/2 - \alpha_1} \rfloor$. Furthermore, by analyzing the alignment between these learned directions and the target function, we identify a qualitative gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, under the condition that $\alpha_1, \alpha_2$ are chosen properly. This confirms that the benefits of batch reuse, previously observed in finite-width regimes, persist in the high-dimensional linear-width limit. By characterizing these early-phase spectral transitions, our work establishes a tractable mathematical framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper examines feature learning in two-layer neural networks in the linear-width regime, where hidden neurons, sample size, and input dimension scale proportionally. It characterizes the weights after a second gradient descent step with step sizes scaling as N^alpha1 and N^alpha2 (with alpha1, alpha2 in [0, 0.5)), showing that the updated weights act as a spiked random matrix whose number of outliers is floor(alpha2 / (1/2 - alpha1)). It further identifies a qualitative difference in alignment with the target function depending on whether batches are independent or reused: reuse lets the second update capture directions with information exponent greater than one under suitable choices of alpha1 and alpha2.
Significance. If the asymptotic spectral results hold, the work supplies a precise mathematical framework for early-phase feature learning beyond single-step updates, with an explicit dependence of the number of learned directions on the step-size exponents and a clear distinction for batch reuse. This extends prior one-step analyses in a controlled high-dimensional limit and could inform understanding of optimization phenomenology in overparameterized networks.
major comments (1)
- The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.
minor comments (2)
- Clarify the precise definition of the information exponent early in the introduction, as its usage in the batch-reuse comparison is central to the qualitative gap claimed.
- The step-size regime alpha1, alpha2 in [0, 0.5) is stated without discussion of boundary behavior at 0.5; a brief remark on why the upper limit is strict would aid readability.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation and constructive feedback. We address the single major comment below.
- Referee: The central spectral characterization and the explicit outlier count floor(alpha2 / (1/2 - alpha1)) are load-bearing for the main claims, yet the abstract and stated results leave the precise perturbation analysis and random-matrix derivation implicit; a dedicated section or appendix deriving this count from the linear-width scaling and the two-step update would strengthen verifiability.
  Authors: We agree that the perturbation analysis underlying the outlier count is central and that its current presentation could be made more self-contained for verifiability. In the revised manuscript we will add a dedicated subsection (placed after the statement of the main spectral result) that derives the floor(alpha2 / (1/2 - alpha1)) count explicitly from the linear-width scaling, the two-step gradient update, and the associated spiked random-matrix perturbation. The derivation will collect the key intermediate lemmas on the covariance structure and eigenvalue perturbation that are currently distributed across the proofs.
  Revision: yes
Circularity Check
No significant circularity; derivation is self-contained asymptotic analysis
The paper derives the spiked random matrix behavior and outlier count floor(alpha2 / (1/2 - alpha1)) via perturbation analysis and high-dimensional limits in the linear-width regime, with step-size scalings eta1 ~ N^alpha1 and eta2 ~ N^alpha2. This is a mathematical characterization from random matrix theory applied to the two-step gradient updates, not a fit to data or a quantity defined circularly from the outputs. The batch-reuse vs. independent-batch distinction follows from analyzing alignments with the target function under the stated scalings, yielding independent content on information exponents >1. No load-bearing self-citations, self-definitional steps, or renamed known results appear in the central claims; the analysis is externally grounded in asymptotic techniques rather than reducing to the paper's own inputs by construction.
Axiom & Free-Parameter Ledger
free parameters (1)
- alpha1 and alpha2
axioms (1)
- domain assumption: linear-width regime, in which hidden neurons, sample size, and input dimension all scale proportionally with N.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: "number of these outliers is determined by the scaling parameters alpha1 and alpha2 through floor(alpha2/(1/2-alpha1))"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.