Limitations of Lazy Training of Two-layers Neural Networks

Andrea Montanari; Behrooz Ghorbani; Song Mei; Theodor Misiakiewicz

arxiv: 1906.08899 · v1 · pith:XERIYRSDnew · submitted 2019-06-21 · stat.ML · cs.LG · math.ST · stat.TH

Pith reviewed 2026-05-25 19:09 UTC · model grok-4.3

classification stat.ML · cs.LG · math.ST · stat.TH

keywords two-layer neural networks · lazy training · random features · neural tangent kernel · quadratic model · prediction risk · underparameterized regime

The pith

For quadratic targets, two-layer networks with quadratic activations can show a potentially unbounded gap in prediction risk between random features, neural tangent, and full training when the number of neurons is smaller than the input dimension.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines supervised learning under a quadratic target model with Gaussian inputs and a two-Gaussian mixture classification model, using two-layer networks with quadratic activations. It compares three regimes: random features training of only the second layer, neural tangent training of a linearization around initialization, and full training of all weights. The central result shows that even the quadratic model produces arbitrarily large differences in achieved prediction risk across these regimes when the number of neurons is smaller than the input dimension. This matters because many theoretical analyses of neural networks rely on the neural tangent or random features regimes, yet these may fall short of what full training can achieve in narrower networks. When the number of neurons exceeds the dimension, the neural tangent and full training regimes both reach zero risk.

Core claim

Even for the simple quadratic model where inputs are d-dimensional Gaussians and responses come from an unknown quadratic function, two-layer networks with quadratic activations achieve prediction risks that differ without bound across the random features regime, the neural tangent regime, and the fully trained regime, when the number of neurons is smaller than d. When the number of neurons exceeds d, both the neural tangent and fully trained regimes achieve zero risk.
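One common way to write the three function classes is sketched below. The notation is assumed for illustration and is not quoted from the paper: $N$ is the number of neurons, $w_i \in \mathbb{R}^d$ are first-layer weights, $\sigma(u) = u^2$ is the activation, $c$ is an offset, and the NT class is written using only the first-layer part of the linearization.

```latex
% Assumed notation: N neurons, w_i in R^d, sigma(u) = u^2, offset c.
\begin{align*}
\text{RF:}\quad & f(x) = c + \sum_{i=1}^{N} a_i\,\sigma(\langle w_i, x\rangle),
  && w_i \text{ fixed at random},\ a_i \in \mathbb{R} \text{ trained},\\
\text{NT:}\quad & f(x) = c + \sum_{i=1}^{N} \sigma'(\langle w_i, x\rangle)\,\langle s_i, x\rangle,
  && w_i \text{ fixed at random},\ s_i \in \mathbb{R}^d \text{ trained},\\
\text{NN:}\quad & f(x) = c + \sum_{i=1}^{N} a_i\,\sigma(\langle w_i, x\rangle),
  && \text{both } a_i \text{ and } w_i \text{ trained}.
\end{align*}
% With sigma(u) = u^2 every class contains only functions c + <x, A x>:
%   RF: A in span{ w_i w_i^T }                 (N free coefficients)
%   NT: A = sum_i ( w_i s_i^T + s_i w_i^T )    (N d free coefficients)
%   NN: A any symmetric matrix of rank <= N
```

Under this reading, comparing the regimes amounts to asking how well each set of matrices $A$ can approximate the matrix of the quadratic target.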

What carries the argument

The three training regimes (random features, neural tangent linearization, and full weight updates) and the resulting gaps in prediction risk they produce on the quadratic target model.

If this is right

Random features training can incur much higher risk than the other two regimes when the number of neurons is smaller than the dimension.
Neural tangent training can incur higher risk than full training when the number of neurons is smaller than the dimension.
Neural tangent and full training both reach zero risk on the quadratic model when the number of neurons exceeds the dimension.
The unbounded gap is exhibited already by the quadratic model without requiring more complex data distributions.
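The ordering in the list above can be checked numerically with a minimal sketch. It is not the paper's derivation. It assumes standard Gaussian inputs, a positive semidefinite target matrix `B`, and a free offset in every class. It also reads "NN" as the best function in the class rather than the output of a particular optimizer. The function name `best_in_class_risks` is hypothetical.

```python
import numpy as np


def best_in_class_risks(B, W):
    """Best achievable population risk per regime for f_*(x) = x^T B x.

    Assumptions (illustrative only):
      * x ~ N(0, I_d), activation sigma(u) = u^2, free offset c in each class;
      * B is symmetric positive semidefinite, W has shape (N, d);
      * "NN" is the global minimum over the class, not the result of SGD.
    With a free offset, the minimum over c of E[(x^T D x - c)^2] is 2 * ||D||_F^2.
    """
    d = B.shape[0]
    N = W.shape[0]

    # RF: A ranges over span{w_i w_i^T}; solve least squares in Frobenius norm.
    Phi = np.stack([np.outer(w, w).ravel() for w in W], axis=1)  # (d*d, N)
    a, *_ = np.linalg.lstsq(Phi, B.ravel(), rcond=None)
    rf = 2.0 * np.sum((B.ravel() - Phi @ a) ** 2)

    # NT: A = sum_i (w_i s_i^T + s_i w_i^T); the unreachable part of B is
    # P_perp B P_perp, with P_perp the projector orthogonal to span{w_i}.
    Q, _ = np.linalg.qr(W.T)  # orthonormal basis, shape (d, min(d, N))
    P_perp = np.eye(d) - Q @ Q.T
    nt = 2.0 * np.sum((P_perp @ B @ P_perp) ** 2)

    # NN: A is any symmetric matrix of rank <= N; by Eckart-Young the best
    # choice keeps the N eigenvalues of largest magnitude.
    lam = np.sort(np.abs(np.linalg.eigvalsh(B)))[::-1]
    nn = 2.0 * np.sum(lam[N:] ** 2)
    return rf, nt, nn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 20
    G = rng.standard_normal((d, d))
    B = G @ G.T / d  # a random PSD target (hypothetical choice)
    for N in (5, 10, 25):
        W = rng.standard_normal((N, d))
        print(N, best_in_class_risks(B, W))
```

In this sketch the RF class is contained in the NT class, so its risk can never be smaller. For positive semidefinite `B`, eigenvalue interlacing gives NT risk at least as large as NN risk.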

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Analyses that rely on the neural tangent kernel may systematically underestimate the benefit of non-lazy training in width regimes where neurons are fewer than dimensions.
The exact gap may be tied to the quadratic choice of activation and target, raising the question of whether similar separations hold for other activations or targets.
The result suggests that the advantage of full training appears only below a critical width threshold relative to dimension.

Load-bearing premise

The target is exactly quadratic and the network uses quadratic activations, which permits exact computation of the risk differences.

What would settle it

A calculation or experiment on the quadratic target with quadratic activations showing that the maximum risk gap across the three regimes remains bounded by a constant independent of all other parameters when neurons are fewer than dimensions.
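An experiment of this kind could be sketched as follows, using a hypothetical "spiked" target that is not the paper's construction: `B` has one eigenvalue equal to 1 and the rest equal to `eps`, with a single neuron (N = 1 < d). The sketch assumes standard Gaussian inputs, a free offset, and best-in-class risks. The names `nt_and_nn_risk` and `gap_ratios` are illustrative.

```python
import numpy as np


def nt_and_nn_risk(B, W):
    """Best-in-class NT and NN risks for f_*(x) = x^T B x with x ~ N(0, I_d).

    Assumes a free offset and quadratic activations, so each risk equals
    2 * ||residual matrix||_F^2 (an illustrative simplification).
    """
    d, N = B.shape[0], W.shape[0]
    Q, _ = np.linalg.qr(W.T)                      # basis of span{w_i}
    P_perp = np.eye(d) - Q @ Q.T                  # projector onto its complement
    nt = 2.0 * np.sum((P_perp @ B @ P_perp) ** 2)
    lam = np.sort(np.abs(np.linalg.eigvalsh(B)))[::-1]
    nn = 2.0 * np.sum(lam[N:] ** 2)               # best rank-N approximation error
    return nt, nn


def gap_ratios(d=10, eps_values=(1e-1, 1e-2, 1e-3), seed=0):
    """Ratio NT risk / NN risk for B = diag(1, eps, ..., eps) and one neuron."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((1, d))               # N = 1 < d
    ratios = []
    for eps in eps_values:
        B = eps * np.eye(d)
        B[0, 0] = 1.0
        nt, nn = nt_and_nn_risk(B, W)
        ratios.append(nt / nn)
    return ratios


if __name__ == "__main__":
    for eps, r in zip((1e-1, 1e-2, 1e-3), gap_ratios()):
        print(f"eps={eps:g}  NT/NN risk ratio={r:.3g}")
```

In this toy setting the ratio grows as `eps` shrinks. A sweep in which it stayed below a fixed constant would be the kind of evidence the paragraph above asks for.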

Figures

Figures reproduced from arXiv: 1906.08899 by Andrea Montanari, Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz.

**Figure 2.** Left frame: Prediction (test) error of a two-layer neural network in fitting a mixture of Gaussians

Original abstract

We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.
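The two data models in the abstract can be simulated with a short sketch. Several details here are assumptions, not taken from the paper: identity covariance in model (1), equal class probabilities and labels in {-1, +1} in model (2), and the names `sample_quadratic_model`, `sample_mixture_model`, `Sigma_pos` and `Sigma_neg`.

```python
import numpy as np


def sample_quadratic_model(n, B, b0=0.0, rng=None):
    """Model (1): x_i ~ N(0, I_d) and y_i = b0 + x_i^T B x_i (no label noise)."""
    rng = np.random.default_rng() if rng is None else rng
    d = B.shape[0]
    X = rng.standard_normal((n, d))
    y = b0 + np.einsum("ni,ij,nj->n", X, B, X)
    return X, y


def sample_mixture_model(n, Sigma_pos, Sigma_neg, rng=None):
    """Model (2): two centered Gaussians; the label says which one generated x_i.

    Equal class probabilities and +/-1 labels are assumptions of this sketch.
    Both covariances must be positive definite (Cholesky is used).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = Sigma_pos.shape[0]
    y = rng.choice([-1, 1], size=n)
    L_pos = np.linalg.cholesky(Sigma_pos)
    L_neg = np.linalg.cholesky(Sigma_neg)
    Z = rng.standard_normal((n, d))
    # Color the same white noise with the covariance of the sampled class.
    X = np.where((y == 1)[:, None], Z @ L_pos.T, Z @ L_neg.T)
    return X, y
```

Because both mixture components are centered, the classes differ only in their covariance, which is why a quadratic function of the input is a natural predictor in this setting.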

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper shows an explicit unbounded risk gap between lazy regimes and full training on quadratic targets when width is below dimension.

The core point is that random features and neural tangent training can be arbitrarily worse than full training even on a quadratic target with quadratic activations, provided the hidden layer has fewer neurons than the input dimension. When width exceeds dimension both NT and full training reach zero risk, but below that threshold the gap opens up without bound for the Gaussian quadratic model. This is laid out cleanly for two simple data models and gives a concrete separation between the three regimes. The construction is explicit enough that the population risks can be derived in closed form for each case. That is the main contribution: an existence result showing lazy training is not always sufficient, even for elementary supervised problems. The math isolates the regimes without extra assumptions on sample size or optimization details beyond the standard ones. The result is new relative to earlier NTK and random feature analyses because those mostly established approximation or equivalence in certain limits rather than an explicit performance gap. The paper does not overclaim; it sticks to the quadratic setting that makes the separation possible. The modeling choices are narrow by design, but that is what produces the unbounded gap, and the authors are clear about the scope. No circularity appears in the argument, and the claim is an existence statement rather than a generic assertion about all activations or targets. This is useful for anyone working on the limits of lazy training explanations. Readers following the NTK literature will want to see the details and check the derivations. It is solid enough on its own terms to merit sending out for peer review rather than a desk reject.

Referee Report

0 major / 3 minor

Summary. The manuscript compares three regimes for training two-layer networks with quadratic activations: random features (RF, second-layer only), neural tangent (NT, linearized around init), and full NN (all weights). Under a quadratic regression model with Gaussian features and a two-Gaussian mixture classification model, it proves that when hidden width m < ambient dimension d the population risks of the three regimes can differ by an arbitrarily large factor; when m > d both NT and NN achieve zero risk on the quadratic model.

Significance. If the derivations hold, the work supplies an explicit, analytically tractable existence result showing that lazy training can be arbitrarily suboptimal even for quadratic targets and activations. The modeling choices (quadratic everything) are deliberate and sufficient to produce the separation, and the paper supplies closed-form risk expressions together with the width-versus-dimension threshold, which are concrete strengths.

minor comments (3)

[§2.1] The population risk is defined via an expectation that is never written explicitly; adding the integral or E[·] notation would remove ambiguity.
[Theorem 3.1] The statement that the gap is 'potentially unbounded' should be accompanied by the explicit scaling (e.g., gap grows as d/m or similar) that is derived in the proof.
[Figure 2] In the caption, the plotted curves are not labeled with the precise values of d and m used; this makes it hard to verify the m < d regime visually.
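For the first comment, an explicit form of the population risk might look as follows. The notation is assumed for illustration rather than copied from the manuscript: $x \sim \mathsf{N}(0, I_d)$, target $f_*(x) = b_0 + \langle x, Bx\rangle$, candidate $f(x) = c + \langle x, Ax\rangle$, with $A$ and $B$ symmetric.

```latex
% Assumed notation: x ~ N(0, I_d), f_*(x) = b_0 + <x, B x>, f(x) = c + <x, A x>.
\begin{align*}
R(f) &= \mathbb{E}_{x}\big[(f_*(x) - f(x))^2\big] \\
     &= \operatorname{Var}\big(\langle x, (B - A)x\rangle\big)
        + \big(\mathbb{E}\langle x, (B - A)x\rangle + b_0 - c\big)^2 \\
     &= 2\,\|B - A\|_F^2 + \big(\operatorname{tr}(B - A) + b_0 - c\big)^2 .
\end{align*}
```

The last line uses the standard Gaussian identities $\mathbb{E}\langle x, Dx\rangle = \operatorname{tr}(D)$ and $\operatorname{Var}\langle x, Dx\rangle = 2\|D\|_F^2$ for symmetric $D$.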

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance.

Circularity Check

0 steps flagged

No circularity; derivation is self-contained from explicit models

The paper's central claim is a mathematical existence result: for the explicitly defined quadratic target (model 1) or two-Gaussian mixture (model 2) with quadratic activations, the population risks of RF, NT, and NN regimes can differ by an unbounded factor when hidden width m < d. This follows from direct analysis of the optimization problems and risk expressions under the stated Gaussian assumptions, without any fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations. The modeling choices are declared at the outset and suffice to produce the separation; the derivation does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the two explicit data-generating processes and the choice of quadratic activations; these are modeling assumptions rather than derived quantities.

axioms (2)

domain assumption: Feature vectors are d-dimensional Gaussians (model 1) or a two-component centered Gaussian mixture (model 2).
Stated explicitly as the supervised learning models under study.
domain assumption: The network uses quadratic activations.
The architecture chosen for the RF/NT/NN comparison.

pith-pipeline@v0.9.0 · 5762 in / 1335 out tokens · 24362 ms · 2026-05-25T19:09:01.232592+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · 8 internal anchors

[1] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, et al., Tensorflow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265--283
[2] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang, On exact computation with an infinitely wide neural net, arXiv:1904.11955 (2019)
[3] Ahmed El Alaoui and Michael W Mahoney, Fast randomized kernel ridge regression with statistical guarantees, Advances in Neural Information Processing Systems, 2015, pp. 775--783
[4] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song, A convergence theory for deep learning via over-parameterization, arXiv:1811.03962 (2018)
[5] Francis Bach, Sharp analysis of low-rank kernel matrix approximations, Conference on Learning Theory, 2013, pp. 185--209
[6] Francis Bach, Breaking the curse of dimensionality with convex neural networks, The Journal of Machine Learning Research 18 (2017), no. 1, 629--681
[7] Francis Bach, On the equivalence between kernel quadrature rules and random feature expansions, The Journal of Machine Learning Research 18 (2017), no. 1, 714--751
[8] Stéphane Boucheron, Gábor Lugosi, and Pascal Massart, Concentration inequalities: A nonasymptotic theory of independence, Oxford University Press, 2013
[9] Zhidong Bai and Jack W Silverstein, Spectral analysis of large dimensional random matrices, vol. 20, Springer, 2010
[10] Lenaic Chizat and Francis Bach, A note on lazy training in supervised differentiable programming, arXiv:1812.07956 (2018)
[11] George Cybenko, Approximation by superpositions of a sigmoidal function, Mathematics of Control, Signals and Systems 2 (1989), no. 4, 303--314
[12] Simon S Du and Jason D Lee, On the power of over-parametrization in neural networks with quadratic activation, arXiv:1803.01206 (2018)
[13] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai, Gradient descent finds global minima of deep neural networks, arXiv:1811.03804 (2018)
[14] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh, Gradient descent provably optimizes over-parameterized neural networks, arXiv:1810.02054 (2018)
[15] Noureddine El Karoui et al., The spectrum of kernel random matrices, The Annals of Statistics 38 (2010), no. 1, 1--50
[16] Carl Eckart and Gale Young, The approximation of one matrix by another of lower rank, Psychometrika 1 (1936), no. 3, 211--218
[17] Rong Ge, Chi Jin, and Yi Zheng, No spurious local minima in nonconvex low rank problems: A unified geometric analysis, Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org, 2017, pp. 1233--1242
[18] Rong Ge, Jason D Lee, and Tengyu Ma, Learning one-hidden-layer neural networks with landscape design, arXiv:1711.00501 (2017)
[19] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari, Linearized two-layers neural networks in high dimension, arXiv:1904.12191 (2019)
[20] Trevor Hastie, Andrea Montanari, Saharon Rosset, and Ryan J Tibshirani, Surprises in high-dimensional ridgeless least squares interpolation, arXiv:1903.08560 (2019)
[21] Benjamin Haeffele, Eric Young, and Rene Vidal, Structured low-rank matrix factorization: Optimality, algorithm, and applications to image processing, International Conference on Machine Learning, 2014, pp. 2007--2015
[22] Arthur Jacot, Franck Gabriel, and Clément Hongler, Neural tangent kernel: Convergence and generalization in neural networks, Advances in Neural Information Processing Systems, 2018, pp. 8571--8580
[23] Adam Klivans and Pravesh Kothari, Embedding hard learning problems into gaussian space, Approximation, Randomization, and Combinatorial Optimization. Algorithms and Techniques (APPROX/RANDOM 2014), Schloss Dagstuhl-Leibniz-Zentrum fuer Informatik, 2014
[24] Thomas G Kurtz, Solutions of ordinary differential equations as limits of pure jump markov processes, Journal of Applied Probability 7 (1970), no. 1, 49--58
[25] Michel Ledoux, The concentration of measure phenomenon, no. 89, American Mathematical Soc., 2001
[26] S Lojasiewicz, Sur les trajectoires du gradient d'une fonction analytique, Seminari di geometria 1983 (1982), 115--117
[27] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, and Jeffrey Pennington, Wide neural networks of any depth evolve as linear models under gradient descent, arXiv:1902.06720 (2019)
[28] Song Mei, Yu Bai, and Andrea Montanari, The landscape of empirical risk for nonconvex losses, The Annals of Statistics 46 (2018), no. 6A, 2747--2774
[29] Ioannis Panageas and Georgios Piliouras, Gradient descent only converges to minimizers: Non-isolated critical points and invariant regions, arXiv:1605.00405 (2016)
[30] Ali Rahimi and Benjamin Recht, Random features for large-scale kernel machines, Advances in Neural Information Processing Systems, 2008, pp. 1177--1184
[31] Alessandro Rudi and Lorenzo Rosasco, Generalization properties of learning with random features, Advances in Neural Information Processing Systems, 2017, pp. 3215--3225
[32] Ohad Shamir, Distribution-specific hardness of learning neural networks, The Journal of Machine Learning Research 19 (2018), no. 1, 1135--1163
[33] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee, Theoretical insights into the optimization landscape of over-parameterized shallow neural networks, IEEE Transactions on Information Theory 65 (2019), no. 2, 742--769
[34] Roman Vershynin, Introduction to the non-asymptotic analysis of random matrices, arXiv:1011.3027 (2010)
[35] Gilad Yehudai and Ohad Shamir, On the power and limitations of random features for understanding neural networks, arXiv:1904.00687 (2019)
[36] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu, Stochastic gradient descent optimizes over-parameterized deep relu networks, arXiv:1811.08888 (2018)
[37] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon, Recovery guarantees for one-hidden-layer neural networks, Proceedings of the 34th International Conference on Machine Learning-Volume 70, JMLR.org, 2017, pp. 4140--4149
