Limitations of Lazy Training of Two-layers Neural Networks
Pith reviewed 2026-05-25 19:09 UTC · model grok-4.3
The pith
For quadratic targets, two-layer networks with quadratic activations show an unbounded gap in prediction risk between random features, neural tangent, and full training when the number of neurons is smaller than the input dimension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even for the simple quadratic model where inputs are d-dimensional Gaussians and responses come from an unknown quadratic function, two-layer networks with quadratic activations achieve prediction risks that differ without bound across the random features regime, the neural tangent regime, and the fully trained regime, when the number of neurons is smaller than d. When the number of neurons exceeds d, both the neural tangent and fully trained regimes achieve zero risk.
What carries the argument
The three training regimes (random features, neural tangent linearization, and full weight updates) and the resulting gaps in prediction risk they produce on the quadratic target model.
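The three regimes can be compared numerically in an idealized form. The sketch below is not taken from the paper: it assumes the target is f_*(x) = b_0 + x^T B x with B symmetric, that x is standard Gaussian, that a trainable offset absorbs the mean, and that each regime is scored by the best population risk attainable inside its function class (which is not necessarily what gradient descent finds). The function names and the notation W, B, m are hypothetical. Under these assumptions the risk of a residual quadratic form R is 2‖R‖_F², the RF class is the span of the matrices w_i w_i^T, the NT class is the set of symmetrized products S^T W, and the NN class is the set of symmetric matrices of rank at most m.

```python
import numpy as np


def risk(residual):
    """Risk 2 * ||R||_F^2 of a symmetric residual R for x ~ N(0, I).

    Assumes a trainable offset has already removed the mean (trace) part.
    """
    return 2.0 * np.sum(residual ** 2)


def rf_risk(B, W):
    """Random features: rows w_i of W are frozen, only coefficients a_i are fitted."""
    feats = np.stack([np.outer(w, w).ravel() for w in W], axis=1)  # shape (d*d, m)
    a, *_ = np.linalg.lstsq(feats, B.ravel(), rcond=None)
    return risk(B - (feats @ a).reshape(B.shape))


def nt_risk(B, W):
    """Linearized model: the part of B it cannot reach is P_perp B P_perp."""
    d = B.shape[0]
    Q, _ = np.linalg.qr(W.T)              # orthonormal basis of span{w_i}
    P_perp = np.eye(d) - Q @ Q.T          # projector onto the orthogonal complement
    return risk(P_perp @ B @ P_perp)


def nn_risk(B, m):
    """Full training, idealized as the best rank-m approximation (Eckart-Young)."""
    eig = np.sort(np.abs(np.linalg.eigvalsh(B)))[::-1]
    return 2.0 * np.sum(eig[m:] ** 2)
```

Calling the three functions with the same random `W` of shape `(m, d)` and a positive semidefinite `B` gives one risk per regime. For positive semidefinite targets the values should be ordered RF ≥ NT ≥ NN, and the NT and NN values should vanish once m ≥ d.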
If this is right
- Random features training can incur much higher risk than the other two regimes when the number of neurons is smaller than the dimension d.
- Neural tangent training can incur higher risk than full training when the number of neurons is smaller than d.
- Both neural tangent and full training reach zero risk on the quadratic model when the number of neurons exceeds d.
- The unbounded gap is exhibited already by the quadratic model without requiring more complex data distributions.
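One simple way a gap of this kind can arise is a rank-one target. The sketch below is an illustration under stated assumptions, not the paper's construction: the target is x -> <v, x>^2 for a unit vector v, inputs are standard Gaussian, a trainable offset absorbs the mean, and the linearized model is scored by its best achievable population risk. A fully trained network with a single neuron (w = v, a = 1) represents this target exactly, so its idealized risk is zero for every m ≥ 1, while the linearized model around random weights leaves a residual until m reaches d. The names `nt_residual_risk` and `width_sweep` are hypothetical.

```python
import numpy as np


def nt_residual_risk(v, W):
    """Risk left by the linearized class on x -> <v, x>^2, with the offset absorbed."""
    Q, _ = np.linalg.qr(W.T)                  # orthonormal basis of span{w_i}
    leftover = 1.0 - np.sum((Q.T @ v) ** 2)   # v^T P_perp v for a unit vector v
    return 2.0 * max(leftover, 0.0) ** 2


def width_sweep(d=20, seed=0):
    """Return (m, idealized NT risk) for m = 1..d with nested random first layers."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(d)
    v /= np.linalg.norm(v)                    # rank-one PSD target B = v v^T
    W_full = rng.standard_normal((d, d))
    return [(m, nt_residual_risk(v, W_full[:m])) for m in range(1, d + 1)]
```

Because the fully trained risk is zero in this example, the ratio between the two regimes is infinite for every m < d, and the linearized risk drops to zero only at m = d.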
Where Pith is reading between the lines
- Analyses that rely on the neural tangent kernel may systematically underestimate the benefit of non-lazy training in width regimes where neurons are fewer than dimensions.
- The exact gap may be tied to the quadratic choice of activation and target, raising the question of whether similar separations hold for other activations or targets.
- The result suggests that the advantage of full training appears only below a critical width threshold relative to dimension.
Load-bearing premise
The target is exactly quadratic and the network uses quadratic activations, which permits exact computation of the risk differences.
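The reason the quadratic setting allows exact risk computations can be written out in one line. The notation below is assumed for illustration and is not fixed by the text above: the target is f_*(x) = b_0 + ⟨x, Bx⟩ with B symmetric, and the network is f(x) = c + Σ_i a_i ⟨w_i, x⟩². The identity uses only the fact that, for a standard Gaussian x and a symmetric matrix Δ, the quadratic form ⟨x, Δx⟩ has mean tr Δ and variance 2‖Δ‖_F².

```latex
% Assumed notation: f_*(x) = b_0 + <x, B x>,  f(x) = c + sum_i a_i <w_i, x>^2
\begin{align*}
f_*(x) - f(x) &= \langle x, \Delta x\rangle + (b_0 - c),
  \qquad \Delta = B - \sum_{i=1}^{m} a_i\, w_i w_i^{\top},\\
\mathbb{E}_{x\sim\mathcal{N}(0,I_d)}\big[(f_*(x)-f(x))^2\big]
  &= 2\,\|\Delta\|_F^2 + \big(\operatorname{tr}\Delta + b_0 - c\big)^2 .
\end{align*}
```

With a free offset c the second term can be set to zero, so comparing regimes reduces to comparing how well each one can approximate B in Frobenius norm.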
What would settle it
A calculation or experiment on the quadratic target with quadratic activations showing that the maximum risk gap across the three regimes remains bounded by a constant independent of all other parameters when neurons are fewer than dimensions.
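A numerical version of such a check could look like the following sketch. It is a toy harness under assumptions that are not from the paper: the target matrix is diagonal with a geometric spectrum `decay**k`, inputs are standard Gaussian, offsets absorb the mean, and each regime is scored by the best population risk within its function class. The function name `risk_ratio` and its parameters are hypothetical. If the gap were bounded by a universal constant, the ratio below would stay bounded as the spectrum becomes more concentrated.

```python
import numpy as np


def risk_ratio(decay, d=20, m=5, seed=0):
    """Idealized NT risk divided by idealized NN risk for B = diag(decay**k), m < d."""
    rng = np.random.default_rng(seed)
    lam = decay ** np.arange(d)                   # PSD spectrum, largest eigenvalue first
    B = np.diag(lam)
    W = rng.standard_normal((m, d))               # random first-layer weights
    Q, _ = np.linalg.qr(W.T)
    P_perp = np.eye(d) - Q @ Q.T                  # projector onto complement of span{w_i}
    nt = 2.0 * np.sum((P_perp @ B @ P_perp) ** 2)
    nn = 2.0 * np.sum(lam[m:] ** 2)               # best rank-m approximation (Eckart-Young)
    return nt / nn


ratios = {decay: risk_ratio(decay) for decay in (0.9, 0.5, 0.1)}
```

In this toy setting the ratio is expected to grow as `decay` shrinks, because the rank-m approximation captures almost all of B while the linearized model still misses a fixed fraction of the leading direction.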
Original abstract
We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript compares three regimes for training two-layer networks with quadratic activations: random features (RF, second-layer weights only), neural tangent (NT, a linearization around the initialization), and full NN (all weights). Under a quadratic regression model with Gaussian features and a two-Gaussian mixture classification model, it proves that when the hidden width m is smaller than the ambient dimension d, the population risks of the three regimes can differ by an arbitrarily large factor; when m > d, both NT and NN achieve zero risk on the quadratic model.
Significance. If the derivations hold, the work supplies an explicit, analytically tractable existence result showing that lazy training can be arbitrarily suboptimal even for quadratic targets and activations. The modeling choices (quadratic everything) are deliberate and sufficient to produce the separation, and the paper supplies closed-form risk expressions together with the width-versus-dimension threshold, which are concrete strengths.
minor comments (3)
- [§2.1] The population risk is defined via an expectation that is never written explicitly; adding the integral or E[·] notation would remove ambiguity.
- [Theorem 3.1] The statement that the gap is 'potentially unbounded' should be accompanied by the explicit scaling (e.g., gap grows as d/m or similar) that is derived in the proof.
- [Figure 2] Caption: the plotted curves are not labeled with the precise values of d and m used; this makes it hard to verify the m < d regime visually.
Simulated Author's Rebuttal
We thank the referee for their positive assessment of the manuscript and for recommending acceptance.
Circularity Check
No circularity; derivation is self-contained from explicit models
full rationale
The paper's central claim is a mathematical existence result: for the explicitly defined quadratic target (model 1) or two-Gaussian mixture (model 2) with quadratic activations, the population risks of RF, NT, and NN regimes can differ by an unbounded factor when hidden width m < d. This follows from direct analysis of the optimization problems and risk expressions under the stated Gaussian assumptions, without any fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations. The modeling choices are declared at the outset and suffice to produce the separation; the derivation does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Feature vectors are d-dimensional Gaussians (model 1) or a two-component centered Gaussian mixture (model 2)
- domain assumption: The network uses quadratic activations