arxiv: 1906.08899 · v1 · pith:XERIYRSDnew · submitted 2019-06-21 · stat.ML · cs.LG · math.ST · stat.TH

Limitations of Lazy Training of Two-layers Neural Networks

Pith reviewed 2026-05-25 19:09 UTC · model grok-4.3

classification: stat.ML · cs.LG · math.ST · stat.TH
keywords: two-layer neural networks · lazy training · random features · neural tangent kernel · quadratic model · prediction risk · underparameterized regime

The pith

For quadratic targets, two-layer networks with quadratic activations can show arbitrarily large gaps in prediction risk between random features, neural tangent, and full training when there are fewer neurons than input dimensions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines supervised learning under a quadratic target model with Gaussian inputs and a two-Gaussian mixture classification model, using two-layer networks with quadratic activations. It compares three regimes: random features training of only the second layer, neural tangent training of a linearization around initialization, and full training of all weights. The central result shows that even the quadratic model produces arbitrarily large differences in achieved prediction risk across these regimes when the number of neurons is smaller than the input dimension. This matters because many theoretical analyses of neural networks rely on the neural tangent or random features regimes, yet these may fall short of what full training can achieve in narrower networks. When the number of neurons exceeds the dimension, the neural tangent and full training regimes both reach zero risk.

Core claim

Even for the simple quadratic model where inputs are d-dimensional Gaussians and responses come from an unknown quadratic function, two-layer networks with quadratic activations achieve prediction risks that differ without bound across the random features regime, the neural tangent regime, and the fully trained regime, when the number of neurons is smaller than d. When the number of neurons exceeds d, both the neural tangent and fully trained regimes achieve zero risk.
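
For orientation, the three regimes can be written out as follows. This is a sketch under assumptions, not the paper's notation: it takes the pure quadratic activation $\sigma(z) = z^2$, writes $m$ for the number of neurons, $w_i$ for first-layer weights, $a_i$ for second-layer weights, $c$ for an offset, and $s_i$ for the trainable tangent directions. Constant factors and the network's value at initialization are absorbed or omitted in the NT line.

```latex
\begin{aligned}
f_{\mathrm{RF}}(x; a, c) &= c + \sum_{i=1}^{m} a_i \langle w_i, x\rangle^2,
  && w_i \text{ fixed at initialization},\ a_i, c \text{ trained},\\
f_{\mathrm{NT}}(x; s, c) &= c + \sum_{i=1}^{m} \langle w_i, x\rangle \langle s_i, x\rangle,
  && w_i \text{ fixed},\ s_i \in \mathbb{R}^d, c \text{ trained},\\
f_{\mathrm{NN}}(x; W, a, c) &= c + \sum_{i=1}^{m} a_i \langle w_i, x\rangle^2,
  && w_i, a_i, c \text{ all trained}.
\end{aligned}
```

Under these assumptions each model is a quadratic form $\langle x, M x\rangle + c$, and the regimes differ only in which symmetric matrices $M$ they can reach.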

What carries the argument

The three training regimes (random features, neural tangent linearization, and full weight updates) and the resulting gaps in prediction risk they produce on the quadratic target model.
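
The risk gaps can be illustrated numerically with a minimal sketch, assuming the pure quadratic activation z ↦ z², a target f_*(x) = ⟨x, Bx⟩ with symmetric B, standard Gaussian inputs, and a freely fitted offset. Under those assumptions a predictor ⟨x, Mx⟩ + c has best-offset population risk 2‖B − M‖_F², so each regime reduces to a matrix approximation problem. The function names (`quad_risk`, `rf_risk`, `nt_risk`, `nn_risk`) are hypothetical and chosen here for illustration. `nn_risk` reports the best risk over the function class (rank at most m, via Eckart–Young), not what a particular optimizer would reach.

```python
import numpy as np

def quad_risk(B, M):
    """Best-offset population risk of <x, M x> + c against <x, B x>
    for x ~ N(0, I_d): equals 2 * ||B - M||_F^2 for symmetric B - M."""
    return 2.0 * np.linalg.norm(B - M, "fro") ** 2

def rf_risk(B, W):
    """Random features (assumed form): M = sum_i a_i w_i w_i^T, only a is fit."""
    # Each design column is a flattened rank-one matrix w_i w_i^T.
    feats = np.stack([np.outer(w, w).ravel() for w in W], axis=1)  # (d*d, m)
    a, *_ = np.linalg.lstsq(feats, B.ravel(), rcond=None)
    return quad_risk(B, (feats @ a).reshape(B.shape))

def nt_risk(B, W):
    """Neural tangent (assumed form): M = W^T S + S^T W with S free, so only
    the block of B orthogonal to the row space of W stays unfit."""
    d = B.shape[0]
    Q = np.eye(d) - np.linalg.pinv(W) @ W  # projector onto complement of row space of W
    return 2.0 * np.linalg.norm(Q @ B @ Q, "fro") ** 2

def nn_risk(B, m):
    """Full training, best case: M ranges over symmetric matrices of rank <= m,
    so the residual is the d - m eigenvalues of smallest magnitude."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(B)))
    tail = mags[: max(B.shape[0] - m, 0)]
    return 2.0 * float(np.sum(tail ** 2))

rng = np.random.default_rng(0)
d, m = 20, 5                      # fewer neurons than dimensions
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
B = np.outer(u, u)                # rank-one positive semidefinite target (illustrative)
W = rng.standard_normal((m, d))   # random first-layer weights at initialization

risks = {"RF": rf_risk(B, W), "NT": nt_risk(B, W), "NN": nn_risk(B, m)}
print(risks)
```

For this rank-one target the NN value is zero up to rounding, while the NT and RF values stay strictly positive, with RF at least as large as NT because every RF matrix is also reachable in the NT form used here.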

If this is right

  • Random features training can incur much higher risk than the other two regimes when neurons are fewer than dimensions.
  • Neural tangent training can incur higher risk than full training when neurons are fewer than dimensions.
  • Both neural tangent and full training reach zero risk on the quadratic model when neurons exceed dimensions.
  • The unbounded gap is exhibited already by the quadratic model without requiring more complex data distributions.
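
The zero-risk case for wide networks can be checked with a short sketch. It assumes the activation z ↦ z², a symmetric target matrix B, standard Gaussian inputs (so that the best-offset risk of a quadratic predictor with matrix M is 2‖B − M‖_F²), and, for the NN part, second-layer weights that may take either sign. All variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d, m = 8, 12                      # more neurons than dimensions
A = rng.standard_normal((d, d))
B = (A + A.T) / 2                 # arbitrary symmetric quadratic target
W = rng.standard_normal((m, d))   # hypothetical first-layer weights at initialization

# NT: the unfit part of B lives outside the row space of W,
# which is empty once W has rank d.
Q = np.eye(d) - np.linalg.pinv(W) @ W
nt_risk = 2.0 * np.linalg.norm(Q @ B @ Q, "fro") ** 2

# NN: d neurons with signed second-layer weights reproduce B through its
# eigendecomposition; the remaining m - d neurons can be set to zero.
lam, V = np.linalg.eigh(B)
M = (V * lam) @ V.T               # sum_i lam_i v_i v_i^T
nn_risk = 2.0 * np.linalg.norm(B - M, "fro") ** 2

print(nt_risk, nn_risk)           # both zero up to floating-point rounding
```

If the second-layer weights were instead fixed to be positive, the NN construction above would only cover positive semidefinite B.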

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Analyses that rely on the neural tangent kernel may systematically underestimate the benefit of non-lazy training in width regimes where neurons are fewer than dimensions.
  • The exact gap may be tied to the quadratic choice of activation and target, raising the question of whether similar separations hold for other activations or targets.
  • The result suggests that the advantage of full training appears only below a critical width threshold relative to dimension.

Load-bearing premise

The target is exactly quadratic and the network uses quadratic activations, which permits exact computation of the risk differences.
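
The reason exact computation is possible can be sketched with a standard Gaussian identity. The symbols below ($B$ for the target matrix, $M$ for the matrix realized by the network, $b_0$ and $c$ for offsets) are notation assumed here for illustration. For $x \sim \mathcal{N}(0, I_d)$ and a symmetric matrix $A$, the quadratic form $\langle x, A x\rangle$ has mean $\operatorname{tr} A$ and variance $2\|A\|_F^2$, so:

```latex
\mathbb{E}\Big[\big(\langle x, B x\rangle + b_0 - \langle x, M x\rangle - c\big)^2\Big]
  = 2\,\|B - M\|_F^2 + \big(\operatorname{tr}(B - M) + b_0 - c\big)^2 .
```

Minimizing over the offset $c$ removes the second term, leaving $2\|B - M\|_F^2$, which is a deterministic matrix quantity rather than something that must be estimated from samples.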

What would settle it

A calculation or experiment on the quadratic target with quadratic activations showing that the maximum risk gap across the three regimes remains bounded by a constant independent of all other parameters when neurons are fewer than dimensions.
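
One way such a calculation could look is sketched below. It does not reproduce the paper's theorem. It assumes the activation z ↦ z², standard Gaussian inputs, best-offset risk 2‖B − M‖_F², and an illustrative target family B = uuᵀ + εI that is not taken from the paper. It then tracks the ratio of the NT risk to the best achievable NN risk as ε shrinks. The helper names are hypothetical.

```python
import numpy as np

def nt_risk(B, W):
    """NT risk (assumed form): twice the squared Frobenius norm of the block
    of B orthogonal to the row space of W."""
    Q = np.eye(B.shape[0]) - np.linalg.pinv(W) @ W
    return 2.0 * np.linalg.norm(Q @ B @ Q, "fro") ** 2

def nn_risk(B, m):
    """Best NN risk over rank <= m symmetric matrices (Eckart-Young tail)."""
    mags = np.sort(np.abs(np.linalg.eigvalsh(B)))
    tail = mags[: max(B.shape[0] - m, 0)]
    return 2.0 * float(np.sum(tail ** 2))

rng = np.random.default_rng(0)
d, m = 30, 10                     # fewer neurons than dimensions
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
W = rng.standard_normal((m, d))   # hypothetical random initialization

eps_values = [1e-1, 1e-2, 1e-3]
ratios = []
for eps in eps_values:
    B = np.outer(u, u) + eps * np.eye(d)   # spike plus small isotropic part
    ratios.append(nt_risk(B, W) / nn_risk(B, m))

for eps, r in zip(eps_values, ratios):
    print(f"eps={eps:g}  NT/NN risk ratio={r:.3g}")
```

In this family the NN risk scales like ε² while the NT risk stays bounded away from zero, so the ratio grows as ε decreases. A bounded ratio across all such families would be the kind of evidence the statement above asks for.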

Figures

Figures reproduced from arXiv: 1906.08899 by Andrea Montanari, Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz.

Figure 1. Left frame: prediction (test) error of a two-layer neural network in fitting a quadratic function in […]
Figure 2. Left frame: prediction (test) error of a two-layer neural network in fitting a mixture of Gaussians […]
Original abstract

We study the supervised learning problem under either of the following two models: (1) Feature vectors ${\boldsymbol x}_i$ are $d$-dimensional Gaussians and responses are $y_i = f_*({\boldsymbol x}_i)$ for $f_*$ an unknown quadratic function; (2) Feature vectors ${\boldsymbol x}_i$ are distributed as a mixture of two $d$-dimensional centered Gaussians, and $y_i$'s are the corresponding class labels. We use two-layers neural networks with quadratic activations, and compare three different learning regimes: the random features (RF) regime in which we only train the second-layer weights; the neural tangent (NT) regime in which we train a linearization of the neural network around its initialization; the fully trained neural network (NN) regime in which we train all the weights in the network. We prove that, even for the simple quadratic model of point (1), there is a potentially unbounded gap between the prediction risk achieved in these three training regimes, when the number of neurons is smaller than the ambient dimension. When the number of neurons is larger than the number of dimensions, the problem is significantly easier and both NT and NN learning achieve zero risk.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript compares three regimes for training two-layer networks with quadratic activations: random features (RF, second layer only), neural tangent (NT, linearized around initialization), and full NN (all weights). Under a quadratic regression model with Gaussian features and a two-Gaussian mixture classification model, it proves that when the hidden width m is smaller than the ambient dimension d, the population risks of the three regimes can differ by an arbitrarily large factor; when m > d, both NT and NN achieve zero risk on the quadratic model.

Significance. If the derivations hold, the work supplies an explicit, analytically tractable existence result showing that lazy training can be arbitrarily suboptimal even for quadratic targets and activations. The modeling choices (quadratic everything) are deliberate and sufficient to produce the separation, and the paper supplies closed-form risk expressions together with the width-versus-dimension threshold, which are concrete strengths.

minor comments (3)
  1. [§2.1] §2.1: the population risk is defined via an expectation that is never written explicitly; adding the integral or E[·] notation would remove ambiguity.
  2. [Theorem 3.1] Theorem 3.1: the statement that the gap is 'potentially unbounded' should be accompanied by the explicit scaling (e.g., gap grows as d/m or similar) that is derived in the proof.
  3. [Figure 2] Figure 2 caption: the plotted curves are not labeled with the precise values of d and m used; this makes it hard to verify the m < d regime visually.
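
For the first comment, one natural explicit form of the population risk in the quadratic regression model is shown below. It assumes squared loss and standard Gaussian inputs, and the paper's own definition may differ in normalization.

```latex
R(f) \;=\; \mathbb{E}_{x \sim \mathcal{N}(0, I_d)}\Big[\big(f_*(x) - f(x)\big)^2\Big]
```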

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance.

Circularity Check

0 steps flagged

No circularity; derivation is self-contained from explicit models

full rationale

The paper's central claim is a mathematical existence result: for the explicitly defined quadratic target (model 1) or two-Gaussian mixture (model 2) with quadratic activations, the population risks of RF, NT, and NN regimes can differ by an unbounded factor when hidden width m < d. This follows from direct analysis of the optimization problems and risk expressions under the stated Gaussian assumptions, without any fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations. The modeling choices are declared at the outset and suffice to produce the separation; the derivation does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The claim rests on the two explicit data-generating processes and the choice of quadratic activations; these are modeling assumptions rather than derived quantities.

axioms (2)
  • domain assumption: Feature vectors are d-dimensional Gaussians (model 1) or a two-component centered Gaussian mixture (model 2).
    Stated explicitly as the supervised learning models under study.
  • domain assumption: The network uses quadratic activations.
    The architecture chosen for the RF/NT/NN comparison.


