
arxiv: 2206.00939 · v3 · submitted 2022-06-02 · 📊 stat.ML · cs.LG

Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

Pith reviewed 2026-05-24 11:25 UTC · model grok-4.3

keywords: gradient flow · ReLU networks · implicit bias · orthogonal inputs · mean squared error · shallow networks · convergence · variation norm

The pith

For orthogonal inputs, gradient flow on one-hidden-layer ReLU networks reaches zero loss and selects the minimum variation norm solution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the continuous-time gradient flow of a shallow ReLU network trained on squared loss when the input vectors are pairwise orthogonal. It derives an exact description of the trajectory starting from small random initialization and shows that the flow reaches zero training loss despite the non-convex landscape. The same description reveals an implicit bias that selects, among all zero-loss solutions, the one with smallest variation norm. Additional structure appears in the early phase, where neurons align to the input directions, and in the overall path, which moves from saddle point to saddle point.

Core claim

For orthogonal input vectors, the gradient flow of one-hidden-layer ReLU networks trained on the mean squared error from small initialisation converges to zero loss and is biased towards the minimum variation norm solution. The orthogonality decouples the dynamics across input directions, yielding a closed-form description of the entire trajectory that also captures the initial alignment phenomenon and the saddle-to-saddle progression.

What carries the argument

Closed-form description of the gradient flow obtained by decoupling the evolution across orthogonal input directions.
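
A short calculation illustrates where orthogonality enters. It is a sketch under an assumed parameterisation, not a transcription of the paper's equations: the network is taken to be h_θ(x) = Σ_j a_j σ(⟨w_j, x⟩) with σ the ReLU, the loss is L(θ) = (1/(2n)) Σ_k (h_θ(x_k) − y_k)², and the ReLU derivative at 0 is set to 0. The symbols a_j, w_j and n are notation introduced here for illustration.

```latex
\begin{align*}
\dot w_j
  &= -\frac{1}{n}\sum_{k=1}^{n} \bigl(h_\theta(x_k)-y_k\bigr)\, a_j\,
     \mathbf{1}\{\langle w_j, x_k\rangle > 0\}\, x_k, \\
\frac{\mathrm{d}}{\mathrm{d}t}\langle w_j, x_i\rangle
  &= -\frac{1}{n}\sum_{k=1}^{n} \bigl(h_\theta(x_k)-y_k\bigr)\, a_j\,
     \mathbf{1}\{\langle w_j, x_k\rangle > 0\}\, \langle x_k, x_i\rangle \\
  &= -\frac{\|x_i\|^2}{n}\, \bigl(h_\theta(x_i)-y_i\bigr)\, a_j\,
     \mathbf{1}\{\langle w_j, x_i\rangle > 0\}
  \qquad \text{since } \langle x_k, x_i\rangle = 0 \text{ for } k \neq i .
\end{align*}
```

Under these assumptions the projection of each hidden weight onto x_i is driven only by the residual at x_i, and h_θ(x_i) itself depends on the hidden weights only through those projections. The sketch covers the hidden layer only; how the output weights a_j are handled is left to the paper.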

If this is right

  • The flow converges to a global minimizer of the training loss.
  • Among all networks that fit the data perfectly, the one reached has the smallest variation norm.
  • Neurons rapidly align their activation patterns to the orthogonal input directions at the start of training.
  • The optimization path traverses a sequence of saddle points before reaching the final solution.
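
To make the second bullet concrete, the snippet below computes one common finite-width proxy for the variation norm, Σ_j |a_j| · ||w_j||. This choice is an assumption made here for illustration; the paper's precise definition may differ. The helper names (`variation_proxy`, `half_sq_norm`, `balance`) are hypothetical. The sketch also shows that rescaling a neuron as (a_j, w_j) → (a_j / c, c · w_j) with c > 0 leaves the ReLU network function unchanged, and that half the squared parameter norm equals the proxy once every neuron is balanced (|a_j| = ||w_j||).

```python
import numpy as np

def relu_net(X, W, a):
    """h(x_k) = sum_j a_j * relu(<w_j, x_k>) for every row x_k of X."""
    return np.maximum(X @ W.T, 0.0) @ a

def variation_proxy(W, a):
    """Assumed finite-width proxy: sum_j |a_j| * ||w_j||_2."""
    return float(np.sum(np.abs(a) * np.linalg.norm(W, axis=1)))

def half_sq_norm(W, a):
    """Half the squared Euclidean norm of all parameters."""
    return 0.5 * float(np.sum(a ** 2) + np.sum(W ** 2))

def balance(W, a):
    """Rescale each neuron so that |a_j| = ||w_j||; the function is unchanged.

    Assumes every a_j and every w_j is nonzero.
    """
    norms = np.linalg.norm(W, axis=1)
    c = np.sqrt(np.abs(a) / norms)      # per-neuron positive scale
    return W * c[:, None], a / c

rng = np.random.default_rng(1)
W = rng.standard_normal((5, 3))
a = rng.standard_normal(5)
X = rng.standard_normal((4, 3))

W_bal, a_bal = balance(W, a)
print("proxy before / after balancing:", variation_proxy(W, a), variation_proxy(W_bal, a_bal))
print("half squared norm before / after:", half_sq_norm(W, a), half_sq_norm(W_bal, a_bal))
```

The inequality |a| · ||w|| ≤ (a² + ||w||²) / 2 is why the unbalanced half squared norm is never smaller than the proxy.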

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The minimum-variation-norm bias may explain why the learned function remains stable under small perturbations of the inputs.
  • Similar decoupling arguments could be attempted for inputs that are only approximately orthogonal or lie in low-dimensional subspaces.
  • The saddle-to-saddle structure suggests that the loss landscape contains a chain of critical points whose indices decrease along the flow.

Load-bearing premise

The input vectors are pairwise orthogonal.
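
Whether a given data set satisfies this premise can be checked numerically from its Gram matrix. The helper below is a minimal sketch; the function name `is_pairwise_orthogonal` and the relative tolerance are choices made here, not something the paper specifies. Note that n nonzero pairwise orthogonal vectors in dimension d require n ≤ d.

```python
import numpy as np

def is_pairwise_orthogonal(X, rtol=1e-10):
    """Return True if the rows of X are nonzero and pairwise orthogonal.

    The off-diagonal entries of the Gram matrix X X^T are the inner
    products <x_i, x_j>, i != j; they are compared against rtol times
    the largest squared row norm.
    """
    X = np.asarray(X, dtype=float)
    gram = X @ X.T
    sq_norms = np.diag(gram)
    if np.any(sq_norms <= 0.0):
        return False
    off_diag = gram - np.diag(sq_norms)
    return bool(np.max(np.abs(off_diag)) <= rtol * np.max(sq_norms))

# Example: three orthogonal rows taken from a random orthogonal matrix.
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((5, 5)))
X_orth = 2.0 * Q[:3]
print(is_pairwise_orthogonal(X_orth))
print(is_pairwise_orthogonal([[1.0, 0.0], [1.0, 1.0]]))
```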

What would settle it

Numerical integration of the gradient flow on a small orthogonal data set that either fails to reach zero loss or ends at a solution whose variation norm is not the smallest among all interpolators.
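
A minimal sketch of such a check follows. It is not the authors' experiment: it replaces the continuous-time flow by plain gradient descent with a small step size (an explicit Euler discretisation), and the data, width `m`, initial scale, step size, number of steps and the balanced initialisation a_j = ±||w_j|| are all illustrative assumptions. The printed "variation-norm proxy" Σ_j |a_j| · ||w_j|| is likewise an assumed stand-in; comparing it against the smallest value over all interpolators would need a separate computation that is not attempted here.

```python
import numpy as np

def train(X, y, m=40, init_scale=1e-3, lr=0.05, steps=10_000, seed=0):
    """Gradient descent on L = (1/2n) sum_k (h(x_k) - y_k)^2 for
    h(x) = sum_j a_j * relu(<w_j, x>), started from small initialisation."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = init_scale * rng.standard_normal((m, d))        # hidden weights, small
    signs = np.where(np.arange(m) % 2 == 0, 1.0, -1.0)  # alternate output signs
    a = signs * np.linalg.norm(W, axis=1)               # balanced: |a_j| = ||w_j||
    losses = np.empty(steps)
    for t in range(steps):
        pre = X @ W.T                     # (n, m) pre-activations
        act = np.maximum(pre, 0.0)        # ReLU
        r = act @ a - y                   # residuals h(x_k) - y_k
        losses[t] = 0.5 * np.mean(r ** 2)
        # Gradients; the ReLU derivative at 0 is taken to be 0.
        grad_a = act.T @ r / n
        grad_W = ((pre > 0) * r[:, None]).T @ X * a[:, None] / n
        a = a - lr * grad_a
        W = W - lr * grad_W
    return W, a, losses

X = np.eye(3)                          # three pairwise orthogonal inputs
y = np.array([1.0, -0.5, 0.8])         # labels of both signs
W, a, losses = train(X, y)

print(f"initial loss {losses[0]:.3e}, final loss {losses[-1]:.3e}")
print("variation-norm proxy:", float(np.sum(np.abs(a) * np.linalg.norm(W, axis=1))))
```

Plotting `losses` on a logarithmic scale is one way to look for the plateaus that the saddle-to-saddle description predicts.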

Figures

Figures reproduced from arXiv: 2206.00939 by Etienne Boursier, Loucas Pillaud-Vivien, Nicolas Flammarion.

Figure 1. Timeline of the training dynamics. Neuron alignment phase. During the first phase, all the neurons remain small in norm, while moving tangentially (i.e. in directions). The neurons align according to several key directions: an initial clustering of neurons' directions happens in this early phase, as observed by Maennel et al. [2018]. As the neurons have small norm, h_{θ^t} ≈ 0 for this phase and Equation (9) …
Figure 2. State of training at different stages. The green dots correspond to the data, while the green line is the estimated function h_θ. Each blue star represents a neuron w_j: its x-axis value is given by −w_{j,2}/w_{j,1}, which coincides with the position of the kink of its associated ReLU; its y-axis value is given by s_j ||w_j||, which we recall is the associated value of the output layer. a truly non-convex landscape an…
Figure 3. Figure 3 shows the evolution of the loss during training. The saddle to saddle dynamics is well observed here: the parameters vector starts from the 0 saddle point at initialisation and needs 5000 iterations to leave this first saddle. A second saddle is then encountered at the end of the second phase and the trajectory only leaves this saddle around iteration 11000, once the norm of the neurons in S_{−,1} start bei…
Figure 4. State of training at different stages and loss profile. Panel titles recovered from the extraction: (a) Initialisation (Iteration 0); (b) End of training (Iteration 5000). Axis labels: x, h_θ(x), −w_{j,2}/w_{j,1}, s_j ||w_j||.
Figure 5. Training dynamics for large initialisation.
Figure 6. State of training at different stages. Each red (resp. purple) star represents a single neuron with s_j = −1 (resp. s_j = 1): it shows (in polar coordinates) the projection of the hidden layer weight onto the 2 dimensional space spanned by the two principal components of the hidden layer weights at the final state of training. The inner circle corresponds to 0 norm vectors, whose direction is given by the an…
Figure 7. Additional information on the high dimensional experiment. parameters when projected onto the 2 dimensional space of the two principal components of the 200 × 150 matrix associated to the hidden layer of the network. The training loss profile is given by Figure 7a. Figure 7b finally shows the explained variance ratio of the principal components of the PCA used in …
Original abstract

The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that, for pairwise orthogonal input vectors, the gradient flow of one-hidden-layer ReLU networks trained on the square loss from small initialization converges to zero loss and exhibits an implicit bias towards the minimum variation norm interpolator. It further provides a quantitative description of the initial alignment phase and establishes that the dynamics follow a specific saddle-to-saddle trajectory.

Significance. This result offers a rare closed-form characterization of the training dynamics in a non-convex setting, made possible by the orthogonality assumption that decouples the problem. The explicit convergence proof and bias characterization are significant contributions to the theory of implicit bias in neural networks. The work delivers parameter-free derivations of the flow and falsifiable predictions for the orthogonal case.

major comments (2)
  1. [§3] §3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when inputs are pairwise orthogonal; the derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, as this step is load-bearing for the entire analysis.
  2. [§5] §5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization; the precise condition on the initial scale relative to the data norms must be stated quantitatively, otherwise it is unclear for which initialization scales the convergence claim holds.
minor comments (2)
  1. [Abstract] Abstract: the term 'minimum variation norm' appears without a one-sentence definition or pointer to its precise mathematical expression; a brief clarification would aid readers.
  2. [Introduction] Notation: the distinction between the continuous-time gradient flow ODE and any discrete gradient descent implementation is not always explicit in early sections; consistent use of 'flow' versus 'descent' would prevent confusion.
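
For reference, the distinction raised in the last minor comment can be written out as follows. This is standard notation, not taken from the paper; the step size η and the iteration index k are symbols introduced here.

```latex
\begin{align*}
\text{gradient flow (continuous time):} \quad
  & \frac{\mathrm{d}\theta^{t}}{\mathrm{d}t} = -\nabla L(\theta^{t}), \\
\text{gradient descent (discrete time):} \quad
  & \theta_{k+1} = \theta_{k} - \eta\, \nabla L(\theta_{k}), \qquad \eta > 0 .
\end{align*}
```

Gradient descent is the explicit Euler discretisation of the flow with step η. Because the ReLU is not differentiable at 0, the gradient in the flow has to be understood in a generalised (subgradient) sense at such points.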

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive comments on the manuscript. The suggestions will improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [§3] §3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when inputs are pairwise orthogonal; the derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, as this step is load-bearing for the entire analysis.

    Authors: We agree that an explicit verification of the decoupling is necessary for the closed-form analysis. In the revised manuscript we will add a dedicated lemma (or expanded calculation) in Section 3 that starts from the gradient-flow ODE, substitutes the orthogonality condition x_i · x_j = 0 for i ≠ j, and shows term-by-term that all cross-derivative contributions vanish, thereby confirming that each input direction evolves independently. revision: yes

  2. Referee: [§5] §5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization; the precise condition on the initial scale relative to the data norms must be stated quantitatively, otherwise it is unclear for which initialization scales the convergence claim holds.

    Authors: We thank the referee for this observation. The convergence statement does rely on the initialization being sufficiently small relative to the data norms. In the revision we will state the required quantitative bound explicitly (e.g., that the initial scale ε satisfies ε < c / max_i ||x_i|| for a positive constant c depending only on the problem parameters) and indicate where this bound enters the saddle-to-saddle argument. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The paper derives gradient-flow dynamics for shallow ReLU networks under the explicit, load-bearing assumption of pairwise orthogonal inputs. This assumption is invoked once to decouple the dynamics across directions and obtain a closed-form characterization; the subsequent convergence-to-zero-loss and implicit-bias statements are then proved directly from the resulting ODEs. No parameter is fitted to data and then relabeled a prediction, no self-citation supplies a uniqueness theorem that forces the result, and the derivation does not redefine any target quantity in terms of itself. The central claims are therefore not presupposed by the premises from which they are derived.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the domain assumption of orthogonal inputs to obtain independent evolution of directions; no free parameters or invented entities are indicated in the abstract.

axioms (1)
  • domain assumption: Input vectors are pairwise orthogonal.
    Invoked to decouple the gradient flow across input directions and enable closed-form tracking of neuron alignments and loss decrease.




Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

  1. Emmanuel Abbe, Elisabetta Cornacchia, Jan Hazla, and Christopher Marquis. An initial alignment between neural network and target is needed for gradient descent to learn. In International Conference on Machine Learning, pages 33–52. PMLR, 2022.
  2. Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32, 2019.
  3. Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322–332. PMLR, 2019.
  4. Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629–681, 2017.
  5. David Bertoin, Jérôme Bolte, Sébastien Gerchinovitz, and Edouard Pauwels. Numerical influence of ReLU'(0) on backpropagation. Advances in Neural Information Processing Systems, 34, 2021.
  6. Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019.
  7. Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205–1223, 2007.
  8. Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319–3363, 2010.
  9. Sourav Chatterjee. Convergence of gradient descent for deep neural networks. arXiv preprint arXiv:2203.16462, 2022.
  10. Zhengdao Chen, Eric Vanden-Eijnden, and Joan Bruna. On feature learning in neural networks with global convergence guarantees. arXiv preprint arXiv:2204.10782, 2022.
  11. Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.
  12. Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305–1338. PMLR, 2020.
  13. Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
  14. Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
  15. Thomas Debarre, Quentin Denoyelle, Michael Unser, and Julien Fageot. Sparsest piecewise-linear regression of one-dimensional data. Journal of Computational and Applied Mathematics, 406:114044, 2022.
  16. Simon Eberle, Arnulf Jentzen, Adrian Riekert, and Georg S Weiss. Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation. arXiv preprint arXiv:2108.08106, 2021.
  17. Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
  18. Arthur Jacot, François Ged, Franck Gabriel, Berfin Şimşek, and Clément Hongler. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021.
  19. Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019a.
  20. Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772–1798. PMLR, 2019b.
  21. Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33:17176–17186, 2020.
  22. Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32, 2019.
  23. Vera Kurková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659–2665, 2001.
  24. Qianxiao Li, Cheng Tai, and E Weinan. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. The Journal of Machine Learning Research, 20(1):1474–1520, 2019.
  25. Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2020.
  26. Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.
  27. Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
  28. Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, and Sanjeev Arora. Gradient descent on two-layer nets: Margin maximization and simplicity bias. Advances in Neural Information Processing Systems, 34, 2021.
  29. Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
  30. Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665–E7671, 2018.
  31. Hancheng Min, Salma Tarmoun, Rene Vidal, and Enrique Mallada. On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7760–7768. PMLR, 18–24 Jul 2021.
  32. Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
  33. Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. arXiv preprint arXiv:1910.01635, 2019.
  34. Rahul Parhi and Robert D Nowak. What kinds of functions do deep neural networks learn? Insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464–489, 2022.
  35. Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34, 2021.
  36. Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, and Matthieu Wyart. Learning sparse features can lead to overfitting in neural networks. arXiv preprint arXiv:2206.12314, 2022.
  37. Mary Phuong and Christoph H Lampert. The inductive bias of ReLU networks on orthogonally separable data. In International Conference on Learning Representations, 2020.
  38. Grant Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889–1935, 2022. doi:10.1002/cpa.22074.
  39. Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow ReLU neural networks. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3889–3934. PMLR, 15–19 Aug 2021.
  40. Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667–2690. PMLR, 2019.
  41. Alexander Shevchenko, Vyacheslav Kungurtsev, and Marco Mondelli. Mean-field analysis of piecewise linear solutions for wide ReLU networks. arXiv preprint arXiv:2111.02278, 2021.
  42. Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725–752, 2020.
  43. Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.
  44. Gal Vardi and Ohad Shamir. Implicit regularization in ReLU networks with the square loss. In Conference on Learning Theory, pages 4224–4258. PMLR, 2021.
  45. Yifei Wang and Mert Pilanci. The convex geometry of backpropagation: Neural network gradient flows converge to extreme points of the dual convex program. arXiv preprint arXiv:2110.06488, 2021.
  46. Stephan Wojtowytsch. On the convergence of gradient descent training for two-layer ReLU-networks in the mean field regime. arXiv preprint arXiv:2005.13530, 2020.
  47. Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635–3673. PMLR, 2020.
  48. Greg Yang and Edward J Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727–11737. PMLR, 2021.
  49. Chulhee Yun, Shankar Krishnan, and Hossein Mobahi. A unifying view on implicit bias in training linear neural networks. In International Conference on Learning Representations, 2021.
  50. Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107–115, 2021.
  51. Mo Zhou, Rong Ge, and Chi Jin. A local convergence theory for mildly over-parameterized two-layer neural network. In Conference on Learning Theory, pages 4577–4632. PMLR, 2021.