Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs

Etienne Boursier; Loucas Pillaud-Vivien; Nicolas Flammarion

arxiv: 2206.00939 · v3 · submitted 2022-06-02 · 📊 stat.ML · cs.LG


Pith reviewed 2026-05-24 11:25 UTC · model grok-4.3


keywords: gradient flow · ReLU networks · implicit bias · orthogonal inputs · mean squared error · shallow networks · convergence · variation norm


The pith

For orthogonal inputs, gradient flow on one-hidden-layer ReLU networks reaches zero loss and selects the minimum variation norm solution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies the continuous-time gradient flow of a shallow ReLU network trained on squared loss when the input vectors are pairwise orthogonal. It derives an exact description of the trajectory starting from small random initialization and shows that the flow reaches zero training loss despite the non-convex landscape. The same description reveals an implicit bias that selects, among all zero-loss solutions, the one with smallest variation norm. Additional structure appears in the early phase, where neurons align to the input directions, and in the overall path, which moves from saddle point to saddle point.

Core claim

For orthogonal input vectors, the gradient flow of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation converges to zero loss and is biased towards the minimum variation norm solution. The orthogonality decouples the dynamics across input directions, yielding a closed-form description of the entire trajectory that also captures the initial alignment phenomenon and the saddle-to-saddle progression.

What carries the argument

Closed-form description of the gradient flow obtained by decoupling the evolution across orthogonal input directions.
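To make the decoupling concrete, here is a short worked derivation. It is a sketch under an assumed parametrisation that this summary does not spell out: a network $h_\theta(x) = \sum_j a_j \sigma(\langle w_j, x\rangle)$ with $\sigma(z) = \max(z, 0)$, trained on $L(\theta) = \frac{1}{2n}\sum_k (h_\theta(x_k) - y_k)^2$. The paper's exact conventions may differ.

```latex
% Assumed setup (illustrative, not quoted from the paper):
%   h_\theta(x) = \sum_{j=1}^m a_j \sigma(\langle w_j, x \rangle), \quad \sigma(z) = \max(z, 0),
%   L(\theta)   = \frac{1}{2n} \sum_{k=1}^n \big(h_\theta(x_k) - y_k\big)^2.
\begin{align*}
\dot w_j &= -\frac{a_j}{n} \sum_{k=1}^n \big(h_\theta(x_k) - y_k\big)\,
            \mathbf{1}\{\langle w_j, x_k \rangle > 0\}\, x_k, \\
\dot a_j &= -\frac{1}{n} \sum_{k=1}^n \big(h_\theta(x_k) - y_k\big)\,
            \sigma(\langle w_j, x_k \rangle), \\
\frac{d}{dt} \langle w_j, x_i \rangle
         &= -\frac{a_j}{n} \big(h_\theta(x_i) - y_i\big)\,
            \mathbf{1}\{\langle w_j, x_i \rangle > 0\}\, \|x_i\|^2
            \qquad \text{since } \langle x_k, x_i \rangle = 0 \text{ for } k \neq i, \\
\frac{d}{dt} \big(a_j^2 - \|w_j\|^2\big)
         &= 0 \qquad \text{because } \mathbf{1}\{z > 0\}\, z = \sigma(z).
\end{align*}
```

In this parametrisation the component of $w_j$ along $x_i$ is driven only by the residual at $x_i$, and $h_\theta(x_i)$ itself depends on the hidden weights only through their components along $x_i$. The directions still interact through the shared output weight $a_j$, whose size is tied to $\|w_j\|$ by the conserved quantity in the last line.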

If this is right

The flow converges to a global minimizer of the training loss.
Among all networks that fit the data perfectly, the one reached has the smallest variation norm.
Neurons rapidly align their activation patterns to the orthogonal input directions at the start of training.
The optimization path traverses a sequence of saddle points before reaching the final solution.
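The variation norm in the second consequence above is not defined in this summary. One standard finite-width form is sketched below as an assumption; the paper's exact convention (for instance its treatment of bias terms, or an infimum over signed measures rather than finite sums) may differ.

```latex
% One common finite-width form of the variation norm (assumed here for illustration).
\[
\|f\|_{\mathrm{var}}
  = \inf \Big\{ \sum_{j=1}^m |a_j|\, \|w_j\|_2 \;:\;
      m \in \mathbb{N},\;
      f(x) = \sum_{j=1}^m a_j\, \sigma(\langle w_j, x \rangle) \text{ for all } x \Big\}
\]
% A minimum variation norm interpolator then minimises \|f\|_{\mathrm{var}}
% subject to f(x_k) = y_k for k = 1, \dots, n.
```

Under this reading, the quantity $\sum_j |a_j|\,\|w_j\|_2$ of any particular network is an upper bound on the variation norm of the function it computes, since the norm is an infimum over all representations.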

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The minimum-variation-norm bias may explain why the learned function remains stable under small perturbations of the inputs.
Similar decoupling arguments could be attempted for inputs that are only approximately orthogonal or lie in low-dimensional subspaces.
The saddle-to-saddle structure suggests that the loss landscape contains a chain of critical points whose indices decrease along the flow.

Load-bearing premise

The input vectors are pairwise orthogonal.

What would settle it

Numerical integration of the gradient flow on a small orthogonal data set that either fails to reach zero loss or ends at a solution whose variation norm is not the smallest among all interpolators.
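A minimal numerical sketch of such a check follows. Everything in it is an illustrative assumption rather than the paper's experimental setup: the function name `train_orthogonal_relu`, the parametrisation h(x) = Σ_j a_j·relu(⟨w_j, x⟩), the labels, the width, the step size, and an initialisation in which every neuron starts active on every input. Plain gradient descent with a small step (explicit Euler) stands in for the exact gradient flow.

```python
import numpy as np


def train_orthogonal_relu(y=(1.0, -0.5, 0.75), d=5, m=20,
                          scale=1e-3, lr=0.02, steps=20000, seed=0):
    """Gradient descent (explicit Euler on the gradient flow) for
    h(x) = sum_j a_j * relu(<w_j, x>) with loss (1/2n) * sum_k (h(x_k) - y_k)^2."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    n = y.size

    # Orthonormal inputs: the rows of X are n orthonormal vectors of R^d (needs n <= d).
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    X = Q[:, :n].T

    # Small initialisation. Assumption for this sketch only: every neuron starts
    # active on every input (<w_j, x_k> = C[j, k] > 0) and signs alternate, so each
    # label has neurons of the matching sign available.
    C = scale * (0.5 + rng.random((m, n)))
    W = C @ X                                  # hidden weights, shape (m, d)
    signs = np.where(np.arange(m) % 2 == 0, 1.0, -1.0)
    a = signs * np.linalg.norm(W, axis=1)      # balanced: |a_j| = ||w_j||

    losses = []
    for _ in range(steps):
        pre = X @ W.T                          # pre[k, j] = <w_j, x_k>
        act = np.maximum(pre, 0.0)
        resid = act @ a - y                    # h(x_k) - y_k
        losses.append(0.5 * np.mean(resid ** 2))
        active = (pre > 0).astype(float)       # convention: ReLU'(0) = 0
        grad_a = act.T @ resid / n
        grad_W = a[:, None] * ((active * resid[:, None]).T @ X) / n
        a -= lr * grad_a
        W -= lr * grad_W

    resid = np.maximum(X @ W.T, 0.0) @ a - y
    return {
        "X": X,
        "losses": np.array(losses),
        "final_loss": 0.5 * np.mean(resid ** 2),
        # sum_j |a_j| * ||w_j||: an upper bound on the variation norm of the learned function.
        "path_norm": float(np.sum(np.abs(a) * np.linalg.norm(W, axis=1))),
    }


if __name__ == "__main__":
    out = train_orthogonal_relu()
    print("initial loss:", out["losses"][0])
    print("final loss:  ", out["final_loss"])
    print("sum_j |a_j| * ||w_j||:", out["path_norm"])
```

To use this as a falsification test, one would compare `final_loss` against zero and compare `path_norm` against the same quantity for other networks that interpolate the data. Plotting `losses` on a log scale is also a simple way to look for the plateaus associated with saddle points.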

Figures

Figures reproduced from arXiv: 2206.00939 by Etienne Boursier, Loucas Pillaud-Vivien, Nicolas Flammarion.

**Figure 1.** Timeline of the training dynamics. Neuron alignment phase. During the first phase, all the neurons remain small in norm, while moving tangentially (i.e. in directions). The neurons align according to several key directions: an initial clustering of neurons' directions happens in this early phase, as observed by Maennel et al. [2018]. As the neurons have small norm, h_{θ^t} ≈ 0 for this phase and Equation (9) …

**Figure 2.** State of training at different stages. The green dots correspond to the data, while the green line is the estimated function h_θ. Each blue star represents a neuron w_j: its x-axis value is given by −w_{j,2}/w_{j,1}, which coincides with the position of the kink of its associated ReLU; its y-axis value is given by s_j‖w_j‖, which we recall is the associated value of the output layer. …

**Figure 3.** Figure 3 shows the evolution of the loss during training. The saddle to saddle dynamics is well observed here: the parameters vector starts from the 0 saddle point at initialisation and needs 5000 iterations to leave this first saddle. A second saddle is then encountered at the end of the second phase and the trajectory only leaves this saddle around iteration 11000, once the norm of the neurons in S_{−,1} start bei…

**Figure 4.** State of training at different stages and loss profile. Panels: (a) Initialisation (Iteration 0); (b) End of training (Iteration 5000). Axis labels: x and −w_{j,2}/w_{j,1} (horizontal), h_θ(x) and s_j‖w_j‖ (vertical). [PITH_FULL_IMAGE:figures/full_fig_p015_4.png]

**Figure 5.** Training dynamics for large initialisation. [PITH_FULL_IMAGE:figures/full_fig_p015_5.png]

**Figure 6.** State of training at different stages. Each red (resp. purple) star represents a single neuron with s_j = −1 (resp. s_j = 1): it shows (in polar coordinates) the projection of the hidden layer weight onto the 2 dimensional space spanned by the two principal components of the hidden layer weights at the final state of training. The inner circle corresponds to 0 norm vectors, whose direction is given by the an…

**Figure 7.** Additional information on the high dimensional experiment. … parameters when projected onto the 2 dimensional space of the two principal components of the 200 × 150 matrix associated to the hidden layer of the network. The training loss profile is given by Figure 7a. Figure 7b finally shows the explained variance ratio of the principal components of the PCA used in … [PITH_FULL_IMAGE:figures/full_fig_p017_7.png]

Original abstract

The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives a clean closed-form description of gradient flow for shallow ReLU nets under orthogonal inputs, proving convergence to zero loss and min-variation-norm bias, but the assumption keeps the result narrow.


The core result is that, when inputs are pairwise orthogonal, gradient flow on a one-hidden-layer ReLU network with small initialization and square loss reaches zero training loss and aligns with the minimum variation norm solution. They also track the initial alignment phase and show the trajectory passes through a sequence of saddles. That level of explicit dynamics is new for this architecture and loss; prior work mostly stopped at high-level bias statements. The orthogonality assumption lets them decouple the coordinates and write the flow in closed form, which is the main technical move and appears to hold up inside the stated setting. The proofs are presented as complete for the continuous case.

The obvious limitation is the input assumption itself. Real data vectors are rarely orthogonal, so the decoupling does not carry over and the result stays a special-case analysis. The paper also stays with continuous flow; the usual gap to discrete gradient descent with finite step size is left open. No circularity or fitting issues appear.

This is useful reading for anyone tracking exact implicit-bias calculations in simplified non-convex models. It is narrow enough that most practitioners will not cite it directly, but the explicit saddle-to-saddle description is worth having on record. A serious editor should send it to referees rather than desk-reject; the claims are delimited and the derivation is the kind of concrete progress the area needs.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that, for pairwise orthogonal input vectors, the gradient flow of one-hidden-layer ReLU networks trained on the square loss from small initialization converges to zero loss and exhibits an implicit bias towards the minimum variation norm interpolator. It further provides a quantitative description of the initial alignment phase and establishes that the dynamics follow a specific saddle-to-saddle trajectory.

Significance. This result offers a rare closed-form characterization of the training dynamics in a non-convex setting, made possible by the orthogonality assumption that decouples the problem. The explicit convergence proof and bias characterization are significant contributions to the theory of implicit bias in neural networks. The work delivers parameter-free derivations of the flow and falsifiable predictions for the orthogonal case.

major comments (2)

§3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when inputs are pairwise orthogonal; the derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, as this step is load-bearing for the entire analysis.
§5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization; the precise condition on the initial scale relative to the data norms must be stated quantitatively, otherwise it is unclear how small the initialization must be for the convergence claim to hold.

minor comments (2)

Abstract: the term 'minimum variation norm' appears without a one-sentence definition or pointer to its precise mathematical expression; a brief clarification would aid readers.
Introduction (notation): the distinction between the continuous-time gradient flow ODE and any discrete gradient descent implementation is not always explicit in early sections; consistent use of 'flow' versus 'descent' would prevent confusion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and the constructive comments on the manuscript. The suggestions will improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript accordingly.


Referee: §3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when inputs are pairwise orthogonal; the derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, as this step is load-bearing for the entire analysis.

Authors: We agree that an explicit verification of the decoupling is necessary for the closed-form analysis. In the revised manuscript we will add a dedicated lemma (or expanded calculation) in Section 3 that starts from the gradient-flow ODE, substitutes the orthogonality condition x_i · x_j = 0 for i ≠ j, and shows term-by-term that all cross-derivative contributions vanish, thereby confirming that each input direction evolves independently.
Revision: yes

Referee: §5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization; the precise condition on the initial scale relative to the data norms must be stated quantitatively, otherwise it is unclear how small the initialization must be for the convergence claim to hold.

Authors: We thank the referee for this observation. The convergence statement does rely on the initialization being sufficiently small relative to the data norms. In the revision we will state the required quantitative bound explicitly (e.g., that the initial scale ε satisfies ε < c / max_i ||x_i|| for a positive constant c depending only on the problem parameters) and indicate where this bound enters the saddle-to-saddle argument.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected


The paper derives gradient-flow dynamics for shallow ReLU networks under the explicit, load-bearing assumption of pairwise orthogonal inputs. This assumption is invoked once to decouple the dynamics across directions and obtain a closed-form characterization; the subsequent convergence-to-zero-loss and implicit-bias statements are then proved directly from the resulting ODEs. No parameter is fitted to data and then relabeled a prediction, no self-citation supplies a uniqueness theorem that forces the result, and the derivation does not redefine any target quantity in terms of itself. The central claims therefore remain independent of the inputs they are derived from.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The analysis rests on the domain assumption of orthogonal inputs to obtain independent evolution of directions; no free parameters or invented entities are indicated in the abstract.

axioms (1)

Domain assumption: input vectors are pairwise orthogonal.
Invoked to decouple the gradient flow across input directions and enable closed-form tracking of neuron alignments and loss decrease.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: "characterise its implicit bias towards minimum variation norm"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 2 internal anchors

[1] Emmanuel Abbe, Elisabetta Cornacchia, Jan Hazla, and Christopher Marquis. An initial alignment between neural network and target is needed for gradient descent to learn. In International Conference on Machine Learning, pages 33-52. PMLR, 2022.
[2] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. Advances in Neural Information Processing Systems, 32, 2019.
[3] Sanjeev Arora, Simon Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. In International Conference on Machine Learning, pages 322-332. PMLR, 2019.
[4] Francis Bach. Breaking the curse of dimensionality with convex neural networks. The Journal of Machine Learning Research, 18(1):629-681, 2017.
[5] David Bertoin, Jérôme Bolte, Sébastien Gerchinovitz, and Edouard Pauwels. Numerical influence of ReLU'(0) on backpropagation. Advances in Neural Information Processing Systems, 34, 2021.
[6] Alberto Bietti and Julien Mairal. On the inductive bias of neural tangent kernels. Advances in Neural Information Processing Systems, 32, 2019.
[7] Jérôme Bolte, Aris Daniilidis, and Adrian Lewis. The Łojasiewicz inequality for nonsmooth subanalytic functions with applications to subgradient dynamical systems. SIAM Journal on Optimization, 17(4):1205-1223, 2007.
[8] Jérôme Bolte, Aris Daniilidis, Olivier Ley, and Laurent Mazet. Characterizations of Łojasiewicz inequalities: subgradient flows, talweg, convexity. Transactions of the American Mathematical Society, 362(6):3319-3363, 2010.
[9] Sourav Chatterjee. Convergence of gradient descent for deep neural networks. arXiv preprint arXiv:2203.16462, 2022.
[10] Zhengdao Chen, Eric Vanden-Eijnden, and Joan Bruna. On feature learning in neural networks with global convergence guarantees. arXiv preprint arXiv:2204.10782, 2022.
[11] Lenaic Chizat and Francis Bach. On the global convergence of gradient descent for over-parameterized models using optimal transport. Advances in Neural Information Processing Systems, 31, 2018.
[12] Lenaic Chizat and Francis Bach. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pages 1305-1338. PMLR, 2020.
[13] Lenaic Chizat, Edouard Oyallon, and Francis Bach. On lazy training in differentiable programming. Advances in Neural Information Processing Systems, 32, 2019.
[14] Yann N Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. Advances in Neural Information Processing Systems, 27, 2014.
[15] Thomas Debarre, Quentin Denoyelle, Michael Unser, and Julien Fageot. Sparsest piecewise-linear regression of one-dimensional data. Journal of Computational and Applied Mathematics, 406:114044, 2022.
[16] Simon Eberle, Arnulf Jentzen, Adrian Riekert, and Georg S Weiss. Existence, uniqueness, and convergence rates for gradient flows in the training of artificial neural networks with ReLU activation. arXiv preprint arXiv:2108.08106, 2021.
[17] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence and generalization in neural networks. Advances in Neural Information Processing Systems, 31, 2018.
[18] Arthur Jacot, François Ged, Franck Gabriel, Berfin Şimşek, and Clément Hongler. Saddle-to-saddle dynamics in deep linear networks: Small initialization training, symmetry, and sparsity. arXiv preprint arXiv:2106.15933, 2021.
[19] Ziwei Ji and Matus Telgarsky. Gradient descent aligns the layers of deep linear networks. In International Conference on Learning Representations, 2019a.
[20] Ziwei Ji and Matus Telgarsky. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pages 1772-1798. PMLR, 2019b.
[21] Ziwei Ji and Matus Telgarsky. Directional convergence and alignment in deep learning. Advances in Neural Information Processing Systems, 33:17176-17186, 2020.
[22] Dimitris Kalimeris, Gal Kaplun, Preetum Nakkiran, Benjamin Edelman, Tristan Yang, Boaz Barak, and Haofeng Zhang. SGD on neural networks learns functions of increasing complexity. Advances in Neural Information Processing Systems, 32, 2019.
[23] Vera Kurková and Marcello Sanguineti. Bounds on rates of variable-basis and neural-network approximation. IEEE Transactions on Information Theory, 47(6):2659-2665, 2001.
[24] Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms I: Mathematical foundations. The Journal of Machine Learning Research, 20(1):1474-1520, 2019.
[25] Zhiyuan Li, Yuping Luo, and Kaifeng Lyu. Towards resolving the implicit bias of gradient descent for matrix factorization: Greedy low-rank learning. In International Conference on Learning Representations, 2020.
[26] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. Loss landscapes and optimization in over-parameterized non-linear systems and neural networks. Applied and Computational Harmonic Analysis, 2022.
[27] Kaifeng Lyu and Jian Li. Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations, 2019.
[28] Kaifeng Lyu, Zhiyuan Li, Runzhe Wang, and Sanjeev Arora. Gradient descent on two-layer nets: Margin maximization and simplicity bias. Advances in Neural Information Processing Systems, 34, 2021.
[29] Hartmut Maennel, Olivier Bousquet, and Sylvain Gelly. Gradient descent quantizes ReLU network features. arXiv preprint arXiv:1803.08367, 2018.
[30] Song Mei, Andrea Montanari, and Phan-Minh Nguyen. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 115(33):E7665-E7671, 2018.
[31] Hancheng Min, Salma Tarmoun, Rene Vidal, and Enrique Mallada. On the explicit role of initialization on the convergence and implicit bias of overparametrized linear networks. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7760-7768. PMLR, 18-24 Jul 2021.
[32] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. arXiv preprint arXiv:1412.6614, 2014.
[33] Greg Ongie, Rebecca Willett, Daniel Soudry, and Nathan Srebro. A function space view of bounded norm infinite width ReLU nets: The multivariate case. arXiv preprint arXiv:1910.01635, 2019.
[34] Rahul Parhi and Robert D Nowak. What kinds of functions do deep neural networks learn? Insights from variational spline theory. SIAM Journal on Mathematics of Data Science, 4(2):464-489, 2022.
[35] Scott Pesme, Loucas Pillaud-Vivien, and Nicolas Flammarion. Implicit bias of SGD for diagonal linear networks: A provable benefit of stochasticity. Advances in Neural Information Processing Systems, 34, 2021.
[36] Leonardo Petrini, Francesco Cagnetta, Eric Vanden-Eijnden, and Matthieu Wyart. Learning sparse features can lead to overfitting in neural networks. arXiv preprint arXiv:2206.12314, 2022.
[37] Mary Phuong and Christoph H Lampert. The inductive bias of ReLU networks on orthogonally separable data. In International Conference on Learning Representations, 2020.
[38] Grant Rotskoff and Eric Vanden-Eijnden. Trainability and accuracy of artificial neural networks: An interacting particle system approach. Communications on Pure and Applied Mathematics, 75(9):1889-1935, 2022. doi:10.1002/cpa.22074.
[39] Itay M Safran, Gilad Yehudai, and Ohad Shamir. The effects of mild over-parameterization on the optimization landscape of shallow ReLU neural networks. In Proceedings of Thirty Fourth Conference on Learning Theory, volume 134 of Proceedings of Machine Learning Research, pages 3889-3934. PMLR, 15-19 Aug 2021.
[40] Pedro Savarese, Itay Evron, Daniel Soudry, and Nathan Srebro. How do infinite width bounded norm networks look in function space? In Conference on Learning Theory, pages 2667-2690. PMLR, 2019.
[41] Alexander Shevchenko, Vyacheslav Kungurtsev, and Marco Mondelli. Mean-field analysis of piecewise linear solutions for wide ReLU networks. arXiv preprint arXiv:2111.02278, 2021.
[42] Justin Sirignano and Konstantinos Spiliopoulos. Mean field analysis of neural networks: A law of large numbers. SIAM Journal on Applied Mathematics, 80(2):725-752, 2020.
[43] Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822-2878, 2018.
[44] Gal Vardi and Ohad Shamir. Implicit regularization in ReLU networks with the square loss. In Conference on Learning Theory, pages 4224-4258. PMLR, 2021.
[45] Yifei Wang and Mert Pilanci. The convex geometry of backpropagation: Neural network gradient flows converge to extreme points of the dual convex program. arXiv preprint arXiv:2110.06488, 2021.
[46] Stephan Wojtowytsch. On the convergence of gradient descent training for two-layer ReLU-networks in the mean field regime. arXiv preprint arXiv:2005.13530, 2020.
[47] Blake Woodworth, Suriya Gunasekar, Jason D Lee, Edward Moroshko, Pedro Savarese, Itay Golan, Daniel Soudry, and Nathan Srebro. Kernel and rich regimes in overparametrized models. In Conference on Learning Theory, pages 3635-3673. PMLR, 2020.
[48] Greg Yang and Edward J Hu. Tensor programs IV: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pages 11727-11737. PMLR, 2021.
[49] Chulhee Yun, Shankar Krishnan, and Hossein Mobahi. A unifying view on implicit bias in training linear neural networks. In International Conference on Learning Representations, 2021.
[50] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3):107-115, 2021.
[51] Mo Zhou, Rong Ge, and Chi Jin. A local convergence theory for mildly over-parameterized two-layer neural network. In Conference on Learning Theory, pages 4577-4632. PMLR, 2021.

