Gradient flow dynamics of shallow ReLU networks for square loss and orthogonal inputs
Pith reviewed 2026-05-24 11:25 UTC · model grok-4.3
The pith
For orthogonal inputs, gradient flow on one-hidden-layer ReLU networks reaches zero loss and selects the minimum variation norm solution.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For orthogonal input vectors, the gradient flow of training one-hidden-layer ReLU neural networks for the mean squared error at small initialisation converges to zero loss and is biased towards the minimum variation norm solution. The orthogonality decouples the dynamics across input directions, yielding a closed-form description of the entire trajectory that also captures the initial alignment phenomenon and the saddle-to-saddle progression.
What carries the argument
Closed-form description of the gradient flow obtained by decoupling the evolution across orthogonal input directions.
If this is right
- The flow converges to a global minimizer of the training loss.
- Among all networks that fit the data perfectly, the one reached has the smallest variation norm.
- Neurons rapidly align their activation patterns to the orthogonal input directions at the start of training.
- The optimization path traverses a sequence of saddle points before reaching the final solution.
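To make the second bullet above concrete, here is one common convention for the variation norm of a bias-free one-hidden-layer ReLU network: the cheapest weighted sum of neuron norms over all finite parameterisations that realise the function. This is a sketch under that assumption. The symbols m, a_j, w_j and \sigma are notation introduced here, and the paper's own definition may differ in details such as normalisation or the treatment of infinite width.

```latex
% Assumed model (notation introduced for illustration):
%   f_\theta(x) = \sum_{j=1}^m a_j \sigma(\langle w_j, x\rangle),  \sigma(z) = \max(z, 0).
\[
  \|f\|_{\mathcal{V}}
  \;=\;
  \inf\Bigl\{ \sum_{j=1}^{m} |a_j|\,\|w_j\|_2 \;:\;
  m \in \mathbb{N},\ f = \sum_{j=1}^{m} a_j\,\sigma(\langle w_j,\cdot\rangle) \Bigr\}.
\]
% Minimum variation norm interpolator of the data (x_i, y_i)_{i=1}^n:
\[
  f^\star \in \arg\min\bigl\{ \|f\|_{\mathcal{V}} \;:\; f(x_i) = y_i \ \text{for all } i = 1,\dots,n \bigr\}.
\]
```

Since |a_j| \|w_j\|_2 \le (a_j^2 + \|w_j\|_2^2)/2 with equality when |a_j| = \|w_j\|_2, this quantity is closely tied to the squared Euclidean norm of the parameters for balanced networks.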
Where Pith is reading between the lines
- The minimum-variation-norm bias may explain why the learned function remains stable under small perturbations of the inputs.
- Similar decoupling arguments could be attempted for inputs that are only approximately orthogonal or lie in low-dimensional subspaces.
- The saddle-to-saddle structure suggests that the loss landscape contains a chain of critical points whose indices decrease along the flow.
Load-bearing premise
The input vectors are pairwise orthogonal.
What would settle it
Numerical integration of the gradient flow on a small orthogonal data set that either fails to reach zero loss or ends at a solution whose variation norm is not the smallest among all interpolators.
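The experiment described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' code. It uses explicit Euler steps as a stand-in for the continuous-time flow, a bias-free network f(x) = sum_j a_j relu(w_j . x), the loss (1/2n) sum_i (f(x_i) - y_i)^2, and a balanced small initialisation |a_j| = ||w_j||. The width m, scale eps, step size dt, step count, data, and all function names are assumptions chosen for illustration. The reference value in min_variation_norm is a closed form worked out for this bias-free model under the convention that the variation norm is the infimum of sum_j |a_j| ||w_j||; it is not quoted from the paper.

```python
import numpy as np

def make_orthogonal_inputs(norms, d, seed=0):
    """Rows are pairwise orthogonal vectors in R^d with the given norms."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # orthonormal columns
    return np.asarray(norms)[:, None] * Q.T[: len(norms)]

def gradient_flow_euler(X, y, m=50, eps=1e-3, dt=1e-2, steps=20000, seed=1):
    """Explicit Euler discretisation of gradient flow (all defaults are assumptions)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W = eps * rng.standard_normal((m, d))              # hidden weights, small scale
    # Balanced initialisation |a_j| = ||w_j|| with random signs.
    a = np.linalg.norm(W, axis=1) * rng.choice([-1.0, 1.0], size=m)
    losses = []
    for _ in range(steps):
        pre = X @ W.T                                  # pre-activations, shape (n, m)
        act = np.maximum(pre, 0.0)                     # ReLU
        r = act @ a - y                                # residuals f(x_i) - y_i
        losses.append(0.5 * np.mean(r ** 2))
        grad_a = act.T @ r / n
        # ReLU derivative taken as 0 at the kink (a subgradient choice).
        grad_W = ((pre > 0) * r[:, None]).T @ X * a[:, None] / n
        a -= dt * grad_a
        W -= dt * grad_W
    return W, a, np.array(losses)

def variation_norm(W, a):
    """Parameter-level cost sum_j |a_j| * ||w_j|| of the trained network."""
    return float(np.sum(np.abs(a) * np.linalg.norm(W, axis=1)))

def min_variation_norm(X, y):
    """Assumed closed form for orthogonal data: one neuron per label sign."""
    s = y / np.linalg.norm(X, axis=1)
    return float(np.linalg.norm(s[s > 0]) + np.linalg.norm(s[s < 0]))

X = make_orthogonal_inputs([1.0, 1.5, 2.0], d=4)
y = np.array([1.0, -0.5, 2.0])
W, a, losses = gradient_flow_euler(X, y)

print("final loss:", losses[-1])
print("variation norm reached:", variation_norm(W, a))
print("reference minimum:", min_variation_norm(X, y))
```

Plotting `losses` against the step index on a logarithmic axis is one way to look for plateaus consistent with a saddle-to-saddle path. Shrinking `eps` (and lengthening the run accordingly) is the natural way to probe how close the reached variation norm gets to the reference value.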
Original abstract
The training of neural networks by gradient descent methods is a cornerstone of the deep learning revolution. Yet, despite some recent progress, a complete theory explaining its success is still missing. This article presents, for orthogonal input vectors, a precise description of the gradient flow dynamics of training one-hidden layer ReLU neural networks for the mean squared error at small initialisation. In this setting, despite non-convexity, we show that the gradient flow converges to zero loss and characterise its implicit bias towards minimum variation norm. Furthermore, some interesting phenomena are highlighted: a quantitative description of the initial alignment phenomenon and a proof that the process follows a specific saddle to saddle dynamics.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that, for pairwise orthogonal input vectors, the gradient flow of one-hidden-layer ReLU networks trained on the square loss from small initialization converges to zero loss and exhibits an implicit bias towards the minimum variation norm interpolator. It further provides a quantitative description of the initial alignment phase and establishes that the dynamics follow a specific saddle-to-saddle trajectory.
Significance. This result offers a rare closed-form characterization of the training dynamics in a non-convex setting, made possible by the orthogonality assumption that decouples the problem. The explicit convergence proof and bias characterization are significant contributions to the theory of implicit bias in neural networks. The work delivers parameter-free derivations of the flow and falsifiable predictions for the orthogonal case.
major comments (2)
- §3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when the inputs are pairwise orthogonal. The derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, since this step is load-bearing for the entire analysis.
- §5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization. The precise condition on the initial scale relative to the data norms must be stated quantitatively; otherwise the convergence claim is not justified for an arbitrary small initialization.
minor comments (2)
- Abstract: the term 'minimum variation norm' appears without a one-sentence definition or pointer to its precise mathematical expression; a brief clarification would aid readers.
- Introduction (notation): the distinction between the continuous-time gradient flow ODE and any discrete gradient descent implementation is not always explicit in early sections; consistent use of 'flow' versus 'descent' would prevent confusion.
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and the constructive comments on the manuscript. The suggestions will improve the clarity and rigor of the presentation. We address each major comment below and will revise the manuscript accordingly.
- Referee: §3 (Orthogonality-based decoupling): the central closed-form description rests on showing that the gradient flow equations decouple across input directions when the inputs are pairwise orthogonal. The derivation must explicitly verify that all cross terms vanish and that each direction evolves independently, since this step is load-bearing for the entire analysis.
  Authors: We agree that an explicit verification of the decoupling is necessary for the closed-form analysis. In the revised manuscript we will add a dedicated lemma (or expanded calculation) in Section 3 that starts from the gradient-flow ODE, substitutes the orthogonality condition x_i · x_j = 0 for i ≠ j, and shows term by term that all cross-derivative contributions vanish, thereby confirming that each input direction evolves independently.
  Revision: yes
- Referee: §5 (Convergence to zero loss): the proof that the flow reaches zero loss via the saddle-to-saddle path assumes small initialization. The precise condition on the initial scale relative to the data norms must be stated quantitatively; otherwise the convergence claim is not justified for an arbitrary small initialization.
  Authors: We thank the referee for this observation. The convergence statement does rely on the initialization being sufficiently small relative to the data norms. In the revision we will state the required quantitative bound explicitly (e.g., that the initial scale ε satisfies ε < c / max_i ||x_i|| for a positive constant c depending only on the problem parameters) and indicate where this bound enters the saddle-to-saddle argument.
  Revision: yes
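The term-by-term check promised in the first response can be sketched as follows. This is an illustrative calculation under assumed notation, not the authors' lemma: the network is taken to be bias-free, f_\theta(x) = \sum_j a_j \sigma(\langle w_j, x\rangle), the loss is normalised as 1/(2n) times the sum of squared residuals, and the ReLU derivative is taken to be the indicator of a strictly positive pre-activation.

```latex
% Assumed setup (notation introduced for illustration):
%   f_\theta(x) = \sum_{j=1}^m a_j \sigma(\langle w_j, x\rangle),
%   L(\theta) = \frac{1}{2n}\sum_{i=1}^n r_i^2, \quad r_i = f_\theta(x_i) - y_i.
\begin{align*}
  \dot w_j &= -\nabla_{w_j} L
            = -\frac{1}{n}\sum_{i=1}^{n} r_i\, a_j\,
              \mathbf{1}[\langle w_j, x_i\rangle > 0]\, x_i, \\
  \frac{d}{dt}\langle w_j, x_k\rangle
           &= -\frac{1}{n}\sum_{i=1}^{n} r_i\, a_j\,
              \mathbf{1}[\langle w_j, x_i\rangle > 0]\, \langle x_i, x_k\rangle \\
           &= -\frac{1}{n}\, r_k\, a_j\,
              \mathbf{1}[\langle w_j, x_k\rangle > 0]\, \|x_k\|^2
              \qquad \text{since } \langle x_i, x_k\rangle = 0 \text{ for } i \neq k, \\
  \dot a_j &= -\frac{1}{n}\sum_{i=1}^{n} r_i\, \sigma(\langle w_j, x_i\rangle).
\end{align*}
```

In this sketch the pre-activation of neuron j on x_k is driven only by the residual at x_k, which is where the cross terms drop out. The equation for a_j still sums over all data points, so a full independence statement would also have to account for the coupling through the outer weights and the residuals.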
Circularity Check
No significant circularity detected
The paper derives gradient-flow dynamics for shallow ReLU networks under the explicit, load-bearing assumption of pairwise orthogonal inputs. This assumption is invoked once to decouple the dynamics across directions and obtain a closed-form characterization; the subsequent convergence-to-zero-loss and implicit-bias statements are then proved directly from the resulting ODEs. No parameter is fitted to data and then relabeled as a prediction, no self-citation supplies a uniqueness theorem that forces the result, and the derivation does not redefine any target quantity in terms of itself. The central claims therefore remain independent of the inputs they are derived from.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Input vectors are pairwise orthogonal
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "characterise its implicit bias towards minimum variation norm"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.