Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

Arnulf Jentzen; Timo Kr\"oger

arxiv: 2102.11840 · v2 · pith:AMPODYNAnew · submitted 2021-02-23 · 💻 cs.LG · cs.NA· math.NA· math.PR

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

Arnulf Jentzen , Timo Kr\"oger This is my paper

Pith reviewed 2026-05-24 14:08 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.PR

keywords overparameterized neural networksgradient descentlinear convergencepiecewise affine activationReLUmean squared errorshallow networksbatch gradient descent

0 comments

The pith

Batch gradient descent on sufficiently wide shallow networks with non-affine piecewise affine activations converges linearly to zero mean squared error with high probability over random initialization when inputs are distinct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that overparameterized shallow fully connected neural networks with activations such as ReLU reach exact interpolation of distinct training points under batch gradient descent. The mean squared error decreases at a linear rate once the network width exceeds a threshold and the learning rate stays below another threshold. The result applies to randomly initialized networks and holds with high probability. It explains the observed success of gradient descent on non-convex, non-smooth objectives in the overparameterized regime.

Core claim

Given that the activation function is not affine and the training input data are pairwise distinct, the mean squared error of a randomly initialized overparameterized shallow artificial neural network with piecewise affine activation optimized via batch gradient descent converges to zero at a linear convergence rate with high probability when the width is sufficiently large and the learning rate is sufficiently small.

What carries the argument

The gradient flow dynamics of batch gradient descent on the mean squared error loss for overparameterized networks with piecewise affine activations, which permit exact interpolation once width is large.

If this is right

The training error contracts geometrically at each gradient step once the width and step-size conditions hold.
The guarantee applies with high probability over the random initialization of weights.
The same linear rate extends to any piecewise affine activation that is not globally affine.
The result requires only shallow fully connected architectures and mean squared error loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar linear rates may hold for stochastic gradient descent variants if batch size is controlled.
The distinct-input condition could be relaxed by considering data with small perturbations.
The width threshold might be made explicit in terms of the number of training points and the activation's kink locations.

Load-bearing premise

The activation function is not affine and every pair of training inputs is distinct.

What would settle it

An explicit counterexample consisting of distinct inputs, a non-affine piecewise affine activation, a network width above the paper's threshold, and a learning rate below the paper's threshold where the mean squared error fails to decay linearly under batch gradient descent.

read the original abstract

In recent years, artificial neural networks have developed into a powerful tool for addressing a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why gradient descent optimization algorithms with random initialization, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations, even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article, we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error of such a randomly initialized artificial neural network optimized via batch gradient descent converges to zero at a linear convergence rate as long as the width of the artificial neural network is sufficiently large and the learning rate is sufficiently small.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Paper proves linear convergence for GD on overparametrized shallow nets with piecewise affine activations under non-affine and distinct-input conditions.

read the letter

This paper delivers a linear convergence rate for the training loss of overparameterized shallow neural networks with piecewise affine activations when optimized by batch gradient descent. The result holds with high probability over random initialization if the activation is non-affine, the inputs are pairwise distinct, the width is large enough, and the learning rate is small enough. The new element is the handling of piecewise affine activations such as ReLU. Previous analyses often required smoother activations to get the linear rate, so this extends the theory to a more commonly used family while keeping the overparameterization framework. The conditions prevent the network from behaving like a linear model or having dependent features. The paper states the theorem cleanly and ties it to the broader question of why gradient descent succeeds in non-convex settings. The approach follows the standard path of showing that the Gram matrix remains positive definite along the trajectory. The main limitation is that the proof details are not visible in the provided abstract, so the exact way they bound the changes in the features during training would need checking. The 'sufficiently large' width and 'sufficiently small' learning rate are typical but can lead to very conservative bounds in practice. The result is also restricted to shallow networks and squared loss. This work is for specialists in mathematical analysis of neural network training. A reader tracking results on overparameterized models would get a useful data point from it. It should go to peer review because the claim is specific and the conditions are explicit, making it a solid addition even if the constants are not optimized.

Referee Report

0 major / 3 minor

Summary. The manuscript proves that for overparameterized shallow fully-connected networks with non-affine piecewise-affine activations (e.g., ReLU) and pairwise-distinct training inputs, batch gradient descent with random initialization achieves linear convergence of the mean-squared training error to zero with high probability, provided the width is sufficiently large and the step size sufficiently small.

Significance. If the central theorem holds, the result strengthens the theoretical account of global convergence for gradient methods on non-convex, non-smooth losses in the overparameterized regime by giving explicit width and step-size conditions that keep the effective Gram matrix positive definite throughout training. The derivation supplies a concrete, falsifiable scaling that can be checked against numerical experiments.

minor comments (3)

[Main theorem statement] The statement of the main theorem (presumably in §3) should include an explicit lower bound on the required width in terms of the number of data points and the minimal separation of the inputs, rather than the qualitative phrase “sufficiently large.”
[§2 or §3] The proof sketch in the abstract and introduction refers to “the effective Gram matrix”; the precise definition of this matrix (likely appearing after the network is written in the piecewise-linear form) should be given before any claim about its positive-definiteness is used.
[Numerical section] Figure 1 (or the first numerical illustration) plots training loss versus iteration; the caption should state the exact width, step size, and data dimension used so that the linear-rate regime can be directly compared with the theorem’s hypotheses.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were listed in the report, so we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives linear convergence of batch gradient descent for overparameterized shallow networks with non-affine piecewise-linear activations and distinct inputs. The argument proceeds from random initialization to positive-definiteness of the effective Gram matrix (ensured by width and non-degeneracy conditions) and then to a contraction mapping on the loss, all under explicit probabilistic and analytic bounds. No step reduces the claimed rate to a fitted quantity, a self-citation chain, or a definitional tautology; the result is obtained from first-principles analysis of the gradient flow and is externally falsifiable via the stated assumptions on activation and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on standard real-analysis and probability axioms for gradient descent and piecewise-linear functions; no free parameters, invented entities, or ad-hoc assumptions are visible in the abstract.

axioms (1)

standard math Standard properties of gradient descent iterates and piecewise affine functions
Invoked implicitly to establish the linear convergence under the width and step-size conditions.

pith-pipeline@v0.9.0 · 5737 in / 1110 out tokens · 24468 ms · 2026-05-24T14:08:32.502868+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 6 internal anchors

[1]

D., and Sabanis, S

Akyildiz, ¨O. D., and Sabanis, S. Nonasymptotic analysis of Stochastic Gradient Hamiltonia n Monte Carlo under local conditions for nonconvex optimization. arXiv:2002.05465 (2021), 26 pages

work page arXiv 2002
[2]

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv:1811.04918 (2020), 84 pages

work page arXiv 2020
[3]

A Convergence Theory for Deep Learning via Over-Parameterization

Allen-Zhu, Z., Li, Y., and Song, Z. A Convergence Theory for Deep Learning via Over- Parameterization. arXiv:1811.03962 (2019), 53 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019
[4]

Breaking the curse of dimensionality with convex neural net works

Bach, F. Breaking the curse of dimensionality with convex neural net works. Journal of Machine Learning Research 18, 19 (2017), 1–53

work page 2017
[5]

Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n )

Bach, F., and Moulines, E. Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n ). In Advances in Neural Information Processing Systems (2013), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., vol. 2 6, Curran Associates, Inc., pp. 773–781. 36

work page 2013
[7]

A proof of convergence for gradient descent in the training of artiﬁcial neural networks for con stant target functions

Cheridito, P., Jentzen, A., Riekert, A., and Rossmannek, F. A proof of convergence for gradient descent in the training of artiﬁcial neural networks for con stant target functions. arXiv:2102.09924 (2021), 23 pages

work page arXiv 2021
[8]

Non-convergence of stochastic gradient descent in the training of deep neural networks

Cheridito, P., Jentzen, A., and Rossmannek, F. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity (2020), 101540

work page 2020
[9]

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Chizat, L., and Bach, F. On the Global Convergence of Gradient Descent for Over-para meterized Models using Optimal Transport. arXiv:1805.09545 (2018), 32 pages

work page internal anchor Pith review Pith/arXiv arXiv 2018
[10]

Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

Dereich, S., and Kassing, S. Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes. arXiv:2102.09385 (2021), 21 pages

work page arXiv 2021
[11]

Gradient Descent Finds Global Minima of Deep Neural Networks

Du, S. S., Lee, J. D., Li, H., W ang, L., and Zhai, X. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 (2019), 45 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019
[12]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent Provably Optimizes Over- parameterized Neural Networks. arXiv:1810.02054 (2019), 19 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019
[13]

A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics

E, W., Ma, C., and Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics. Science China Mathematics 63 , 7 (2020), 1235–1258

work page 2020
[14]

Convergence rates for the stochastic gradient descent method for non-convex objective functions

Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21 (2020), Paper No. 136, 48

work page 2020
[15]

Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S

Hanin, B. Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grau- man, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran A ssociates, Inc., pp. 582–591

work page 2018
[16]

How to Start Training: The Eﬀect of Initialization and Archi tecture

Hanin, B., and Rolnick, D. How to Start Training: The Eﬀect of Initialization and Archi tecture. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, C urran Associates, Inc., pp. 571–581

work page 2018
[17]

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572 (2020), 19 pages

work page arXiv 2020
[18]

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates

Jentzen, A., and von Wurstemberger, P. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates. Journal of Complexity 57 (2020), 101438

work page 2020
[19]

Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation

Jentzen, A., and Welti, T. Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation. arXiv:1910.00121 (2020), 51 pages

work page arXiv 1910
[20]

Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions

Lei, Y., Hu, T., Li, G., and Tang, K. Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions. IEEE Transactions on Neural Networks and Learning Systems 3 1, 10 (2020), 4394–4400

work page 2020
[21]

Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data

Li, Y., and Liang, Y. Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data. arXiv:1808.01204 (2019), 28 pages

work page arXiv 2019
[22]

Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms

Lovas, A., Lytras, I., R ´asonyi, M., and Sabanis, S. Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms. arXiv:2006.14514 (2020), 29 pages

work page arXiv 2006
[23]

Lu, L., Shin, Y., Su, Y., and Karniadakis, G. E. Dying ReLU and Initialization: Theory and Numerical Examples. Communications in Computational Physics 28 , 5 (2020), 1671–1706. 37

work page 2020
[24]

Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning

Moulines, E., and Bach, F. Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning. In Advances in Neural Information Processing Systems (2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. , vol. 24, Curran Associates, Inc., pp. 451– 459

work page 2011
[25]

An overview of gradient descent optimization algorithms

Ruder, S. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2017), 14 pages

work page internal anchor Pith review Pith/arXiv arXiv 2017
[26]

A., De, S., Xu, Z., Huang, W

Sankararaman, K. A., De, S., Xu, Z., Huang, W. R., and Goldste in, T. The Impact of Neural Network Overparameterization on Gradient Confus ion and Stochastic Gradient Descent. arXiv:1904.06963 (2020), 28 pages

work page arXiv 1904
[27]

Shin, Y., and Karniadakis, G. E. Trainability of ReLU networks and Data-dependent Initiali zation. Journal of Machine Learning for Modeling and Computing 1 , 1 (2020), 39–74

work page 2020
[28]

How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective

Wu, L., Ma, C., and E, W. How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bi anchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc., pp. 8279–8288

work page 2018
[29]

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic Gradient Descent Optimizes Over-parameterize d Deep ReLU Networks. arXiv:1811.08888 (2018), 54 pages. 38

work page internal anchor Pith review Pith/arXiv arXiv 2018

[1] [1]

D., and Sabanis, S

Akyildiz, ¨O. D., and Sabanis, S. Nonasymptotic analysis of Stochastic Gradient Hamiltonia n Monte Carlo under local conditions for nonconvex optimization. arXiv:2002.05465 (2021), 26 pages

work page arXiv 2002

[2] [2]

Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv:1811.04918 (2020), 84 pages

work page arXiv 2020

[3] [3]

A Convergence Theory for Deep Learning via Over-Parameterization

Allen-Zhu, Z., Li, Y., and Song, Z. A Convergence Theory for Deep Learning via Over- Parameterization. arXiv:1811.03962 (2019), 53 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019

[4] [4]

Breaking the curse of dimensionality with convex neural net works

Bach, F. Breaking the curse of dimensionality with convex neural net works. Journal of Machine Learning Research 18, 19 (2017), 1–53

work page 2017

[5] [5]

Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n )

Bach, F., and Moulines, E. Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n ). In Advances in Neural Information Processing Systems (2013), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., vol. 2 6, Curran Associates, Inc., pp. 773–781. 36

work page 2013

[6] [7]

A proof of convergence for gradient descent in the training of artiﬁcial neural networks for con stant target functions

Cheridito, P., Jentzen, A., Riekert, A., and Rossmannek, F. A proof of convergence for gradient descent in the training of artiﬁcial neural networks for con stant target functions. arXiv:2102.09924 (2021), 23 pages

work page arXiv 2021

[7] [8]

Non-convergence of stochastic gradient descent in the training of deep neural networks

Cheridito, P., Jentzen, A., and Rossmannek, F. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity (2020), 101540

work page 2020

[8] [9]

On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

Chizat, L., and Bach, F. On the Global Convergence of Gradient Descent for Over-para meterized Models using Optimal Transport. arXiv:1805.09545 (2018), 32 pages

work page internal anchor Pith review Pith/arXiv arXiv 2018

[9] [10]

Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

Dereich, S., and Kassing, S. Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes. arXiv:2102.09385 (2021), 21 pages

work page arXiv 2021

[10] [11]

Gradient Descent Finds Global Minima of Deep Neural Networks

Du, S. S., Lee, J. D., Li, H., W ang, L., and Zhai, X. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 (2019), 45 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019

[11] [12]

Gradient Descent Provably Optimizes Over-parameterized Neural Networks

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent Provably Optimizes Over- parameterized Neural Networks. arXiv:1810.02054 (2019), 19 pages

work page internal anchor Pith review Pith/arXiv arXiv 2019

[12] [13]

A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics

E, W., Ma, C., and Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics. Science China Mathematics 63 , 7 (2020), 1235–1258

work page 2020

[13] [14]

Convergence rates for the stochastic gradient descent method for non-convex objective functions

Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21 (2020), Paper No. 136, 48

work page 2020

[14] [15]

Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S

Hanin, B. Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grau- man, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran A ssociates, Inc., pp. 582–591

work page 2018

[15] [16]

How to Start Training: The Eﬀect of Initialization and Archi tecture

Hanin, B., and Rolnick, D. How to Start Training: The Eﬀect of Initialization and Archi tecture. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, C urran Associates, Inc., pp. 571–581

work page 2018

[16] [17]

Neural Tangent Kernel: Convergence and Generalization in Neural Networks

Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572 (2020), 19 pages

work page arXiv 2020

[17] [18]

Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates

Jentzen, A., and von Wurstemberger, P. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates. Journal of Complexity 57 (2020), 101438

work page 2020

[18] [19]

Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation

Jentzen, A., and Welti, T. Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation. arXiv:1910.00121 (2020), 51 pages

work page arXiv 1910

[19] [20]

Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions

Lei, Y., Hu, T., Li, G., and Tang, K. Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions. IEEE Transactions on Neural Networks and Learning Systems 3 1, 10 (2020), 4394–4400

work page 2020

[20] [21]

Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data

Li, Y., and Liang, Y. Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data. arXiv:1808.01204 (2019), 28 pages

work page arXiv 2019

[21] [22]

Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms

Lovas, A., Lytras, I., R ´asonyi, M., and Sabanis, S. Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms. arXiv:2006.14514 (2020), 29 pages

work page arXiv 2006

[22] [23]

Lu, L., Shin, Y., Su, Y., and Karniadakis, G. E. Dying ReLU and Initialization: Theory and Numerical Examples. Communications in Computational Physics 28 , 5 (2020), 1671–1706. 37

work page 2020

[23] [24]

Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning

Moulines, E., and Bach, F. Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning. In Advances in Neural Information Processing Systems (2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. , vol. 24, Curran Associates, Inc., pp. 451– 459

work page 2011

[24] [25]

An overview of gradient descent optimization algorithms

Ruder, S. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2017), 14 pages

work page internal anchor Pith review Pith/arXiv arXiv 2017

[25] [26]

A., De, S., Xu, Z., Huang, W

Sankararaman, K. A., De, S., Xu, Z., Huang, W. R., and Goldste in, T. The Impact of Neural Network Overparameterization on Gradient Confus ion and Stochastic Gradient Descent. arXiv:1904.06963 (2020), 28 pages

work page arXiv 1904

[26] [27]

Shin, Y., and Karniadakis, G. E. Trainability of ReLU networks and Data-dependent Initiali zation. Journal of Machine Learning for Modeling and Computing 1 , 1 (2020), 39–74

work page 2020

[27] [28]

How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective

Wu, L., Ma, C., and E, W. How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bi anchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc., pp. 8279–8288

work page 2018

[28] [29]

Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic Gradient Descent Optimizes Over-parameterize d Deep ReLU Networks. arXiv:1811.08888 (2018), 54 pages. 38

work page internal anchor Pith review Pith/arXiv arXiv 2018