pith. sign in

arxiv: 2102.11840 · v2 · pith:AMPODYNAnew · submitted 2021-02-23 · 💻 cs.LG · cs.NA· math.NA· math.PR

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

Pith reviewed 2026-05-24 14:08 UTC · model grok-4.3

classification 💻 cs.LG cs.NAmath.NAmath.PR
keywords overparameterized neural networksgradient descentlinear convergencepiecewise affine activationReLUmean squared errorshallow networksbatch gradient descent
0
0 comments X

The pith

Batch gradient descent on sufficiently wide shallow networks with non-affine piecewise affine activations converges linearly to zero mean squared error with high probability over random initialization when inputs are distinct.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that overparameterized shallow fully connected neural networks with activations such as ReLU reach exact interpolation of distinct training points under batch gradient descent. The mean squared error decreases at a linear rate once the network width exceeds a threshold and the learning rate stays below another threshold. The result applies to randomly initialized networks and holds with high probability. It explains the observed success of gradient descent on non-convex, non-smooth objectives in the overparameterized regime.

Core claim

Given that the activation function is not affine and the training input data are pairwise distinct, the mean squared error of a randomly initialized overparameterized shallow artificial neural network with piecewise affine activation optimized via batch gradient descent converges to zero at a linear convergence rate with high probability when the width is sufficiently large and the learning rate is sufficiently small.

What carries the argument

The gradient flow dynamics of batch gradient descent on the mean squared error loss for overparameterized networks with piecewise affine activations, which permit exact interpolation once width is large.

If this is right

  • The training error contracts geometrically at each gradient step once the width and step-size conditions hold.
  • The guarantee applies with high probability over the random initialization of weights.
  • The same linear rate extends to any piecewise affine activation that is not globally affine.
  • The result requires only shallow fully connected architectures and mean squared error loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar linear rates may hold for stochastic gradient descent variants if batch size is controlled.
  • The distinct-input condition could be relaxed by considering data with small perturbations.
  • The width threshold might be made explicit in terms of the number of training points and the activation's kink locations.

Load-bearing premise

The activation function is not affine and every pair of training inputs is distinct.

What would settle it

An explicit counterexample consisting of distinct inputs, a non-affine piecewise affine activation, a network width above the paper's threshold, and a learning rate below the paper's threshold where the mean squared error fails to decay linearly under batch gradient descent.

read the original abstract

In recent years, artificial neural networks have developed into a powerful tool for addressing a multitude of problems for which classical solution approaches reach their limits. However, it is still unclear why gradient descent optimization algorithms with random initialization, such as the well-known batch gradient descent, are able to achieve zero training loss in many situations, even though the objective function is non-convex and non-smooth. One of the most promising approaches to solving this issue in the field of supervised learning is the analysis of gradient descent optimization in the so-called overparameterized regime. In this article, we provide a further contribution to this area of research by considering overparameterized fully connected shallow artificial neural networks with piecewise affine activation, such as the rectified linear unit activation. Specifically, given that the activation function is not affine and the training input data are pairwise distinct, we show that, with high probability, the mean squared error of such a randomly initialized artificial neural network optimized via batch gradient descent converges to zero at a linear convergence rate as long as the width of the artificial neural network is sufficiently large and the learning rate is sufficiently small.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript proves that for overparameterized shallow fully-connected networks with non-affine piecewise-affine activations (e.g., ReLU) and pairwise-distinct training inputs, batch gradient descent with random initialization achieves linear convergence of the mean-squared training error to zero with high probability, provided the width is sufficiently large and the step size sufficiently small.

Significance. If the central theorem holds, the result strengthens the theoretical account of global convergence for gradient methods on non-convex, non-smooth losses in the overparameterized regime by giving explicit width and step-size conditions that keep the effective Gram matrix positive definite throughout training. The derivation supplies a concrete, falsifiable scaling that can be checked against numerical experiments.

minor comments (3)
  1. [Main theorem statement] The statement of the main theorem (presumably in §3) should include an explicit lower bound on the required width in terms of the number of data points and the minimal separation of the inputs, rather than the qualitative phrase “sufficiently large.”
  2. [§2 or §3] The proof sketch in the abstract and introduction refers to “the effective Gram matrix”; the precise definition of this matrix (likely appearing after the network is written in the piecewise-linear form) should be given before any claim about its positive-definiteness is used.
  3. [Numerical section] Figure 1 (or the first numerical illustration) plots training loss versus iteration; the caption should state the exact width, step size, and data dimension used so that the linear-rate regime can be directly compared with the theorem’s hypotheses.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were listed in the report, so we have no individual points to address.

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained

full rationale

The paper derives linear convergence of batch gradient descent for overparameterized shallow networks with non-affine piecewise-linear activations and distinct inputs. The argument proceeds from random initialization to positive-definiteness of the effective Gram matrix (ensured by width and non-degeneracy conditions) and then to a contraction mapping on the loss, all under explicit probabilistic and analytic bounds. No step reduces the claimed rate to a fitted quantity, a self-citation chain, or a definitional tautology; the result is obtained from first-principles analysis of the gradient flow and is externally falsifiable via the stated assumptions on activation and data.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The result rests on standard real-analysis and probability axioms for gradient descent and piecewise-linear functions; no free parameters, invented entities, or ad-hoc assumptions are visible in the abstract.

axioms (1)
  • standard math Standard properties of gradient descent iterates and piecewise affine functions
    Invoked implicitly to establish the linear convergence under the width and step-size conditions.

pith-pipeline@v0.9.0 · 5737 in / 1110 out tokens · 24468 ms · 2026-05-24T14:08:32.502868+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 6 internal anchors

  1. [1]

    D., and Sabanis, S

    Akyildiz, ¨O. D., and Sabanis, S. Nonasymptotic analysis of Stochastic Gradient Hamiltonia n Monte Carlo under local conditions for nonconvex optimization. arXiv:2002.05465 (2021), 26 pages

  2. [2]

    Learning and generalization in over- parameterized neural networks, going beyond two layers

    Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. arXiv:1811.04918 (2020), 84 pages

  3. [3]

    A Convergence Theory for Deep Learning via Over-Parameterization

    Allen-Zhu, Z., Li, Y., and Song, Z. A Convergence Theory for Deep Learning via Over- Parameterization. arXiv:1811.03962 (2019), 53 pages

  4. [4]

    Breaking the curse of dimensionality with convex neural net works

    Bach, F. Breaking the curse of dimensionality with convex neural net works. Journal of Machine Learning Research 18, 19 (2017), 1–53

  5. [5]

    Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n )

    Bach, F., and Moulines, E. Non-strongly-convex smooth stochastic approximation wit h convergence rateO(1/n ). In Advances in Neural Information Processing Systems (2013), C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., vol. 2 6, Curran Associates, Inc., pp. 773–781. 36

  6. [7]

    A proof of convergence for gradient descent in the training of artificial neural networks for con stant target functions

    Cheridito, P., Jentzen, A., Riekert, A., and Rossmannek, F. A proof of convergence for gradient descent in the training of artificial neural networks for con stant target functions. arXiv:2102.09924 (2021), 23 pages

  7. [8]

    Non-convergence of stochastic gradient descent in the training of deep neural networks

    Cheridito, P., Jentzen, A., and Rossmannek, F. Non-convergence of stochastic gradient descent in the training of deep neural networks. Journal of Complexity (2020), 101540

  8. [9]

    On the Global Convergence of Gradient Descent for Over-parameterized Models using Optimal Transport

    Chizat, L., and Bach, F. On the Global Convergence of Gradient Descent for Over-para meterized Models using Optimal Transport. arXiv:1805.09545 (2018), 32 pages

  9. [10]

    Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes

    Dereich, S., and Kassing, S. Convergence of stochastic gradient descent schemes for Loj asiewicz- landscapes. arXiv:2102.09385 (2021), 21 pages

  10. [11]

    Gradient Descent Finds Global Minima of Deep Neural Networks

    Du, S. S., Lee, J. D., Li, H., W ang, L., and Zhai, X. Gradient Descent Finds Global Minima of Deep Neural Networks. arXiv:1811.03804 (2019), 45 pages

  11. [12]

    Gradient Descent Provably Optimizes Over-parameterized Neural Networks

    Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient Descent Provably Optimizes Over- parameterized Neural Networks. arXiv:1810.02054 (2019), 19 pages

  12. [13]

    A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics

    E, W., Ma, C., and Wu, L. A comparative analysis of optimization and generalization properties of two-layer neural network and random feature models under gr adient descent dynamics. Science China Mathematics 63 , 7 (2020), 1235–1258

  13. [14]

    Convergence rates for the stochastic gradient descent method for non-convex objective functions

    Fehrman, B., Gess, B., and Jentzen, A. Convergence rates for the stochastic gradient descent method for non-convex objective functions. J. Mach. Learn. Res. 21 (2020), Paper No. 136, 48

  14. [15]

    Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S

    Hanin, B. Which Neural Net Architectures Give Rise to Exploding and Va nishing Gradients? In Ad- vances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grau- man, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, Curran A ssociates, Inc., pp. 582–591

  15. [16]

    How to Start Training: The Effect of Initialization and Archi tecture

    Hanin, B., and Rolnick, D. How to Start Training: The Effect of Initialization and Archi tecture. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31, C urran Associates, Inc., pp. 571–581

  16. [17]

    Neural Tangent Kernel: Convergence and Generalization in Neural Networks

    Jacot, A., Gabriel, F., and Hongler, C. Neural Tangent Kernel: Convergence and Generalization in Neural Networks. arXiv:1806.07572 (2020), 19 pages

  17. [18]

    Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates

    Jentzen, A., and von Wurstemberger, P. Lower error bounds for the stochastic gradient descent optimization algorithm: Sharp convergence rates for slowl y and fast decaying learning rates. Journal of Complexity 57 (2020), 101438

  18. [19]

    Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation

    Jentzen, A., and Welti, T. Overall error analysis for the training of deep neural netwo rks via stochastic gradient descent with random initialisation. arXiv:1910.00121 (2020), 51 pages

  19. [20]

    Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions

    Lei, Y., Hu, T., Li, G., and Tang, K. Stochastic Gradient Descent for Nonconvex Learning Withou t Bounded Gradient Assumptions. IEEE Transactions on Neural Networks and Learning Systems 3 1, 10 (2020), 4394–4400

  20. [21]

    Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data

    Li, Y., and Liang, Y. Learning Overparameterized Neural Networks via Stochasti c Gradient Descent on Structured Data. arXiv:1808.01204 (2019), 28 pages

  21. [22]

    Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms

    Lovas, A., Lytras, I., R ´asonyi, M., and Sabanis, S. Taming neural networks with TUSLA: Non-convex learning via adaptive stochastic gradient Lang evin algorithms. arXiv:2006.14514 (2020), 29 pages

  22. [23]

    Lu, L., Shin, Y., Su, Y., and Karniadakis, G. E. Dying ReLU and Initialization: Theory and Numerical Examples. Communications in Computational Physics 28 , 5 (2020), 1671–1706. 37

  23. [24]

    Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning

    Moulines, E., and Bach, F. Non-Asymptotic Analysis of Stochastic Approximation Algo rithms for Machine Learning. In Advances in Neural Information Processing Systems (2011), J. Shawe-Taylor, R. Zemel, P. Bartlett, F. Pereira, and K. Q. Weinberger, Eds. , vol. 24, Curran Associates, Inc., pp. 451– 459

  24. [25]

    An overview of gradient descent optimization algorithms

    Ruder, S. An overview of gradient descent optimization algorithms. arXiv:1609.04747 (2017), 14 pages

  25. [26]

    A., De, S., Xu, Z., Huang, W

    Sankararaman, K. A., De, S., Xu, Z., Huang, W. R., and Goldste in, T. The Impact of Neural Network Overparameterization on Gradient Confus ion and Stochastic Gradient Descent. arXiv:1904.06963 (2020), 28 pages

  26. [27]

    Shin, Y., and Karniadakis, G. E. Trainability of ReLU networks and Data-dependent Initiali zation. Journal of Machine Learning for Modeling and Computing 1 , 1 (2020), 39–74

  27. [28]

    How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective

    Wu, L., Ma, C., and E, W. How SGD Selects the Global Minima in Over-parameterized Lea rning: A Dynamical Stability Perspective. In Advances in Neural Information Processing Systems (2018), S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bi anchi, and R. Garnett, Eds., vol. 31, Curran Associates, Inc., pp. 8279–8288

  28. [29]

    Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks

    Zou, D., Cao, Y., Zhou, D., and Gu, Q. Stochastic Gradient Descent Optimizes Over-parameterize d Deep ReLU Networks. arXiv:1811.08888 (2018), 54 pages. 38