Gradient Descent Finds Global Minima of Deep Neural Networks

Haochuan Li; Jason D. Lee; Liwei Wang; Simon S. Du; Xiyu Zhai

arxiv: 1811.03804 · v4 · pith:MQTFJ7CKnew · submitted 2018-11-09 · cs.LG · cs.AI · cs.CV · math.OC · stat.ML

Simon S. Du, Jason D. Lee, Haochuan Li, Liwei Wang, Xiyu Zhai

classification cs.LG cs.AI cs.CV math.OC stat.ML

keywords neural, deep, descent, gradient, global, networks, training, analysis

Gradient descent finds a global minimum in training deep neural networks despite the objective function being non-convex. The current paper proves gradient descent achieves zero training loss in polynomial time for a deep over-parameterized neural network with residual connections (ResNet). Our analysis relies on the particular structure of the Gram matrix induced by the neural network architecture. This structure allows us to show the Gram matrix is stable throughout the training process and this stability implies the global optimality of the gradient descent algorithm. We further extend our analysis to deep residual convolutional neural networks and obtain a similar convergence result.
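
As a rough illustration of the Gram-matrix argument sketched in the abstract, the following NumPy example trains a much simpler model than the one the paper analyzes: a two-layer over-parameterized ReLU network rather than a deep ResNet. It tracks the training loss alongside the smallest eigenvalue of an empirical Gram matrix. Everything here is an assumption made for illustration: the function name `train_two_layer_relu`, the data, the width `m`, the step size, and the specific Gram matrix definition are not taken from the paper.

```python
import numpy as np

def train_two_layer_relu(n=10, d=10, m=2000, lr=0.1, steps=2000, seed=0):
    """Toy sketch: full-batch gradient descent on a wide two-layer ReLU net.

    Model (an illustrative simplification, not the paper's ResNet):
        f(x) = (1 / sqrt(m)) * sum_r a_r * relu(w_r . x)
    Only the first-layer weights W are trained; the signs a are fixed.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)  # unit-norm inputs
    y = rng.uniform(-1.0, 1.0, size=n)             # arbitrary targets
    W = rng.standard_normal((m, d))                # random initialization
    a = rng.choice([-1.0, 1.0], size=m)            # fixed output weights

    def gram(W):
        # H_ij = x_i . x_j * (fraction of units active on both x_i and x_j)
        act = (X @ W.T > 0).astype(float)          # (n, m) activation pattern
        return (X @ X.T) * (act @ act.T) / m

    def loss_and_residual(W):
        u = np.maximum(X @ W.T, 0.0) @ a / np.sqrt(m)  # predictions, shape (n,)
        r = u - y
        return 0.5 * float(r @ r), r

    H_init = gram(W)
    losses = []
    for _ in range(steps):
        loss, r = loss_and_residual(W)
        losses.append(loss)
        active = X @ W.T > 0                       # ReLU derivative, (n, m)
        # Gradient of 0.5 * ||u - y||^2 with respect to W, shape (m, d)
        grad = (active * r[:, None] * a[None, :]).T @ X / np.sqrt(m)
        W -= lr * grad
    losses.append(loss_and_residual(W)[0])         # loss after the last step

    H_final = gram(W)
    return {
        "losses": losses,
        "lambda_min_init": float(np.linalg.eigvalsh(H_init)[0]),
        "lambda_min_final": float(np.linalg.eigvalsh(H_final)[0]),
        "gram_rel_change": float(
            np.linalg.norm(H_final - H_init) / np.linalg.norm(H_init)
        ),
    }

if __name__ == "__main__":
    out = train_two_layer_relu()
    print("initial loss:", out["losses"][0])
    print("final loss:  ", out["losses"][-1])
    print("lambda_min(H) at init / end:",
          out["lambda_min_init"], out["lambda_min_final"])
    print("relative change in H:", out["gram_rel_change"])
```

In this toy setting the loss should shrink steadily while the smallest eigenvalue of the Gram matrix stays bounded away from zero, which is the qualitative behavior the abstract describes. The sketch does not reproduce the paper's architecture, width requirements, or convergence rates.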

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Limitations of Lazy Training of Two-layers Neural Networks
stat.ML · 2019-06 · unverdicted · novelty 8.0

For quadratic targets in d dimensions, two-layer quadratic networks achieve lower risk when fully trained than in random features or neural tangent regimes if hidden units < d.

Estimating Dense-Packed Zone Height in Liquid-Liquid Separation: A Physics-Informed Neural Network Approach
cs.LG · 2026-01 · unverdicted · novelty 6.0

A PINN pretrained on mechanistic synthetic data and fine-tuned experimentally is deployed in an EKF-style filter to estimate separator phase heights from flow rates alone.

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation
cs.LG · 2021-02 · unverdicted · novelty 4.0

Batch gradient descent achieves linear convergence to zero MSE with high probability for sufficiently wide shallow NNs with non-affine piecewise affine activations and distinct inputs.

Two-block vs. Multi-block ADMM: An empirical evaluation of convergence
stat.ML · 2019-07 · unverdicted · novelty 4.0

Empirical study finds multi-block ADMM outperforms two-block ADMM on optimization and prediction in multi-task learning across all tested datasets and dual step sizes.