A Convergence Theory for Deep Learning via Over-Parameterization

2018 · cs.LG · arXiv 1811.03962

5 Pith papers cite this work. Polarity classification is still indexing.


abstract

Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
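The abstract's claim that a sufficiently wide network trained from random initialization drives the regression loss down at a linear rate can be illustrated with a small experiment. The sketch below is a toy stand-in and not the paper's construction: it uses a two-layer ReLU network rather than a deep one, full-batch gradient descent rather than SGD, and trains only the first layer. The function name `train_wide_relu_net` and every size, step size, and seed are assumptions chosen for illustration.

```python
import numpy as np


def train_wide_relu_net(n=5, d=10, m=2048, lr=0.5, steps=200, seed=0):
    """Full-batch gradient descent on a wide two-layer ReLU network.

    Toy setting (an assumption for illustration, not the paper's setup):
    f(x) = a^T relu(W x) / sqrt(m), with W trained and a held fixed.
    Returns the training loss recorded before each of the `steps` updates.
    """
    rng = np.random.default_rng(seed)

    # Non-degenerate inputs: random unit vectors are almost surely distinct.
    X = rng.standard_normal((n, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    y = rng.standard_normal(n)

    # Random initialization; the width m is much larger than n.
    W = rng.standard_normal((m, d))
    a = rng.choice([-1.0, 1.0], size=m)

    losses = []
    for _ in range(steps):
        pre = X @ W.T                                # (n, m) pre-activations
        f = np.maximum(pre, 0.0) @ a / np.sqrt(m)    # (n,) network outputs
        r = f - y                                    # residuals
        losses.append(0.5 * float(r @ r))            # squared regression loss

        # Gradient of the loss w.r.t. W, using the ReLU derivative 1[pre > 0].
        active = (pre > 0).astype(float)
        grad = ((r[:, None] * active) * a[None, :]).T @ X / np.sqrt(m)
        W -= lr * grad
    return losses


if __name__ == "__main__":
    losses = train_wide_relu_net()
    print(f"initial loss: {losses[0]:.4f}, final loss: {losses[-1]:.3e}")
```

Plotting `losses` on a logarithmic axis should show a roughly straight downward line, which is what linear convergence looks like. Shrinking `m` toward `n` is a simple way to see how the behavior changes when the network is no longer heavily over-parameterized.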

representative citing papers

The Statistical Cost of Adaptation in Multi-Source Transfer Learning

math.ST · 2026-05-10 · unverdicted · novelty 8.0

Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels

cs.LG · 2025-09-24 · unverdicted · novelty 7.0

Introduces alignment-sensitive effective span dimension (ESD) for learned-kernel spectral algorithms and proves minimax excess risk bounds of order sigma^2 * ESD, with gradient flow shown to reduce ESD.

LoRA: Low-Rank Adaptation of Large Language Models

cs.CL · 2021-06-17 · accept · novelty 7.0

Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains

stat.ML · 2023-05-04 · unverdicted · novelty 5.0

A method is given to determine eigenvalue decay rates of NTK and related kernels on general domains, leading to minimax optimality results for wide neural networks under smoothness assumptions on the target function.

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation

cs.LG · 2021-02-23 · unverdicted · novelty 4.0

Batch gradient descent achieves linear convergence to zero MSE with high probability for sufficiently wide shallow NNs with non-affine piecewise affine activations and distinct inputs.

citing papers explorer

Showing 5 of 5 citing papers.

The Statistical Cost of Adaptation in Multi-Source Transfer Learning
math.ST · 2026-05-10 · unverdicted · none · ref 148

Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
cs.LG · 2025-09-24 · unverdicted · none · ref 2 · internal anchor

LoRA: Low-Rank Adaptation of Large Language Models
cs.CL · 2021-06-17 · accept · none · ref 5

On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains
stat.ML · 2023-05-04 · unverdicted · none · ref 1 · internal anchor

Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation
cs.LG · 2021-02-23 · unverdicted · none · ref 3 · internal anchor
