Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
A Convergence Theory for Deep Learning via Over-Parameterization
5 Pith papers cite this work. Polarity classification is still indexing.
abstract
Deep neural networks (DNNs) have demonstrated dominating performance in many fields; since AlexNet, networks used in practice are going wider and deeper. On the theoretical side, a long line of works has been focusing on training neural networks with one hidden layer. The theory of multi-layer networks remains largely unsettled. In this work, we prove why stochastic gradient descent (SGD) can find $\textit{global minima}$ on the training objective of DNNs in $\textit{polynomial time}$. We only make two assumptions: the inputs are non-degenerate and the network is over-parameterized. The latter means the network width is sufficiently large: $\textit{polynomial}$ in $L$, the number of layers and in $n$, the number of samples. Our key technique is to derive that, in a sufficiently large neighborhood of the random initialization, the optimization landscape is almost-convex and semi-smooth even with ReLU activations. This implies an equivalence between over-parameterized neural networks and neural tangent kernel (NTK) in the finite (and polynomial) width setting. As concrete examples, starting from randomly initialized weights, we prove that SGD can attain 100% training accuracy in classification tasks, or minimize regression loss in linear convergence speed, with running time polynomial in $n,L$. Our theory applies to the widely-used but non-smooth ReLU activation, and to any smooth and possibly non-convex loss functions. In terms of network architectures, our theory at least applies to fully-connected neural networks, convolutional neural networks (CNN), and residual neural networks (ResNet).
representative citing papers
Introduces alignment-sensitive effective span dimension (ESD) for learned-kernel spectral algorithms and proves minimax excess risk bounds of order sigma^2 * ESD, with gradient flow shown to reduce ESD.
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
A method is given to determine eigenvalue decay rates of NTK and related kernels on general domains, leading to minimax optimality results for wide neural networks under smoothness assumptions on the target function.
Batch gradient descent achieves linear convergence to zero MSE with high probability for sufficiently wide shallow NNs with non-affine piecewise affine activations and distinct inputs.
citing papers explorer
-
The Statistical Cost of Adaptation in Multi-Source Transfer Learning
Multi-source transfer learning incurs an intrinsic adaptation cost that can exceed one, with phase transitions separating regimes where bias-agnostic estimators match oracle performance from those where they cannot.
-
Alignment-Sensitive Minimax Rates for Spectral Algorithms with Learned Kernels
Introduces alignment-sensitive effective span dimension (ESD) for learned-kernel spectral algorithms and proves minimax excess risk bounds of order sigma^2 * ESD, with gradient flow shown to reduce ESD.
-
LoRA: Low-Rank Adaptation of Large Language Models
Adapting large language models by training only a low-rank decomposition BA added to frozen weight matrices matches full fine-tuning while cutting trainable parameters by orders of magnitude and adding no inference latency.
-
On the Eigenvalue Decay Rates of a Class of Neural-Network Related Kernel Functions Defined on General Domains
A method is given to determine eigenvalue decay rates of NTK and related kernels on general domains, leading to minimax optimality results for wide neural networks under smoothness assumptions on the target function.
-
Convergence rates for gradient descent in the training of overparameterized artificial neural networks with piecewise affine activation
Batch gradient descent achieves linear convergence to zero MSE with high probability for sufficiently wide shallow NNs with non-affine piecewise affine activations and distinct inputs.