Recognition: unknown
On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization
read the original abstract
Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization - linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. We also prove that it is mathematically impossible to obtain the acceleration effect of overparametrization via gradients of any regularizer.
This paper has not been read by Pith yet.
Forward citations
Cited by 5 Pith papers
-
Estimating Implicit Regularization in Deep Learning
Gradient matching empirically recovers implicit regularization effects such as l2 penalties from early stopping and dropout in neural networks.
-
A Theory of Saddle Escape in Deep Nonlinear Networks
Derives exact norm-imbalance identity for deep nonlinear nets, classifying activations into four classes and yielding escape time law τ★ = Θ(ε^{-(r-2)}) governed by bottleneck depth r.
-
A Theory of Saddle Escape in Deep Nonlinear Networks
An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as ε to the power of minus (r-2) for r bottle...
-
Geometric and Spectral Alignment for Deep Neural Network II
The work establishes margin-verified certificates for physical alignment of residual Jacobian chains by bounding truncation errors and decomposing the Physical Alignment Matrix orthogonally under fitted effective-rank...
-
GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding
GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.