pith. machine review for the scientific record.

arXiv: 1802.06509 · v2 · submitted 2018-02-19 · cs.LG

Recognition: unknown

On the Optimization of Deep Networks: Implicit Acceleration by Overparameterization

classification: cs.LG
keywords: depth, optimization, acceleration, deep, effect, expressiveness, increasing, linear

Conventional wisdom in deep learning states that increasing depth improves expressiveness but complicates optimization. This paper suggests that, sometimes, increasing depth can speed up optimization. The effect of depth on optimization is decoupled from expressiveness by focusing on settings where additional layers amount to overparameterization: linear neural networks, a well-studied model. Theoretical analysis, as well as experiments, show that here depth acts as a preconditioner which may accelerate convergence. Even on simple convex problems such as linear regression with $\ell_p$ loss, $p>2$, gradient descent can benefit from transitioning to a non-convex overparameterized objective, more than it would from some common acceleration schemes. We also prove that it is mathematically impossible to obtain the acceleration effect of overparameterization via gradients of any regularizer.
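
The setting the abstract describes can be sketched in a few lines. The example below is an illustration under stated assumptions, not the paper's experimental code: it runs plain gradient descent on linear regression with $\ell_4$ loss, once on the weight vector `w` directly and once on an overparameterized objective in which `w` is replaced by the product `w1 * w2` of a vector and a scalar. The synthetic data, step size, initialization scale, and all function names are hypothetical choices made for this sketch, and it makes no claim about which run converges faster on this data.

```python
import numpy as np


def lp_loss_and_grad(w, X, y, p=4):
    """Return the l_p regression loss mean(|Xw - y|^p) / p and its gradient in w."""
    r = X @ w - y
    loss = np.mean(np.abs(r) ** p) / p
    grad = X.T @ (np.sign(r) * np.abs(r) ** (p - 1)) / len(y)
    return loss, grad


def gd_direct(X, y, steps=3000, lr=5e-3, p=4):
    """Gradient descent on the convex objective, parameterized by w itself."""
    w = np.zeros(X.shape[1])
    losses = []
    for _ in range(steps):
        loss, g = lp_loss_and_grad(w, X, y, p)
        losses.append(loss)
        w = w - lr * g
    return w, losses


def gd_overparam(X, y, steps=3000, lr=5e-3, p=4, init_scale=0.1, seed=0):
    """Gradient descent on the non-convex objective with w = w1 * w2.

    w1 is a vector and w2 a scalar; init_scale is an assumed small
    initialization, chosen here only for illustration.
    """
    rng = np.random.default_rng(seed)
    w1 = init_scale * rng.standard_normal(X.shape[1])
    w2 = init_scale
    losses = []
    for _ in range(steps):
        loss, g = lp_loss_and_grad(w1 * w2, X, y, p)
        losses.append(loss)
        # Chain rule: dL/dw1 = w2 * dL/dw, dL/dw2 = <w1, dL/dw>.
        g1 = w2 * g
        g2 = float(w1 @ g)
        w1, w2 = w1 - lr * g1, w2 - lr * g2
    return w1 * w2, losses


def make_data(n=200, seed=1):
    """Synthetic regression data (an assumption of this sketch)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, 3))
    w_true = np.array([1.0, -0.5, 0.25])
    y = X @ w_true + 0.1 * rng.standard_normal(n)
    return X, y


if __name__ == "__main__":
    X, y = make_data()
    _, direct_losses = gd_direct(X, y)
    _, over_losses = gd_overparam(X, y)
    print("direct final loss:", direct_losses[-1])
    print("overparameterized final loss:", over_losses[-1])
```

Both runs represent the same set of linear predictors, so any difference between the two loss curves comes from the parameterization alone, which is the sense in which the abstract separates optimization from expressiveness.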

This paper has not been read by Pith yet.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Estimating Implicit Regularization in Deep Learning

    stat.ML 2026-05 unverdicted novelty 7.0

    Gradient matching empirically recovers implicit regularization effects such as $\ell_2$ penalties from early stopping and dropout in neural networks.

  2. A Theory of Saddle Escape in Deep Nonlinear Networks

    cs.LG 2026-05 unverdicted novelty 7.0

    Derives exact norm-imbalance identity for deep nonlinear nets, classifying activations into four classes and yielding escape time law τ★ = Θ(ε^{-(r-2)}) governed by bottleneck depth r.

  3. A Theory of Saddle Escape in Deep Nonlinear Networks

    cs.LG 2026-05 conditional novelty 7.0

    An exact norm-imbalance identity classifies activations into four classes and reduces deep nonlinear training flow to a scalar ODE that predicts saddle escape time scaling as ε to the power of minus (r-2) for r bottle...

  4. Geometric and Spectral Alignment for Deep Neural Network II

    cs.LG 2026-05 unverdicted novelty 6.0

    The work establishes margin-verified certificates for physical alignment of residual Jacobian chains by bounding truncation errors and decomposing the Physical Alignment Matrix orthogonally under fitted effective-rank...

  5. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding

    cs.CL 2020-06 unverdicted novelty 6.0

    GShard supplies automatic sharding and conditional computation support that enabled training a 600-billion-parameter multilingual translation model on thousands of TPUs with superior quality.