Representing smooth functions as compositions of near-identity functions with implications for deep network optimization

Peter L. Bartlett; Philip M. Long; Steven N. Evans

arxiv: 1804.05012 · v2 · pith:SP7AUMLSnew · submitted 2018-04-13 · 💻 cs.LG · cs.AI · cs.NE · math.ST · stat.ML · stat.TH



keywords: functions, near-identity, critical, residual, lipschitz, network, nonlinear, points


We show that any smooth bi-Lipschitz $h$ can be represented exactly as a composition $h_m \circ \cdots \circ h_1$ of functions $h_1, \ldots, h_m$ that are close to the identity in the sense that each $h_i - \mathrm{Id}$ is Lipschitz, and the Lipschitz constant decreases inversely with the number $m$ of functions composed. This implies that $h$ can be represented to any accuracy by a deep residual network whose nonlinear layers compute functions with a small Lipschitz constant. Next, we consider nonlinear regression with a composition of near-identity nonlinear maps. We show that, regarding Fréchet derivatives with respect to the $h_1, \ldots, h_m$, any critical point of a quadratic criterion in this near-identity region must be a global minimizer. In contrast, if we consider derivatives with respect to parameters of a fixed-size residual network with sigmoid activation functions, we show that there are near-identity critical points that are suboptimal, even in the realizable case. Informally, this means that functional gradient methods for residual networks cannot get stuck at suboptimal critical points corresponding to near-identity layers, whereas parametric gradient methods for sigmoidal residual networks suffer from suboptimal critical points in the near-identity region.
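As a rough illustration of the first claim, the sketch below works through a one-dimensional toy case. It is not the construction used in the paper. The map $h(x) = a x$ with $a > 0$ is an assumption chosen only because it is smooth and bi-Lipschitz and factors in closed form as $m$ copies of $h_i(x) = a^{1/m} x$. The helper names `near_identity_factors` and `compose` are hypothetical.

```python
import math


def near_identity_factors(a: float, m: int):
    """Factor h(x) = a * x (a > 0) into m equal maps h_i(x) = a**(1/m) * x.

    Returns the list of factors and the Lipschitz constant of h_i - Id,
    which for this linear toy case is |a**(1/m) - 1|.
    """
    if a <= 0 or m < 1:
        raise ValueError("need a > 0 and m >= 1")
    scale = a ** (1.0 / m)
    # Bind scale as a default argument so each factor is self-contained.
    factors = [lambda x, s=scale: s * x for _ in range(m)]
    return factors, abs(scale - 1.0)


def compose(factors, x: float) -> float:
    """Apply h_1 first and h_m last, i.e. evaluate (h_m o ... o h_1)(x)."""
    for f in factors:
        x = f(x)
    return x


if __name__ == "__main__":
    a = 3.0  # toy choice, not taken from the paper
    for m in (1, 10, 100, 1000):
        factors, lip = near_identity_factors(a, m)
        # m * lip approaches log(a), so lip itself shrinks like 1/m.
        print(m, compose(factors, 2.0), lip, m * lip, math.log(a))
```

In this toy case the composition reproduces $h$ up to floating-point error, and the Lipschitz constant of each $h_i - \mathrm{Id}$ behaves like $\ln(a)/m$ for large $m$, which matches the inverse dependence on $m$ stated in the abstract.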


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Chaining Meets Chain Rule: Multilevel Entropic Regularization and Training of Neural Nets
cs.LG · 2019-06 · unverdicted · novelty 6.0

Derives algorithm-dependent generalization bounds for neural nets using multilevel entropic regularization and proposes a Metropolis-simulated multi-scale Gibbs training procedure tested on a two-layer net for MNIST.