The Shattered Gradients Problem: If resnets are the answer, then what is the question?

Brian McWilliams; David Balduzzi; JP Lewis; Kurt Wan-Duo Ma; Lennox Leary; Marcus Frean

arxiv: 1702.08591 · v2 · pith:4UJ3D23Lnew · submitted 2017-02-28 · cs.NE · cs.LG · stat.ML

keywords gradients · problem · architectures · initialization · networks · skip-connections · batch · deep

Abstract

A long-standing obstacle to progress in deep learning is the problem of vanishing and exploding gradients. Although the problem has largely been overcome via carefully constructed initializations and batch normalization, architectures incorporating skip-connections, such as highway networks and resnets, perform much better than standard feedforward architectures despite well-chosen initialization and batch normalization. In this paper, we identify the shattered gradients problem. Specifically, we show that the correlation between gradients in standard feedforward networks decays exponentially with depth, resulting in gradients that resemble white noise. In contrast, the gradients in architectures with skip-connections are far more resistant to shattering, decaying sublinearly. Detailed empirical evidence is presented in support of the analysis, on both fully-connected networks and convnets. Finally, we present a new "looks linear" (LL) initialization that prevents shattering, with preliminary experiments showing that the new initialization allows very deep networks to be trained without the addition of skip-connections.
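The abstract names the "looks linear" initialization without describing it. As a rough illustration only, the NumPy sketch below follows the construction as we understand it from the body of the paper: concatenated rectifiers (CReLU) paired with mirrored weights of the form [W, -W], so that the network computes an exactly linear map at initialization. The function names (`crelu`, `orthogonal`, `ll_init`, `forward`), the square layer widths, and the use of random orthogonal base matrices are assumptions made for this sketch, not the authors' code.

```python
import numpy as np


def crelu(x):
    # Concatenated rectifier: stack relu(x) and relu(-x) along the feature axis.
    return np.concatenate([np.maximum(x, 0.0), np.maximum(-x, 0.0)], axis=-1)


def orthogonal(n, rng):
    # Random orthogonal matrix from the QR decomposition of a Gaussian matrix
    # (an assumed choice of base matrix for this sketch).
    q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return q


def ll_init(width, depth, rng):
    # Hypothetical helper: build mirrored weights [W; -W] for each layer.
    # Returns the mirrored layer weights and the base matrices W.
    layers, bases = [], []
    for _ in range(depth):
        w = orthogonal(width, rng)
        bases.append(w)
        layers.append(np.concatenate([w, -w], axis=0))  # shape (2*width, width)
    return layers, bases


def forward(x, layers):
    # Since relu(x) - relu(-x) = x, each layer reduces to x @ W at initialization.
    for w in layers:
        x = crelu(x) @ w
    return x


rng = np.random.default_rng(0)
layers, bases = ll_init(width=8, depth=20, rng=rng)
x = rng.standard_normal((5, 8))
y = forward(x, layers)
```

With orthogonal base matrices, the output `y` should have the same row norms as `x` up to floating-point error, because the whole stack collapses to a product of orthogonal matrices at initialization. Once training moves the two halves of each weight matrix apart, the network becomes nonlinear.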

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Switchable Normalization for Learning-to-Normalize Deep Representation
cs.CV · 2019-07 · unverdicted · novelty 7.0

Switchable Normalization learns per-layer weights to combine channel, layer, and minibatch normalizers, claiming robustness to batch size and better results than fixed normalizers on ImageNet, COCO, CityScapes, ADE20K...