Rethinking Skip Connection with Layer Normalization in Transformers and ResNets

Fenglin Liu; Xuancheng Ren; Xu Sun; Yuexian Zou; Zhiyuan Zhang

arxiv: 2105.07205 · v1 · pith:POLKQ7PHnew · submitted 2021-05-15 · 💻 cs.LG · cs.CL · cs.CV

Fenglin Liu, Xuancheng Ren, Zhiyuan Zhang, Xu Sun, Yuexian Zou

classification 💻 cs.LG cs.CL cs.CV

keywords connection skip normalization input layer scale neural performance

Skip connection is a widely used technique for improving the performance and convergence of deep neural networks. It is believed to relieve the optimization difficulty caused by non-linearity by propagating a linear component through the network layers. From another point of view, however, it can also be seen as a modulating mechanism between the input and the output, with the input scaled by a pre-defined value of one. In this work, we investigate how the scale factors into the effectiveness of the skip connection and reveal that a trivial adjustment of the scale leads to spurious gradient exploding or vanishing in line with the depth of the models. This can be addressed by normalization, in particular layer normalization, which induces consistent improvements over the plain skip connection. Inspired by these findings, we further propose to adaptively adjust the scale of the input by recursively applying skip connection with layer normalization, which improves performance substantially and generalizes well across diverse tasks, including both machine translation and image classification.
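
The three variants the abstract contrasts can be sketched in a few lines of NumPy. This is only an illustrative reading of the abstract, not the authors' implementation: the function names, the `scale` and `depth` parameters, the toy sublayer `f`, and the exact form of the recursion (re-adding the input and re-normalizing `depth` times) are all assumptions, and the layer normalization here omits the learned gain and bias.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def scaled_skip(x, f, scale=1.0):
    """Skip connection with the input scaled by `scale`; scale=1.0 is the plain form."""
    return scale * x + f(x)

def scaled_skip_ln(x, f, scale=1.0):
    """Scaled skip connection followed by layer normalization."""
    return layer_norm(scale * x + f(x))

def recursive_skip_ln(x, f, depth=2):
    """Assumed reading of 'recursively applying skip connection with layer
    normalization': add the input back and normalize, `depth` times."""
    y = f(x)
    for _ in range(depth):
        y = layer_norm(x + y)
    return y

# Hypothetical toy sublayer: a single tanh-activated linear map.
rng = np.random.default_rng(0)
w = rng.normal(size=(8, 8)) * 0.1
f = lambda h: np.tanh(h @ w)

x = rng.normal(size=(4, 8))
out = recursive_skip_ln(x, f, depth=2)
```

Under this reading, `depth=1` reduces to an ordinary skip connection followed by layer normalization, and each extra step mixes the input back in at a ratio that depends on the current activations rather than on a fixed constant.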

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geometry Reinforced Efficient Attention Tuning Equipped with Normals for Robust Stereo Matching
cs.CV 2026-04 unverdicted novelty 5.0

GREATEN fuses surface normals with image features via gated contextual-geometric fusion and efficient sparse attentions to cut stereo matching errors by up to 30% on real datasets when trained solely on synthetic data.