Stand-Alone Self-Attention in Vision Models

Anselm Levskaya; Ashish Vaswani; Irwan Bello; Jonathon Shlens; Niki Parmar; Prajit Ramachandran

arxiv: 1906.05909 · v1 · pith:JFISADU3new · submitted 2019-06-13 · 💻 cs.CV


Prajit Ramachandran, Niki Parmar, Ashish Vaswani, Irwan Bello, Anselm Levskaya, Jonathon Shlens

classification 💻 cs.CV

keywords self-attention · vision · convolutions · fewer · model · stand-alone · models · baseline


Convolutions are a fundamental building block of modern computer vision systems. Recent approaches have argued for going beyond convolutions in order to capture long-range dependencies. These efforts focus on augmenting convolutional models with content-based interactions, such as self-attention and non-local means, to achieve gains on a number of vision tasks. The natural question that arises is whether attention can be a stand-alone primitive for vision models instead of serving as just an augmentation on top of convolutions. In developing and testing a pure self-attention vision model, we verify that self-attention can indeed be an effective stand-alone layer. A simple procedure of replacing all instances of spatial convolutions with a form of self-attention applied to a ResNet model produces a fully self-attentional model that outperforms the baseline on ImageNet classification with 12% fewer FLOPS and 29% fewer parameters. On COCO object detection, a pure self-attention model matches the mAP of a baseline RetinaNet while having 39% fewer FLOPS and 34% fewer parameters. Detailed ablation studies demonstrate that self-attention is especially impactful when used in later layers. These results establish that stand-alone self-attention is an important addition to the vision practitioner's toolbox.
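The abstract does not say which form of self-attention replaces the spatial convolutions. To make the idea concrete, here is a minimal sketch under stated assumptions: each pixel attends only to a k x k neighborhood around it, there is a single head, no positional information is used, and border pixels simply see fewer neighbors. The function name `local_self_attention` and the weight arguments `wq`, `wk`, `wv` are hypothetical and chosen for illustration. The sketch is written in plain Python so the arithmetic stays explicit, and it should not be read as the authors' layer.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(u * v for u, v in zip(a, b))

def project(vec, w):
    """Multiply a length-C_in vector by a C_in x C_out matrix."""
    return [sum(vec[i] * w[i][j] for i in range(len(vec)))
            for j in range(len(w[0]))]

def local_self_attention(x, wq, wk, wv, k=3):
    """Single-head attention over a k x k window (illustrative assumption).

    x is an H x W x C_in nested list; wq, wk, wv are C_in x C_out matrices.
    Returns an H x W x C_out nested list.
    """
    H, W = len(x), len(x[0])
    q = [[project(p, wq) for p in row] for row in x]
    keys = [[project(p, wk) for p in row] for row in x]
    vals = [[project(p, wv) for p in row] for row in x]
    d_out = len(wv[0])
    r = k // 2
    out = []
    for i in range(H):
        out_row = []
        for j in range(W):
            # Neighbors inside the image; no padding is applied.
            nbrs = [(a, b)
                    for a in range(max(0, i - r), min(H, i + r + 1))
                    for b in range(max(0, j - r), min(W, j + r + 1))]
            logits = [dot(q[i][j], keys[a][b]) for a, b in nbrs]
            # Numerically stable softmax over the neighborhood.
            m = max(logits)
            exps = [math.exp(l - m) for l in logits]
            z = sum(exps)
            weights = [e / z for e in exps]
            # Output is the attention-weighted sum of neighbor values.
            out_row.append([
                sum(w * vals[a][b][c] for w, (a, b) in zip(weights, nbrs))
                for c in range(d_out)
            ])
        out.append(out_row)
    return out

# Tiny 3 x 3 single-channel input holding the values 1..9.
x = [[[float(3 * i + j + 1)] for j in range(3)] for i in range(3)]
zero, one = [[0.0]], [[1.0]]
y = local_self_attention(x, zero, zero, one, k=3)
```

With zero query and key weights every logit is equal, so the attention weights are uniform and the layer reduces to a neighborhood average. That makes the content-based part visible by contrast: nonzero `wq` and `wk` let the weights depend on the pixels themselves, whereas a convolution applies the same learned weights at every position.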


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Reformer: The Efficient Transformer
cs.LG · 2020-01 · accept · novelty 8.0

Reformer matches standard Transformer accuracy on long sequences while using far less memory and running faster via LSH attention and reversible residual layers.