DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks
Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers in neural networks, and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adaptation. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
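To make the idea of regularizing attention weights concrete, here is a minimal sketch in plain Python. It is not taken from the paper: the function names (softmax, drop_attention_row, attend), the element-wise drop, and the renormalization of surviving weights so that each row still sums to 1 are all assumptions chosen for illustration, since the abstract does not specify the dropping or rescaling scheme.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention_row(weights, p, rng):
    """Zero each attention weight with probability p, then renormalize.

    Assumption: rescaling the survivors so the row sums to 1 is one
    plausible choice; the abstract does not say how rescaling is done.
    """
    kept = [w if rng.random() >= p else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        # Every weight was dropped: fall back to the unmodified row.
        return list(weights)
    return [w / total for w in kept]


def attend(query, keys, values, p=0.0, rng=None):
    """Single-query scaled dot-product attention with optional weight dropping."""
    rng = rng or random.Random()
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    if p > 0.0:
        # Training-time regularization; use p=0.0 at inference.
        weights = drop_attention_row(weights, p, rng)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

In this sketch the output stays a convex combination of the value vectors even after dropping, which is the motivation for renormalizing rather than applying the usual 1/(1-p) dropout scaling. Whether that matches the paper's design is not confirmed by the abstract.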
Forward citations
Cited by 2 Pith papers
- Explicit Dropout: Deterministic Regularization for Transformer Architectures
  Explicit dropout reformulates stochastic dropout as deterministic loss penalties for Transformers, matching or exceeding standard performance with independent control per component.
- Language models recognize dropout and Gaussian noise applied to their activations
  Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.