DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

Lin Zehui, Pengfei Liu, Luyao Huang, Junkun Chen, Xipeng Qiu, Xuanjing Huang

classification cs.CL

keywords layer, fully-connected, convolutional, dropattention, dropout, method, networks, overfitting

Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers in neural networks, and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adapting. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
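
The abstract says only that the attention weights are regularized; it does not give the procedure. As a rough illustration of what dropout applied to attention weights could look like, the following is a minimal sketch in plain Python. The function names (`drop_attention`, `attend`), the element-wise dropping scheme, and the renormalization of surviving weights so each row sums to 1 are all assumptions made for this example, not details confirmed by the text above.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention(weights, p, rng):
    """Hypothetical element-wise dropout on a row-normalized attention matrix.

    weights: list of rows, each a list of non-negative floats summing to 1.
    p: probability of dropping each individual weight (0 <= p < 1).
    rng: a random.Random instance, passed in for reproducibility.
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    out = []
    for row in weights:
        kept = [w if rng.random() >= p else 0.0 for w in row]
        total = sum(kept)
        # Assumption: renormalize survivors so the row sums to 1 again.
        # A row whose weights were all dropped is left as zeros.
        out.append([w / total for w in kept] if total > 0 else kept)
    return out


def attend(queries, keys, values, p=0.0, rng=None, training=False):
    """Scaled dot-product attention with optional weight dropout."""
    scale = math.sqrt(len(keys[0]))
    weights = [
        softmax([sum(a * b for a, b in zip(q, k)) / scale for k in keys])
        for q in queries
    ]
    if training and p > 0.0:
        weights = drop_attention(weights, p, rng or random.Random())
    dim = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(dim)]
        for row in weights
    ]
```

In this sketch, dropout is active only when `training=True`, so inference uses the full attention distribution. Renormalizing instead of scaling by `1 / (1 - p)` keeps each row a valid probability distribution; whether that matches the paper's choice is not stated in the abstract.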

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Explicit Dropout: Deterministic Regularization for Transformer Architectures
cs.LG 2026-04 unverdicted novelty 6.0

Explicit dropout reformulates stochastic dropout as deterministic loss penalties for Transformers, matching or exceeding standard performance with independent control per component.

Language models recognize dropout and Gaussian noise applied to their activations
cs.AI 2026-04 unverdicted novelty 6.0

Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.