pith. machine review for the scientific record.

arxiv: 1907.11065 · v2 · submitted 2019-07-25 · 💻 cs.CL

Recognition: unknown

DropAttention: A Regularization Method for Fully-Connected Self-Attention Networks

keywords layer, fully-connected, convolutional, dropattention, dropout, method, networks, overfitting

Various dropout methods have been designed for the fully-connected, convolutional, and recurrent layers in neural networks and have been shown to be effective at avoiding overfitting. As an appealing alternative to recurrent and convolutional layers, the fully-connected self-attention layer surprisingly lacks a specific dropout method. This paper explores the possibility of regularizing the attention weights in Transformers to prevent different contextualized feature vectors from co-adaptation. Experiments on a wide range of tasks show that DropAttention can improve performance and reduce overfitting.
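
The abstract describes regularizing attention weights but does not spell out the mechanism. The following is a minimal sketch of one way such a scheme could look, not the paper's confirmed procedure: each softmax attention weight is zeroed with probability `p`, and the surviving weights in the row are rescaled to sum to 1. The renormalization step, the element-wise dropping pattern, and the names `drop_attention_row` and `attend` are all assumptions made for illustration.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def drop_attention_row(weights, p, rng):
    """Zero each attention weight with probability p, then rescale.

    Assumption of this sketch: survivors are renormalized so the row
    still sums to 1, rather than being scaled by 1 / (1 - p).
    """
    kept = [w if rng.random() >= p else 0.0 for w in weights]
    total = sum(kept)
    if total == 0.0:
        return kept  # every weight was dropped: the row stays all-zero
    return [w / total for w in kept]


def attend(query, keys, values, p=0.0, rng=None):
    """Single-query scaled dot-product attention with dropped weights."""
    rng = rng or random.Random()
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = drop_attention_row(softmax(scores), p, rng)
    dim = len(values[0])
    output = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return output, weights


if __name__ == "__main__":
    rng = random.Random(0)
    query = [1.0, 0.0]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    output, weights = attend(query, keys, values, p=0.5, rng=rng)
    print(weights, output)
```

As with ordinary dropout, a scheme like this would be applied only during training; setting `p=0.0` recovers plain attention.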

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Explicit Dropout: Deterministic Regularization for Transformer Architectures

    cs.LG 2026-04 unverdicted novelty 6.0

    Explicit dropout reformulates stochastic dropout as deterministic loss penalties for Transformers, matching or exceeding standard performance with independent control per component.

  2. Language models recognize dropout and Gaussian noise applied to their activations

    cs.AI 2026-04 unverdicted novelty 6.0

    Language models detect, localize, and distinguish dropout from Gaussian noise applied to their activations, often with high accuracy.