Attention sinks emerge in language models from the inter-token dependence that softmax normalization imposes on attention scores, and they do not appear when softmax is replaced with sigmoid attention (no normalization) in models up to 1B parameters.
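The contrast in the TLDR above is between softmax attention, whose weights are forced to sum to one across tokens, and elementwise sigmoid attention, which has no such constraint. A minimal NumPy sketch of that difference (illustrative only; the scores and shapes are made up and this is not the paper's experimental setup):

```python
import numpy as np

def softmax(x):
    """Standard softmax: couples the weights so they sum to 1."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    """Elementwise sigmoid: each weight is independent of the others."""
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical attention logits for one query over 4 keys,
# all nearly uninformative (near-zero scores).
scores = np.array([0.1, 0.0, -0.1, 0.0])

soft_w = softmax(scores)  # must sum to 1: some token has to absorb
                          # attention mass, which is how a "sink" can form
sig_w = sigmoid(scores)   # no sum-to-1 constraint: no token is forced
                          # to soak up leftover attention

print(soft_w.sum())  # 1.0 (up to float error)
print(sig_w.sum())   # unconstrained (here ~2.0)
```

The point of the sketch: under softmax the attention given to one token necessarily depends on the scores of all the others, whereas under sigmoid each weight depends only on its own score.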
1 Pith paper cites this work (polarity classification is still indexing).
Fields: cs.CL
Years: 2024
Verdicts: ACCEPT
Representative citing paper: BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
When Attention Sink Emerges in Language Models: An Empirical View