Attention sinks emerge in language models from the inter-token dependence that softmax normalization induces among attention scores; they do not appear when using sigmoid attention without normalization, in models up to 1B parameters.
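A minimal sketch of the mechanism behind this claim (illustrative only, not code from the paper): softmax normalization couples every token's attention weight to all the other scores, whereas elementwise sigmoid scoring leaves each weight independent.

```python
import math

scores = [2.0, 1.0, 0.5]  # toy attention scores for three tokens

def softmax(xs):
    # Normalized weights: each weight depends on ALL scores via the shared sum.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def sigmoid_attn(xs):
    # Unnormalized weights: each weight depends only on its own score.
    return [1.0 / (1.0 + math.exp(-x)) for x in xs]

# Boost one token's score and observe the effect on an untouched token.
bumped = [scores[0] + 3.0] + scores[1:]

# Softmax couples weights: boosting token 0 shrinks token 1's weight.
assert softmax(bumped)[1] < softmax(scores)[1]
# Sigmoid weights are per-token: token 1's weight is unchanged.
assert sigmoid_attn(bumped)[1] == sigmoid_attn(scores)[1]
```

This coupling under softmax is the "token dependence on attention scores" the summary refers to; removing the normalization removes the dependence.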
What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, pp. 22964–22984.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL
Year: 2024
Verdict: ACCEPT
Representative citing paper:
When Attention Sink Emerges in Language Models: An Empirical View