Neural machine translation by jointly learning to align and translate
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.NE (1)
Years: 2019 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Fast Transformer Decoding: One Write-Head is All You Need
Multi-query attention shares keys and values across heads in Transformers, greatly reducing memory bandwidth for faster decoding with only minor quality loss.