Fast Transformer Decoding: One Write-Head is All You Need

Multi-query attention shares keys and values across heads in Transformers, greatly reducing memory bandwidth for faster decoding with only minor quality loss.
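A minimal sketch of the mechanism in NumPy (the shapes, weight names, and toy sizes below are illustrative assumptions, not the paper's reference code): every query head attends over a single shared key head and value head.

```python
import numpy as np

def multi_query_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Causal multi-query attention: n_heads query heads share one K and one V head.

    x  : (seq, d_model)            input activations
    Wq : (d_model, n_heads*d_head) per-head query projections
    Wk : (d_model, d_head)         single shared key projection
    Wv : (d_model, d_head)         single shared value projection
    Wo : (n_heads*d_head, d_model) output projection
    """
    seq, _ = x.shape
    d_head = Wk.shape[1]

    q = (x @ Wq).reshape(seq, n_heads, d_head)   # (seq, heads, d_head)
    k = x @ Wk                                   # (seq, d_head), shared by all heads
    v = x @ Wv                                   # (seq, d_head), shared by all heads

    # scores[t, h, s] = <q[t, h], k[s]> / sqrt(d_head)
    scores = np.einsum('thd,sd->ths', q, k) / np.sqrt(d_head)

    # causal mask: position t may only attend to positions s <= t
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future[:, None, :], -1e9, scores)

    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)

    out = np.einsum('ths,sd->thd', weights, v)   # all heads read the same V
    return out.reshape(seq, n_heads * d_head) @ Wo

# Toy usage with hypothetical sizes.
rng = np.random.default_rng(0)
seq, d_model, heads, d_head = 5, 16, 4, 8
x  = rng.standard_normal((seq, d_model))
Wq = rng.standard_normal((d_model, heads * d_head)) / np.sqrt(d_model)
Wk = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
Wv = rng.standard_normal((d_model, d_head)) / np.sqrt(d_model)
Wo = rng.standard_normal((heads * d_head, d_model)) / np.sqrt(heads * d_head)
print(multi_query_attention(x, Wq, Wk, Wv, Wo, heads).shape)  # (5, 16)
```

During incremental decoding, the key/value cache then holds one vector pair per position instead of one per head, which is where the memory-bandwidth reduction described in the summary comes from.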
CoRR, 2019.

1 Pith paper cites this work; polarity classification is still indexing.

Fields: cs.NE (1)
Years: 2019 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers: 1