Attention Is All You Need
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CL (1)
Years: 2019 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
Sharing Attention Weights for Fast Transformer
Sharing attention weights in adjacent Transformer layers yields 1.3X inference speedup with negligible BLEU loss on ten WMT and NIST tasks.