Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
Conference on Empirical Methods in Natural Language Processing , year=
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2022 1verdicts
ACCEPT 1representative citing papers
citing papers explorer
-
Fast Inference from Transformers via Speculative Decoding
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.