arxiv: 2211.17192 · v2 · pith:7FGC55WLnew · submitted 2022-11-30 · 💻 cs.LG · cs.CL

Fast Inference from Transformers via Speculative Decoding

Yaniv Leviathan, Matan Kalman, Yossi Matias

Pith reviewed 2026-05-17 22:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords speculative decoding · transformer inference · autoregressive models · fast sampling · T5 acceleration · parallel verification · draft model · exact sampling

The pith

Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard autoregressive decoding in models such as Transformers generates one token per serial model call, so producing K tokens costs K sequential runs of the model. The paper shows that a smaller, faster draft model can propose several candidate tokens ahead of time. The large target model then scores the entire candidate sequence in a single parallel forward pass and accepts a prefix of it: each draft token is kept with a probability set by the ratio of target to draft probabilities, and the first rejected token is replaced by a sample from a corrected distribution. When the draft is accurate on enough steps, generation advances by multiple tokens per expensive call. The approach requires no retraining or architecture changes and leaves the output distribution unchanged; on T5-XXL the authors report 2X-3X speedups over the standard T5X implementation with identical outputs.

Core claim

By running a fast approximation model to generate a short speculative sequence and then evaluating that sequence under the target model in parallel, exact samples from the target distribution can be produced while often accepting more than one token per invocation of the large model.

What carries the argument

Speculative decoding algorithm, which uses a draft model to propose candidate tokens and verifies them against the target model's output distribution in a single batched step.
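
The draft-and-verify loop can be sketched in a few lines. This is a minimal toy illustration, not the authors' implementation: `target_probs` and `draft_probs` are hypothetical stand-ins that map a token prefix to a next-token distribution, the helper `sample` is introduced here for convenience, and the target's gamma + 1 distributions are computed in a Python loop where a real system would use one batched forward pass. The accept/reject rule (keep a draft token x with probability min(1, p(x)/q(x)), otherwise resample from the positive part of p - q) follows the paper's sampling scheme.

```python
import random

def sample(probs, rng):
    """Draw an index in proportion to the given (possibly unnormalized) weights."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def speculative_step(prefix, target_probs, draft_probs, gamma, rng):
    """One speculative decoding step; returns the list of newly generated tokens.

    target_probs / draft_probs are hypothetical callables:
    prefix (list of ints) -> list of next-token probabilities.
    """
    # 1. Draft gamma tokens autoregressively with the cheap model.
    drafted, q_dists = [], []
    for _ in range(gamma):
        q = draft_probs(prefix + drafted)
        drafted.append(sample(q, rng))
        q_dists.append(q)

    # 2. Score every drafted position with the target model.
    #    (A real implementation does this in one parallel forward pass.)
    p_dists = [target_probs(prefix + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept drafted tokens left to right until the first rejection.
    out = []
    for x, p, q in zip(drafted, p_dists, q_dists):
        if rng.random() < min(1.0, p[x] / q[x]):
            out.append(x)
        else:
            # Rejected: resample from the residual max(0, p - q).
            residual = [max(pi - qi, 0.0) for pi, qi in zip(p, q)]
            out.append(sample(residual, rng))
            return out

    # 4. Every draft was accepted: take one extra token from the target.
    out.append(sample(p_dists[gamma], rng))
    return out

# Toy, context-independent models over a 3-token vocabulary (illustrative only).
def toy_target(prefix):
    return [0.5, 0.3, 0.2]

def toy_draft(prefix):
    return [0.4, 0.4, 0.2]

rng = random.Random(0)
new_tokens = speculative_step([1], toy_target, toy_draft, gamma=4, rng=rng)
print(len(new_tokens), new_tokens)
```

Each call yields between 1 and gamma + 1 tokens for a single target-model evaluation, which is where the speedup comes from.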

If this is right

Existing off-the-shelf models can be accelerated without retraining or architecture modifications.
Generated outputs remain identical to those from standard autoregressive decoding.
The method exploits the fact that many language-modeling steps are easier subtasks that smaller models approximate well.
Speedups of 2X-3X are achieved on T5-XXL relative to the standard T5X implementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same draft-and-verify pattern could apply to other autoregressive generation tasks such as music or protein sequences.
Pairing the technique with model compression methods might compound the observed speedups.
Training draft models specifically to maximize acceptance rate rather than standalone accuracy could raise the average tokens advanced per step.

Load-bearing premise

A sufficiently accurate and faster draft model exists that can produce enough accepted tokens to offset the cost of the verification step.

What would settle it

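Measure the average number of tokens generated per speculative step on representative prompts. Each step costs one target-model run plus the draft-model calls, so the break-even point depends on the draft's relative cost: the method yields a net speedup only when that average exceeds the step's cost measured in target-model runs. The threshold of approximately 1.5 applies when drafting adds about half a target run per step; a cheaper draft lowers it toward 1.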
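
To make the break-even condition concrete, the sketch below evaluates the cost model from the paper's analysis, under its simplifying assumption that each draft token is accepted independently with probability alpha. The expected number of tokens produced per step is (1 - alpha^(gamma+1)) / (1 - alpha), and one step costs gamma * c + 1 target-run equivalents, where gamma is the number of drafted tokens and c is the draft-to-target cost ratio. The function names and the sample values of alpha, gamma, and c are illustrative choices, not measurements from the paper.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens per step if each draft token is accepted i.i.d. with probability alpha."""
    if alpha == 1.0:
        return gamma + 1.0  # every draft accepted, plus the extra target token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def walltime_speedup(alpha, gamma, c):
    """Tokens per step divided by the step's cost in target-model runs (gamma * c + 1)."""
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)

# Hypothetical settings: 5 drafted tokens, draft costs 5% of a target run.
for alpha in (0.3, 0.6, 0.9):
    tokens = expected_tokens_per_step(alpha, 5)
    speedup = walltime_speedup(alpha, 5, 0.05)
    print(f"alpha={alpha}: tokens/step={tokens:.2f}, speedup={speedup:.2f}")
```

A speedup value below 1 means the drafting overhead outweighs the accepted tokens, which is the failure condition this section describes.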

Original abstract

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Speculative decoding delivers a 2-3x inference speedup on T5-XXL by pairing a draft model with parallel verification and a rejection-sampling rule that preserves the exact target distribution.

Hey,

This paper's main idea is to use a fast draft model to speculate on the next few tokens, then check them all in one parallel pass of the big model, using rejection sampling to keep exactly the same output distribution. On T5-XXL it delivers 2-3x faster inference with identical results and no retraining.

The new part is how they set up the sampling so that the distribution is preserved during parallel verification. The construction is built from first principles rather than adapted from an existing technique. They handle the presentation cleanly, with a clear algorithm and solid empirical numbers from actual runs rather than theory alone. Credit for showing it works on an off-the-shelf large model.

The soft spot is the dependence on a good draft model. The speedup only materializes if enough tokens are accepted to cover the cost of drafting and verification. Their experiments pick a draft that works, but there's not much on how to find or train one for arbitrary cases. That's probably the main practical hurdle. Minor point: the baseline is the standard T5X code, which is fine, but an optimized baseline might narrow the gap a bit.

This would be worth discussing in a reading group for the details of the rejection rule. It's aimed at people doing inference optimization for large models in real applications. I'd say send it for peer review, since the evidence supports the claims and the method is usable as is.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces speculative decoding, an algorithm that accelerates autoregressive sampling from large Transformer models without altering the output distribution. A smaller draft model generates candidate tokens autoregressively; the target model then verifies them in a single parallel forward pass, with a corrected sampling step that accepts or rejects tokens so that the result is exactly equivalent to standard autoregressive decoding from the target. The authors report 2-3x speedups on T5-XXL relative to the standard T5X implementation, with identical outputs and no retraining or architectural changes required.

Significance. If the central construction holds, the work provides a practical, general-purpose technique for reducing the serial bottleneck in Transformer inference by exploiting easier subtasks that faster models can approximate. The explicit algorithmic guarantee of distribution preservation (via an acceptance probability derived from the ratio of the target and draft conditionals) combined with direct empirical measurement on T5-XXL constitutes a reproducible and falsifiable contribution to efficient large-model deployment.

minor comments (2)

[Abstract] The reported 2X-3X range would be more informative if the paper stated the exact hardware, batch size, and baseline T5X implementation details used for the timing measurements.
[§3] The description of the draft-model selection process could include a short discussion of how alignment between draft and target distributions affects the expected number of accepted tokens.
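
The alignment effect raised in the second comment has a simple closed form for a single position. A draft token sampled from q is accepted with probability min(1, p(x)/q(x)), so the per-position acceptance probability is the sum over x of min(p(x), q(x)), which equals one minus the total variation distance between the two distributions. The snippet below illustrates this with made-up three-token distributions; the values and function names are hypothetical and chosen only to show the trend.

```python
def acceptance_probability(p, q):
    """Probability that a token drawn from draft q is accepted against target p."""
    return sum(min(pi, qi) for pi, qi in zip(p, q))

def total_variation(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

target = [0.6, 0.3, 0.1]
close_draft = [0.5, 0.35, 0.15]   # well aligned with the target
poor_draft = [0.1, 0.2, 0.7]      # badly aligned with the target

for name, q in [("close", close_draft), ("poor", poor_draft)]:
    beta = acceptance_probability(target, q)
    print(f"{name}: acceptance={beta:.2f}, 1 - TV={1.0 - total_variation(target, q):.2f}")
```

The better-aligned draft is accepted far more often, which is why the choice of draft model drives the expected number of accepted tokens.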

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. The provided summary accurately captures the core ideas and empirical results of our work on speculative decoding.

Circularity Check

0 steps flagged

No significant circularity identified

The paper's central contribution is an explicit algorithmic construction (draft-model token generation followed by parallel target-model verification with a distribution-preserving sampling correction) whose correctness follows directly from the definition of the target model's conditional distribution. The claimed speedup is obtained by empirical measurement on T5-XXL rather than by any fitted parameter, self-referential equation, or load-bearing self-citation. The requirement for a faster draft model is stated as an explicit precondition and is satisfied in the reported experiments; no step reduces the result to its own inputs by construction.
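
The distribution-preservation argument is short enough to sketch for a single position. Here p is the target's next-token distribution and q the draft's; the symbols β (overall acceptance probability) and p' (the distribution used for resampling after a rejection) are notation assumed for this sketch.

```latex
\begin{aligned}
\beta &= \sum_{x} q(x)\,\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big) = \sum_{x} \min\big(p(x), q(x)\big), \\
p'(x) &= \frac{\max\big(0,\, p(x) - q(x)\big)}{\sum_{x'} \max\big(0,\, p(x') - q(x')\big)}
       = \frac{p(x) - \min\big(p(x), q(x)\big)}{1 - \beta}, \\
P(\text{output} = x) &= \underbrace{q(x)\,\min\!\Big(1, \tfrac{p(x)}{q(x)}\Big)}_{\text{drafted and accepted}}
       + \underbrace{(1 - \beta)\, p'(x)}_{\text{rejected, then resampled}} \\
      &= \min\big(p(x), q(x)\big) + p(x) - \min\big(p(x), q(x)\big) = p(x).
\end{aligned}
```

The output token is therefore distributed exactly according to p, whatever q is; the draft's quality affects only how often the cheap path is taken.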

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method assumes the existence of a faster approximation model whose outputs can be verified in parallel batches; no new physical constants, fitted scalars, or invented particles are introduced. The only background assumptions are standard properties of autoregressive sampling and the ability to run forward passes on both models.

axioms (2)

[standard math] Autoregressive models define a conditional distribution over the next token given previous tokens.
Invoked in the description of exact sampling from the target model.
[domain assumption] A smaller model can approximate easier subtasks within the overall language-modeling distribution.
Stated as observation (1) in the abstract; required for the draft model to produce useful proposals.

pith-pipeline@v0.9.0 · 5452 in / 1527 out tokens · 44093 ms · 2026-05-17T22:47:38.700405+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Foundation.DAlembert.Inevitability · bilinear_family_forced
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.

IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding
cs.LG 2026-05 unverdicted novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Speculative Decoding for Autoregressive Video Generation
cs.CV 2026-04 conditional novelty 7.0

A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...
Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding
cs.CV 2026-04 unverdicted novelty 7.0

Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
cs.LG 2026-04 unverdicted novelty 7.0

Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting
cs.CV 2026-03 unverdicted novelty 7.0

Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
cs.LG 2024-01 conditional novelty 7.0

Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
cs.LG 2026-05 unverdicted novelty 6.0

Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
Micro Language Models Enable Instant Responses
cs.CL 2026-04 conditional novelty 6.0

Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.
Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon
cs.LG 2026-04 unverdicted novelty 6.0

Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.
DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models
cs.LG 2026-04 unverdicted novelty 6.0

DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
cs.CL 2023-05 unverdicted novelty 6.0

Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.
31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding
cs.AR 2026-05 unverdicted novelty 5.0

A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.
Complexity Horizons of Compressed Models in Analog Circuit Analysis
cs.AI 2026-05 unverdicted novelty 5.0

Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.
EdgeFM: Efficient Edge Inference for Vision-Language Models
cs.CV 2026-04 unverdicted novelty 5.0

EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling
cs.LG 2026-04 unverdicted novelty 5.0

SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering 0.67% revenue gain when deployed in Meta's ad system.
LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems
cs.LG 2026-01 unverdicted novelty 3.0

A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.

Reference graph

Works this paper leans on

67 extracted references · 67 canonical work pages · cited by 16 Pith papers · 9 internal anchors
