pith. machine review for the scientific record.

arxiv: 2211.17192 · v2 · pith:7FGC55WLnew · submitted 2022-11-30 · 💻 cs.LG · cs.CL

Fast Inference from Transformers via Speculative Decoding

Pith reviewed 2026-05-17 22:47 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords speculative decoding · transformer inference · autoregressive models · fast sampling · T5 acceleration · parallel verification · draft model · exact sampling

The pith

Speculative decoding accelerates large autoregressive models by verifying multiple draft tokens in one parallel run of the target model while preserving the exact output distribution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard decoding in models like Transformers generates one token per serial model call, which becomes slow for longer sequences. The paper shows that a smaller, faster draft model can propose several candidate tokens ahead of time. The large target model then evaluates the entire candidate sequence in a single parallel forward pass, accepts each drafted token with a probability based on how its own probability for that token compares with the draft's, and replaces the first rejected token with a sample from a corrected distribution. When the draft is accurate on enough steps, this advances the generation by multiple tokens per expensive call. The approach requires no retraining or architecture changes and preserves the target model's output distribution; the paper reports identical outputs to ordinary decoding on T5-XXL, with measured speedups of 2X to 3X.
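The draft-then-verify step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `speculative_step`, `toy_draft`, and `toy_target` are hypothetical names, the two "models" are fixed probability vectors over a three-token vocabulary, and the target scoring is written as a loop where a real system would use one batched forward pass.

```python
import numpy as np

def toy_target(seq):
    # Hypothetical target model: a fixed next-token distribution.
    return np.array([0.5, 0.3, 0.2])

def toy_draft(seq):
    # Hypothetical draft model: close to the target, but not identical.
    return np.array([0.4, 0.4, 0.2])

def speculative_step(prefix, draft_probs, target_probs, gamma, rng):
    """Run one speculative step; returns between 1 and gamma + 1 new tokens."""
    # 1. The draft model proposes gamma tokens autoregressively (cheap, serial).
    seq, drafted, q_dists = list(prefix), [], []
    for _ in range(gamma):
        q = draft_probs(seq)
        tok = int(rng.choice(len(q), p=q))
        drafted.append(tok)
        q_dists.append(q)
        seq.append(tok)

    # 2. The target model scores all gamma + 1 positions (batched in practice).
    p_dists = [target_probs(list(prefix) + drafted[:i]) for i in range(gamma + 1)]

    # 3. Accept each drafted token with probability min(1, p/q). On the first
    #    rejection, resample from the normalized residual max(0, p - q) and stop.
    out = []
    for i, tok in enumerate(drafted):
        p, q = p_dists[i], q_dists[i]
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
        else:
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            out.append(int(rng.choice(len(residual), p=residual)))
            return out

    # 4. Every draft was accepted: take one extra token from the target.
    out.append(int(rng.choice(len(p_dists[gamma]), p=p_dists[gamma])))
    return out

rng = np.random.default_rng(0)
new_tokens = speculative_step([0], toy_draft, toy_target, gamma=4, rng=rng)
```

Each call advances the sequence by at least one token, so the worst case costs one target pass per token plus the draft overhead.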

Core claim

By running a fast approximation model to generate a short speculative sequence and then evaluating that sequence under the target model in parallel, exact samples from the target distribution can be produced while often accepting more than one token per invocation of the large model.

What carries the argument

Speculative decoding algorithm, which uses a draft model to propose candidate tokens and verifies them against the target model's output distribution in a single batched step.

If this is right

  • Existing off-the-shelf models can be accelerated without retraining or architecture modifications.
  • Generated outputs remain identical to those from standard autoregressive decoding.
  • The method exploits the fact that many language-modeling steps are easier subtasks that smaller models approximate well.
  • Speedups of 2X-3X are achieved on T5-XXL relative to the standard T5X implementation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same draft-and-verify pattern could apply to other autoregressive generation tasks such as music or protein sequences.
  • Pairing the technique with model compression methods might compound the observed speedups.
  • Training draft models specifically to maximize acceptance rate rather than standalone accuracy could raise the average tokens advanced per step.

Load-bearing premise

A sufficiently accurate and faster draft model exists that can produce enough accepted tokens to offset the cost of the verification step.

What would settle it

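Measure the average number of tokens generated per speculative step on representative prompts. If this average does not exceed the cost of a step measured in target-model calls (one verification call plus the draft calls), the method produces no net speedup; the break-even is approximately 1.5 only when the draft calls together cost about half a target call.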
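The break-even point can be estimated with a short calculation. The sketch below assumes that each drafted token is accepted independently with the same probability `alpha`, that `gamma` tokens are drafted per step, and that one draft call costs a fraction `c` of a target call. These names are labels chosen for this illustration, and the independence assumption is a simplification.

```python
def expected_tokens(alpha: float, gamma: int) -> float:
    """Expected tokens produced by one speculative step.

    Sums the geometric series 1 + alpha + ... + alpha**gamma: accepted
    drafts plus the one token the target always contributes.
    """
    if alpha >= 1.0:
        return float(gamma + 1)
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

def walltime_improvement(alpha: float, gamma: int, c: float) -> float:
    """Tokens per step divided by step cost in target-call units.

    A step costs gamma draft calls (gamma * c) plus one target call.
    Values above 1.0 indicate a net speedup under these assumptions.
    """
    return expected_tokens(alpha, gamma) / (gamma * c + 1.0)

# With one drafted token accepted half the time, a step yields 1.5 tokens.
tokens = expected_tokens(0.5, 1)
# That is exactly break-even if the single draft call costs half a target call.
factor = walltime_improvement(0.5, 1, 0.5)
```

Under these assumptions the break-even threshold is `gamma * c + 1` tokens per step, so a cheaper draft model lowers the bar and a costlier one raises it.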

Original abstract

Inference from large autoregressive models like Transformers is slow - decoding K tokens takes K serial runs of the model. In this work we introduce speculative decoding - an algorithm to sample from autoregressive models faster without any changes to the outputs, by computing several tokens in parallel. At the heart of our approach lie the observations that (1) hard language-modeling tasks often include easier subtasks that can be approximated well by more efficient models, and (2) using speculative execution and a novel sampling method, we can make exact decoding from the large models faster, by running them in parallel on the outputs of the approximation models, potentially generating several tokens concurrently, and without changing the distribution. Our method can accelerate existing off-the-shelf models without retraining or architecture changes. We demonstrate it on T5-XXL and show a 2X-3X acceleration compared to the standard T5X implementation, with identical outputs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The manuscript introduces speculative decoding, an algorithm to accelerate autoregressive sampling from large Transformer models without altering the output distribution. A smaller draft model generates candidate tokens autoregressively; these are verified by the target model in a single parallel forward pass, with a corrected sampling step that accepts or rejects tokens to ensure exact equivalence to standard autoregressive decoding from the target. The authors report 2-3x speedups on T5-XXL relative to the standard T5X implementation, with identical outputs and no retraining or architectural changes required.

Significance. If the central construction holds, the work provides a practical, general-purpose technique for reducing the serial bottleneck in Transformer inference by exploiting easier subtasks approximable by faster models. The explicit algorithmic guarantee of distribution preservation (via an acceptance probability derived from the target and draft conditionals) combined with direct empirical measurement on T5-XXL constitutes a reproducible and falsifiable contribution to efficient large-model deployment.

minor comments (2)
  1. [Abstract] The reported 2X-3X range would be more precise if the paper stated the exact hardware, batch size, and baseline T5X implementation details used for the timing measurements.
  2. [§3] The description of the draft-model selection process could include a short discussion of how alignment between draft and target distributions affects the expected number of accepted tokens.
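The link between draft/target alignment and acceptance in the second comment can be made concrete for a single position. Under the accept-with-probability-min(1, p/q) rule, a token drafted from q is accepted with probability equal to the sum over tokens of min(p, q), which is one minus the total variation distance between the two distributions. The function name and the example vectors below are illustrative only.

```python
import numpy as np

def acceptance_rate(p, q) -> float:
    """Probability that a token drafted from q is accepted against target p.

    Sum over tokens of q(x) * min(1, p(x) / q(x)), which simplifies to
    sum over tokens of min(p(x), q(x)).
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.minimum(p, q).sum())

# A draft that is close to the target is accepted most of the time.
close = acceptance_rate([0.5, 0.3, 0.2], [0.4, 0.4, 0.2])
# A draft that puts all its mass where the target has none is never accepted.
disjoint = acceptance_rate([1.0, 0.0], [0.0, 1.0])
```

Averaging this quantity over contexts gives a rough estimate of how many drafted tokens a given draft model will get accepted per step.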

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive review and the recommendation to accept. The provided summary accurately captures the core ideas and empirical results of our work on speculative decoding.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central contribution is an explicit algorithmic construction (draft-model token generation followed by parallel target-model verification with a distribution-preserving sampling correction) whose correctness follows directly from the definition of the target model's conditional distribution. The claimed speedup is obtained by empirical measurement on T5-XXL rather than by any fitted parameter, self-referential equation, or load-bearing self-citation. The requirement for a faster draft model is stated as an explicit precondition and is satisfied in the reported experiments; no step reduces the result to its own inputs by construction.
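For readers who want the distribution-preservation argument spelled out, a short single-position derivation follows. Here p and q denote the target and draft next-token distributions for a fixed context, and β is the overall acceptance probability; the symbols are chosen for this sketch and the steps are a standard reconstruction rather than a quotation of the paper's proof.

```latex
\begin{align*}
\beta &= \sum_{x} q(x)\,\min\!\left(1, \frac{p(x)}{q(x)}\right)
       = \sum_{x} \min\bigl(p(x), q(x)\bigr), \\
p'(x) &= \frac{\max\bigl(0,\, p(x) - q(x)\bigr)}
              {\sum_{x'} \max\bigl(0,\, p(x') - q(x')\bigr)}
       = \frac{p(x) - \min\bigl(p(x), q(x)\bigr)}{1 - \beta}, \\
\Pr[\text{output} = x]
      &= \underbrace{\min\bigl(p(x), q(x)\bigr)}_{\text{drafted and accepted}}
       + \underbrace{(1 - \beta)\, p'(x)}_{\text{rejected, then resampled}} \\
      &= \min\bigl(p(x), q(x)\bigr) + p(x) - \min\bigl(p(x), q(x)\bigr)
       = p(x).
\end{align*}
```

The emitted token therefore follows the target distribution regardless of how good or bad the draft distribution is; draft quality only affects how often the cheap path is taken.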

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method assumes the existence of a faster approximation model whose outputs can be verified in parallel batches; no new physical constants, fitted scalars, or invented particles are introduced. The only background assumptions are standard properties of autoregressive sampling and the ability to run forward passes on both models.

axioms (2)
  • standard math Autoregressive models define a conditional distribution over the next token given previous tokens.
    Invoked in the description of exact sampling from the target model.
  • domain assumption A smaller model can approximate easier subtasks within the overall language-modeling distribution.
    Stated as observation (1) in the abstract; required for the draft model to produce useful proposals.
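The first axiom is the usual chain-rule factorization of a sequence distribution, written here for a sequence of T tokens given a context c (notation assumed for this note):

```latex
\[
p(x_1, \dots, x_T \mid c) \;=\; \prod_{t=1}^{T} p\bigl(x_t \mid c,\, x_1, \dots, x_{t-1}\bigr)
\]
```

Exact sampling from the target means drawing each token from the corresponding factor on the right-hand side, which is what the verification step has to reproduce.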

pith-pipeline@v0.9.0 · 5452 in / 1527 out tokens · 44093 ms · 2026-05-17T22:47:38.700405+00:00 · methodology



Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

  2. Speculative Decoding for Autoregressive Video Generation

    cs.CV 2026-04 conditional novelty 7.0

    A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% qu...

  3. Why and When Visual Token Pruning Fails? A Study on Relevant Visual Information Shift in MLLMs Decoding

    cs.CV 2026-04 unverdicted novelty 7.0

    Visual token pruning in MLLMs fails on complex reasoning due to Relevant Visual Information Shift during decoding, but the DSTP framework fixes it training-free across models.

  4. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  5. Drift-AR: Single-Step Visual Autoregressive Generation via Anti-Symmetric Drifting

    cs.CV 2026-03 unverdicted novelty 7.0

    Drift-AR achieves 3.8-5.5x speedup in AR-diffusion image models by using entropy to enable entropy-informed speculative decoding and single-step (1-NFE) anti-symmetric drifting decoding.

  6. Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    cs.LG 2024-01 conditional novelty 7.0

    Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.

  7. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.

  8. Micro Language Models Enable Instant Responses

    cs.CL 2026-04 conditional novelty 6.0

    Ultra-compact 8-30M parameter models start contextually grounded responses on-device while cloud models seamlessly continue them, enabling responsive AI on power-constrained hardware.

  9. Open-TQ-Metal: Fused Compressed-Domain Attention for Long-Context LLM Inference on Apple Silicon

    cs.LG 2026-04 unverdicted novelty 6.0

    Fused compressed-domain int4 attention on Apple Silicon delivers 48x speedup and 3.2x KV cache compression for 128K-context 70B models while matching FP16 token predictions.

  10. DualDiffusion: A Speculative Decoding Strategy for Masked Diffusion Models

    cs.LG 2026-04 unverdicted novelty 6.0

    DualDiffusion combines a lightweight drafter using approximations with a full verifier to reduce generation steps in masked diffusion models while keeping accuracy on MMLU and GSM8K.

  11. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints

    cs.CL 2023-05 unverdicted novelty 6.0

    Uptraining multi-head transformer checkpoints to grouped-query attention models achieves near multi-head quality at multi-query inference speeds using 5% additional compute.

  12. 31.1 A 14.08-to-135.69Token/s ReRAM-on-Logic Stacked Outlier-Free Large-Language-Model Accelerator with Block-Clustered Weight-Compression and Adaptive Parallel-Speculative-Decoding

    cs.AR 2026-05 unverdicted novelty 5.0

    A ReRAM-on-logic stacked chip delivers 14.08-135.69 tokens/s LLM inference with block-clustered compression and adaptive parallel speculative decoding, yielding 4.46-7.17x speedup over standard methods.

  13. Complexity Horizons of Compressed Models in Analog Circuit Analysis

    cs.AI 2026-05 unverdicted novelty 5.0

    Prerequisite graphs map compressed LLM performance boundaries in analog circuit analysis to allow selecting the smallest viable model for a given task complexity.

  14. EdgeFM: Efficient Edge Inference for Vision-Language Models

    cs.CV 2026-04 unverdicted novelty 5.0

    EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...

  15. SOLARIS: Speculative Offloading of Latent-bAsed Representation for Inference Scaling

    cs.LG 2026-04 unverdicted novelty 5.0

    SOLARIS speculatively precomputes user-item latent representations to decouple large-model inference from real-time serving, delivering 0.67% revenue gain when deployed in Meta's ad system.

  16. LLMOrbit: A Circular Taxonomy of Large Language Models -From Scaling Walls to Agentic AI Systems

    cs.LG 2026-01 unverdicted novelty 3.0

    A survey taxonomy of LLMs identifies three scaling crises and six efficiency paradigms while tracing the shift from generation to tool-using agents.
