pith. sign in

hub Canonical reference

Accelerating Large Language Model Decoding with Speculative Sampling

Canonical reference. 74% of citing Pith papers cite this work as background.

95 Pith papers citing it
Background 74% of classified citations
abstract

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.

hub tools

citation-role summary

background 17 method 5 other 1

citation-polarity summary

claims ledger

  • abstract We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion p

co-cited works

representative citing papers

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

Speculative Decoding for Autoregressive Video Generation

cs.CV · 2026-04-19 · conditional · novelty 7.0

A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.

MARS: Enabling Autoregressive Models Multi-Token Generation

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

citing papers explorer

Showing 50 of 95 citing papers.