hub Canonical reference

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, John Jumper · 2023 · cs.CL · arXiv 2302.01318

Canonical reference. 74% of citing Pith papers cite this work as background.

95 Pith papers citing it

Background 74% of classified citations

open full Pith review browse 95 citing papers arXiv PDF

abstract

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 17 method 5 other 1

citation-polarity summary

background 17 use method 5 unclear 1

claims ledger

abstract We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion p

co-cited works

representative citing papers

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

cs.CL · 2026-05-08 · unverdicted · novelty 7.0 · 2 refs

SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.

Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL

cs.LG · 2026-05-07 · conditional · novelty 7.0

A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.

UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding

cs.CL · 2026-05-06 · unverdicted · novelty 7.0

UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.

Component-Aware Self-Speculative Decoding in Hybrid Language Models

cs.CL · 2026-05-01 · unverdicted · novelty 7.0

Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.

An Empirical Study of Speculative Decoding on Software Engineering Tasks

cs.SE · 2026-04-29 · unverdicted · novelty 7.0

Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.

FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving

cs.DC · 2026-04-22 · unverdicted · novelty 7.0

FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.

WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference

cs.IT · 2026-04-20 · unverdicted · novelty 7.0

WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.

Speculative Decoding for Autoregressive Video Generation

cs.CV · 2026-04-19 · conditional · novelty 7.0

A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.

From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning

cs.CL · 2026-04-16 · unverdicted · novelty 7.0

SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.

MARS: Enabling Autoregressive Models Multi-Token Generation

cs.CL · 2026-04-08 · unverdicted · novelty 7.0

MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling

cs.LG · 2026-04-05 · unverdicted · novelty 7.0

Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.

EvoESAP: Non-Uniform Expert Pruning for Sparse MoE

cs.LG · 2026-03-06 · conditional · novelty 7.0

EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System

cs.LG · 2026-02-06 · conditional · novelty 7.0

Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.

Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices

cs.DC · 2025-12-06 · conditional · novelty 7.0

Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.

VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping

cs.CV · 2025-11-17 · conditional · novelty 7.0

VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.

Efficient Autoregressive Inference for Transformer Probabilistic Models

stat.ML · 2025-10-10 · conditional · novelty 7.0

A causal autoregressive buffer enables efficient batched autoregressive sampling and joint density evaluation in set-based transformer models by caching context and attending to prior predictions.

citing papers explorer

Showing 50 of 95 citing papers.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 3 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Skim: Speculative Execution for Fast and Efficient Web Agents cs.AI · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding cs.CL · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting cs.CL · 2026-05-08 · unverdicted · none · ref 2 · 2 links · internal anchor
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
Selective Rollout: Mid-Trajectory Termination for Multi-Sample Agent RL cs.LG · 2026-05-07 · conditional · none · ref 30 · internal anchor
A one-parameter early-termination gate based on mean pairwise prefix edit distance reduces wall-clock time by 10.7% and raises held-out success by 2.5 pp in GRPO on ALFWorld by cutting zero-advantage batch dilution.
UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding cs.CL · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.
Component-Aware Self-Speculative Decoding in Hybrid Language Models cs.CL · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
An Empirical Study of Speculative Decoding on Software Engineering Tasks cs.SE · 2026-04-29 · unverdicted · none · ref 6 · internal anchor
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving cs.DC · 2026-04-22 · unverdicted · none · ref 8 · internal anchor
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference cs.IT · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
Speculative Decoding for Autoregressive Video Generation cs.CV · 2026-04-19 · conditional · none · ref 2 · internal anchor
A training-free speculative decoding method for block-based autoregressive video diffusion uses a quality router on worst-frame ImageReward scores to accept drafter proposals, achieving up to 2.09x speedup at 95.7% quality retention.
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
MARS: Enabling Autoregressive Models Multi-Token Generation cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling cs.LG · 2026-04-05 · unverdicted · none · ref 1 · internal anchor
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
EvoESAP: Non-Uniform Expert Pruning for Sparse MoE cs.LG · 2026-03-06 · conditional · none · ref 3 · internal anchor
EvoESAP uses evolutionary search guided by a speculative-decoding-inspired ESAP metric to discover non-uniform layer-wise sparsity allocations for MoE expert pruning, improving generation accuracy up to 19.6% at 50% sparsity.
When RL Meets Adaptive Speculative Training: A Unified Training-Serving System cs.LG · 2026-02-06 · conditional · none · ref 7 · internal anchor
Aurora unifies speculative decoder training and serving via asynchronous RL on inference traces, delivering 1.5x day-0 speedup on frontier models and 1.25x adaptation gains on distribution shifts.
Vec-LUT: Vector Table Lookup for Parallel Ultra-Low-Bit LLM Inference on Edge Devices cs.DC · 2025-12-06 · conditional · none · ref 5 · internal anchor
Vec-LUT delivers up to 4.2x speedup over prior LUT methods for parallel ultra-low-bit LLM inference on edge devices by unifying lookups across tokens and adding cache-aware tensor layouts.
VVS: Accelerating Speculative Decoding for Visual Autoregressive Generation via Partial Verification Skipping cs.CV · 2025-11-17 · conditional · none · ref 4 · internal anchor
VVS accelerates visual AR image generation by partially skipping verifications in speculative decoding, achieving 2.8x fewer target forward passes while preserving competitive quality.
Efficient Autoregressive Inference for Transformer Probabilistic Models stat.ML · 2025-10-10 · conditional · none · ref 1 · internal anchor
A causal autoregressive buffer enables efficient batched autoregressive sampling and joint density evaluation in set-based transformer models by caching context and attending to prior predictions.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs cs.CL · 2024-12-30 · unverdicted · none · ref 232 · internal anchor
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation cs.CV · 2024-06-10 · conditional · none · ref 7 · internal anchor
Scaled vanilla autoregressive models based on Llama achieve 2.18 FID on ImageNet 256x256 image generation, beating popular diffusion models without visual inductive biases.
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads cs.LG · 2024-01-19 · conditional · none · ref 5 · internal anchor
Medusa augments LLMs with multiple decoding heads and tree-based attention to predict and verify several tokens in parallel, yielding 2.2-3.6x inference speedup via two fine-tuning regimes.
Fast Inference from Transformers via Speculative Decoding cs.LG · 2022-11-30 · accept · none · ref 44 · internal anchor
Speculative decoding accelerates exact sampling from large autoregressive models by 2-3x on T5-XXL by running smaller approximation models in parallel to propose token sequences that the large model then verifies in batches while preserving the original output distribution.
FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration cs.CL · 2026-05-19 · unverdicted · none · ref 5 · internal anchor
FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.
OpenJarvis: Personal AI, On Personal Devices cs.LG · 2026-05-16 · unverdicted · none · ref 11 · internal anchor
OpenJarvis decomposes personal AI into Intelligence, Engine, Agents, Tools & Memory, and Learning primitives and applies LLM-guided spec search to produce on-device configurations that reach within 3.2 pp of cloud baselines on average across eight tasks.
STS: Efficient Sparse Attention with Speculative Token Sparsity cs.LG · 2026-05-15 · unverdicted · none · ref 3 · internal anchor
STS repurposes draft-model attention scores from speculative decoding to build token-and-head-wise sparsity masks, delivering 2.67x speedup at ~90% sparsity on NarrativeQA with negligible accuracy loss.
Performance-Driven Policy Optimization for Speculative Decoding with Adaptive Windowing cs.CL · 2026-05-14 · unverdicted · none · ref 2 · internal anchor
PPOW uses window-level RL with cost-aware speedup and proximity rewards plus adaptive divergence-aware windowing to reach 6.29-6.52 acceptance lengths and 3.39-4.36x speedups in speculative decoding.
N-vium: Mixture-of-Exits Transformer for Accelerated Exact Generation cs.LG · 2026-05-13 · unverdicted · none · ref 10 · internal anchor
N-vium achieves 57.9% wall-clock speedup over matched standard transformers at no perplexity cost by mixing exact predictions from multiple model depths.
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration cs.LG · 2026-05-11 · unverdicted · none · ref 4 · internal anchor
CATS achieves up to 5.08x wall-clock speedup for LLM generation on edge devices via memory-matched cascaded tree speculation, outperforming prior methods by 1.45x with no quality loss.
DECO: Sparse Mixture-of-Experts with Dense-Comparable Performance on End-Side Devices cs.LG · 2026-05-11 · unverdicted · none · ref 71 · 3 links · internal anchor
DECO is a sparse MoE architecture with ReLU-based routing, learnable expert scaling, and NormSiLU activation that matches dense Transformer performance at 20% expert activation and delivers 2.93x speedup on Jetson AGX Orin.
Breaking the Reward Barrier: Accelerating Tree-of-Thought Reasoning via Speculative Exploration cs.LG · 2026-05-11 · unverdicted · none · ref 6 · 2 links · internal anchor
SPEX delivers 1.2-3x speedup on ToT algorithms via speculative path selection, dynamic budget allocation, and adaptive early termination, reaching up to 4.1x when combined with token-level speculative decoding.
Attention Drift: What Autoregressive Speculative Decoding Models Learn cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improves acceptance length up to 2x on perturbed templates and 1.18x on long-context data
Edit-Based Refinement for Parallel Masked Diffusion Language Models cs.CL · 2026-05-10 · unverdicted · none · ref 4 · internal anchor
ME-DLM augments parallel masked diffusion models with edit-distance-supervised refinements to raise quality on coding and math benchmarks while using far fewer diffusion steps.
Test-Time Speculation cs.CL · 2026-05-10 · unverdicted · none · ref 2 · 2 links · internal anchor
TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding cs.CL · 2026-05-09 · unverdicted · none · ref 6 · internal anchor
PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
Rethinking Dense Sequential Chains: Reasoning Language Models Can Extract Answers from Sparse, Order-Shuffling Chain-of-Thoughts cs.CL · 2026-05-08 · conditional · none · ref 9 · internal anchor
Reasoning language models extract answers from sparse, order-shuffled chain-of-thought traces with little accuracy loss.
CASCADE: Context-Aware Relaxation for Speculative Image Decoding cs.CV · 2026-05-08 · unverdicted · none · ref 6 · internal anchor
CASCADE formalizes semantic interchangeability and convergence in target model representations to enable context-aware acceptance relaxation in tree-based speculative decoding, delivering up to 3.6x speedup on text-to-image models without quality loss.
Policy-Guided Stepwise Model Routing for Cost-Effective Reasoning cs.AI · 2026-05-07 · unverdicted · none · ref 3 · internal anchor
A small RL-trained policy for stepwise model routing between LLM sizes improves the accuracy-cost tradeoff on math benchmarks over handcrafted strategies and matches large process reward model methods.
SpecKV: Adaptive Speculative Decoding with Compression-Aware Gamma Selection cs.LG · 2026-05-04 · conditional · none · ref 3 · internal anchor
SpecKV uses a small MLP trained on draft model confidence and entropy to dynamically choose the optimal speculation length gamma, achieving 56% better performance than fixed gamma=4 across various tasks and compression levels.
SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference cs.DC · 2026-05-04 · unverdicted · none · ref 24 · 2 links · internal anchor
SPECTRE achieves up to 2.28x speedup for large-model LLM serving by running speculative draft generation and target verification in parallel using idle tail-model services.
When Less is Enough: Efficient Inference via Collaborative Reasoning cs.LG · 2026-05-01 · conditional · none · ref 3 · internal anchor
A large model generates a compact reasoning signal that a small model uses to solve tasks, reducing the large model's output tokens by up to 60% on benchmarks like AIME and GPQA.
Position-Aware Drafting for Inference Acceleration in LLM-Based Generative List-Wise Recommendation cs.IR · 2026-04-30 · unverdicted · none · ref 11 · internal anchor
PAD-Rec augments standard draft models with item-position and step-position embeddings plus learnable gates, delivering up to 3.1x wall-clock speedup and 5% average gain over strong speculative-decoding baselines on four datasets while largely preserving recommendation quality.
Select to Think: Unlocking SLM Potential with Local Sufficiency cs.CL · 2026-04-29 · conditional · none · ref 2 · internal anchor
Small language models can achieve near large-model reasoning performance by learning to re-rank their own top-K token predictions after distilling selection from the large model.
Unifying Sparse Attention with Hierarchical Memory for Scalable Long-Context LLM Serving cs.LG · 2026-04-29 · unverdicted · none · ref 11 · internal anchor
SPIN co-designs sparse attention with hierarchical memory to achieve 1.66-5.66x higher throughput, 7-9x lower TTFT, and up to 58% lower TPOT than vLLM and original sparse implementations.
When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding? cs.CL · 2026-04-29 · unverdicted · none · ref 1 · 2 links · internal anchor
KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

Accelerating Large Language Model Decoding with Speculative Sampling

hub tools

citation-role summary

citation-polarity summary

claims ledger

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer