super hub Canonical reference

Accelerating Large Language Model Decoding with Speculative Sampling

Charlie Chen, Geoffrey Irving, Jean-Baptiste Lespiau, John Jumper, Laurent Sifre, Sebastian Borgeaud · 2023 · cs.CL · arXiv 2302.01318

Canonical reference. 75% of citing Pith papers cite this work as background.

152 Pith papers citing it

Background 75% of classified citations

open full Pith review browse 152 citing papers more from Charlie Chen arXiv PDF

abstract

We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion parameter language model, achieving a 2-2.5x decoding speedup in a distributed setup, without compromising the sample quality or making modifications to the model itself.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

background 18 method 5 other 1

citation-polarity summary

background 18 use method 5 unclear 1

claims ledger

abstract We present speculative sampling, an algorithm for accelerating transformer decoding by enabling the generation of multiple tokens from each transformer call. Our algorithm relies on the observation that the latency of parallel scoring of short continuations, generated by a faster but less powerful draft model, is comparable to that of sampling a single token from the larger target model. This is combined with a novel modified rejection sampling scheme which preserves the distribution of the target model within hardware numerics. We benchmark speculative sampling with Chinchilla, a 70 billion p

authors

Charlie Chen Geoffrey Irving Jean-Baptiste Lespiau John Jumper Laurent Sifre Sebastian Borgeaud

co-cited works

representative citing papers

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding

cs.CL · 2026-05-13 · unverdicted · novelty 8.0 · 2 refs

Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.

The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits

cs.LG · 2026-05-08 · unverdicted · novelty 8.0

Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.

Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference

cs.DC · 2026-07-02 · unverdicted · novelty 7.0

Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.

Certified Speculative Execution for Untrusted AI Agents

cs.CR · 2026-06-30 · unverdicted · novelty 7.0

CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.

When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding

cs.LG · 2026-06-29 · unverdicted · novelty 7.0

Develops theory for acceptance in speculative decoding under greedy/relaxed/tree criteria, with exact KL certificates and margin bounds, evaluated on Qwen3 models.

Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing

cs.CL · 2026-06-24 · unverdicted · novelty 7.0

LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.

MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling

cs.AI · 2026-06-11 · unverdicted · novelty 7.0

MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.

Hacking Generative Perplexity: Why Unconditional Text Evaluation Needs Distributional Metrics

cs.CL · 2026-06-07 · accept · novelty 7.0

Naive samplers beat published diffusion and flow models on gen-PPL with incoherent output, proving the metric unsound and motivating distributional evaluation suites.

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

cs.LG · 2026-06-05 · unverdicted · novelty 7.0

WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

cs.CV · 2026-06-04 · conditional · novelty 7.0

Parallel Jacobi Decoding accelerates autoregressive image models 4.8x-6.4x by using 2D spatial draft expansion and adjusted attention masks while keeping generation quality competitive.

D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models

cs.DC · 2026-06-03 · unverdicted · novelty 7.0

D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.

Speculative Sampling For Faster Molecular Dynamics

cs.LG · 2026-06-01 · unverdicted · novelty 7.0

LSD extends speculative sampling to second-order Langevin dynamics, achieving 3-9x speedup in MD while exactly sampling from the target distribution without relative error.

Cost-Aware Diffusion Draft Trees for Speculative Decoding

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

cs.CL · 2026-06-01 · unverdicted · novelty 7.0

Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

cs.LG · 2026-05-31 · unverdicted · novelty 7.0

OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.

TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding

cs.AI · 2026-05-30 · unverdicted · novelty 7.0

TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

cs.LG · 2026-05-30 · unverdicted · novelty 7.0

EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

cs.LG · 2026-05-28 · unverdicted · novelty 7.0

BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.

Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

cs.LG · 2026-05-19 · unverdicted · novelty 7.0

Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

Skim: Speculative Execution for Fast and Efficient Web Agents

cs.AI · 2026-05-15 · unverdicted · novelty 7.0

Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.

PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding

cs.CL · 2026-05-15 · unverdicted · novelty 7.0

PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.

Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.

SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.

Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding

cs.LG · 2026-05-08 · unverdicted · novelty 7.0

Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.

citing papers explorer

Showing 50 of 127 citing papers after filters.

Mistletoe: Stealthy Acceleration-Collapse Attacks on Speculative Decoding cs.CL · 2026-05-13 · unverdicted · none · ref 3 · 2 links · internal anchor
Mistletoe introduces a stealthy attack on speculative decoding that collapses acceleration by reducing average accepted length while preserving output semantics.
The Coupling Tax: How Shared Token Budgets Undermine Visible Chain-of-Thought Under Fixed Output Limits cs.LG · 2026-05-08 · unverdicted · none · ref 18 · internal anchor
Shared token budgets between visible chain-of-thought and answers create a coupling tax that makes non-thinking competitive on math benchmarks, with a truncation decomposition predicting the crossover and split budgets improving results.
Lynx: Progressive Speculative Quantization for accelerating KV Transfer in Long-Context Inference cs.DC · 2026-07-02 · unverdicted · none · ref 8 · internal anchor
Lynx partitions KV cache bits into anchor and residual streams for progressive transfer, enabling speculative decoding on partial data followed by verification to match BF16 accuracy at 4-bit-like TTFT.
Certified Speculative Execution for Untrusted AI Agents cs.CR · 2026-06-30 · unverdicted · none · ref 2 · internal anchor
CGPA enables certified speculative execution of untrusted AI proposals in constrained sequential decisions via verifier rejection, conformal boundary gating, and solver deferral, yielding zero violations and regret within noise of the oracle.
When Is a Draft Accepted? A Theory of Acceptance in Speculative Decoding cs.LG · 2026-06-29 · unverdicted · none · ref 5 · internal anchor
Develops theory for acceptance in speculative decoding under greedy/relaxed/tree criteria, with exact KL certificates and margin bounds, evaluated on Qwen3 models.
Efficient and Trainable Language Model Test-Time Scaling via Local Branch Routing cs.CL · 2026-06-24 · unverdicted · none · ref 5 · internal anchor
LBR performs token-level test-time scaling via local branch routing on hidden states, enabling end-to-end RL training and improving Pass@1 and Pass@32 on math benchmarks over CoT and RLVR baselines.
MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling cs.AI · 2026-06-11 · unverdicted · none · ref 4 · internal anchor
MARS is a margin-adversarial stopping rule for parallel LLM test-time scaling that saves 25-47% tokens while matching full-budget majority-vote accuracy by learning trace switch probabilities and applying adversarial bounds.
WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing cs.LG · 2026-06-05 · unverdicted · none · ref 5 · internal anchor
WhiFlash introduces token-level cross-paradigm routing between autoregressive and diffusion drafting models, with cache optimizations, to raise acceptance lengths and deliver up to 69.6% throughput gains over EAGLE-3.
D^2SD: Accelerating Speculative Decoding with Dual Diffusion Draft Models cs.DC · 2026-06-03 · unverdicted · none · ref 5 · internal anchor
D^2SD uses two diffusion drafters in a prefix tree structure with confidence scores to select and recover alternative draft sequences, achieving higher acceptance rates in speculative decoding.
Speculative Sampling For Faster Molecular Dynamics cs.LG · 2026-06-01 · unverdicted · none · ref 45 · internal anchor
LSD extends speculative sampling to second-order Langevin dynamics, achieving 3-9x speedup in MD while exactly sampling from the target distribution without relative error.
Cost-Aware Diffusion Draft Trees for Speculative Decoding cs.CL · 2026-06-01 · unverdicted · none · ref 2 · internal anchor
CaDDTree jointly selects tree structure and budget to maximize expected tokens per unit time in speculative decoding, proving unimodality under convex verification cost and matching oracle DDTree performance on Qwen models.
Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning cs.CL · 2026-06-01 · unverdicted · none · ref 10 · internal anchor
Chunk-Level Guided Generation uses off-the-shelf large LLMs to score fixed-length chunks from small models via likelihoods, matching trained PRM performance on math benchmarks without reward-model training.
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification cs.LG · 2026-05-31 · unverdicted · none · ref 3 · internal anchor
OmniOPD replaces token-level logit matching in on-policy distillation with Monte Carlo chunk-level semantic verification and a peak-entropy scheduler.
TAPS: Target-Aware Prefix Tree Selection for Diffusion-Drafted Speculative Decoding cs.AI · 2026-05-30 · unverdicted · none · ref 9 · internal anchor
TAPS converts diffusion marginal probabilities into path-conditioned acceptance estimates to select prefix-closed subtrees under a fixed verification budget, achieving up to 7.9x end-to-end speedup over autoregressive decoding.
EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing cs.LG · 2026-05-30 · unverdicted · none · ref 85 · internal anchor
EST-PRM stress-tests five PRM models on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench using three label-preserving transformations and reports model-specific vulnerability patterns.
Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting cs.LG · 2026-05-28 · unverdicted · none · ref 9 · internal anchor
BASTION is a budget-aware speculative decoding framework with adaptive tree-structured block diffusion drafting that reports up to 6.61x speedup and 39% improvement over block-diffusion baselines.
Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding cs.LG · 2026-05-19 · unverdicted · none · ref 4 · internal anchor
Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
Skim: Speculative Execution for Fast and Efficient Web Agents cs.AI · 2026-05-15 · unverdicted · none · ref 4 · internal anchor
Skim profiles website patterns offline to enable fast-path speculative execution for web agents, cutting median cost by 1.9x and latency by 33.4% with no accuracy loss on benchmarks.
PSD: Pushing the Pareto Frontier of Diffusion LLMs via Parallel Speculative Decoding cs.CL · 2026-05-15 · unverdicted · none · ref 11 · internal anchor
PSD is a training-free framework that jointly optimizes spatial unmasking and temporal speculative decoding in diffusion LLMs to reach up to 5.5x tokens per forward pass while preserving accuracy comparable to greedy decoding.
Factorization-Error-Free Discrete Diffusion Language Model via Speculative Decoding cs.CL · 2026-05-14 · unverdicted · none · ref 4 · internal anchor
FeF-DLLM achieves factorization-error-free generation in discrete diffusion language models via prefix-conditioned posterior factorization and speculative decoding, delivering 5.04 pp higher accuracy and 3.86x faster inference on GSM8K, MATH, HumanEval, and MBPP.
SlimSpec: Low-Rank Draft LM-Head for Accelerated Speculative Decoding cs.LG · 2026-05-11 · unverdicted · none · ref 2 · internal anchor
SlimSpec replaces the standard LM-head in draft models with a low-rank version to deliver 4-5x faster speculative decoding while preserving full vocabulary and competitive acceptance rates.
Future Validity is the Missing Statistic: From Impossibility to $\Phi$-Estimation for Grammar-Faithful Speculative Decoding cs.LG · 2026-05-08 · unverdicted · none · ref 2 · internal anchor
Speculative decoding under local grammar masking samples from the projected distribution μ^proj instead of the grammar-conditional μ*, and the future-validity function Φ corrects it via a Doob transform to achieve exact sampling from μ*.
SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting cs.CL · 2026-05-08 · unverdicted · none · ref 2 · 2 links · internal anchor
SpecBlock achieves 8-13% higher mean speedup than EAGLE-3 at 44-52% drafting cost via block-iterative drafting with hidden-state inheritance, dynamic rank-head branching, valid-prefix masking, and optional cost-aware bandit adaptation.
UniVer: A Unified Perspective for Multi-step and Multi-draft Speculative Decoding cs.CL · 2026-05-06 · unverdicted · none · ref 3 · internal anchor
UniVer frames tree-based speculative decoding as conditional optimal transport, proving it is lossless with optimal acceptance rates and delivering 4.2-8.5% longer accepted sequences than standard rejection sampling.
Component-Aware Self-Speculative Decoding in Hybrid Language Models cs.CL · 2026-05-01 · unverdicted · none · ref 2 · internal anchor
Component-aware self-speculative decoding achieves high acceptance rates in parallel hybrid models like Falcon-H1 but fails in sequential ones like Qwen3.5, with the gap tied to how components are integrated.
An Empirical Study of Speculative Decoding on Software Engineering Tasks cs.SE · 2026-04-29 · unverdicted · none · ref 6 · internal anchor
Speculative decoding accelerates LLM inference on SE tasks without accuracy loss, with model-based methods suiting code generation and model-free methods suiting repository-level repair and editing.
FASER: Fine-Grained Phase Management for Speculative Decoding in Dynamic LLM Serving cs.DC · 2026-04-22 · unverdicted · none · ref 8 · internal anchor
FASER delivers up to 53% higher throughput and 1.92x lower latency in dynamic LLM serving by adjusting speculative lengths per request, early pruning of rejects, and overlapping draft/verification phases via frontiers.
WISV: Wireless-Informed Semantic Verification for Distributed Speculative Decoding in Device-Edge LLM Inference cs.IT · 2026-04-20 · unverdicted · none · ref 11 · internal anchor
WISV uses a channel-aware semantic acceptance policy on hidden representations to boost accepted sequence length by up to 60.8% and cut interaction rounds by 37.3% in distributed speculative decoding, with under 1% accuracy loss.
From Tokens to Steps: Verification-Aware Speculative Decoding for Efficient Multi-Step Reasoning cs.CL · 2026-04-16 · unverdicted · none · ref 1 · internal anchor
SpecGuard adds step-level verification to speculative decoding via attention grounding and log-probability scores, yielding 3.6% higher accuracy and 11% lower latency on reasoning benchmarks.
MARS: Enabling Autoregressive Models Multi-Token Generation cs.CL · 2026-04-08 · unverdicted · none · ref 1 · internal anchor
MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.
Cactus: Accelerating Auto-Regressive Decoding with Constrained Acceptance Speculative Sampling cs.LG · 2026-04-05 · unverdicted · none · ref 1 · internal anchor
Cactus uses constrained optimization to guarantee bounded divergence from the verifier LLM distribution during speculative sampling, raising acceptance rates without the distortion seen in typical acceptance sampling.
Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs cs.CL · 2024-12-30 · unverdicted · none · ref 232 · internal anchor
o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.
Spec-AUF: Accept-Until-Fail Training under Train-Inference Misalignment for Masked Block Drafters cs.AI · 2026-07-02 · unverdicted · none · ref 2 · internal anchor
Accept-Until-Fail training improves average accepted block length in speculative decoding from 2.40 to 2.61 by limiting cross-entropy support to the drafter's first predicted failure point.
Diffusion-GR2: Diffusion Generative Reasoning Re-ranker cs.IR · 2026-07-01 · unverdicted · none · ref 3 · internal anchor
Diffusion-GR2 converts an AR reasoning re-ranker to block-diffusion via CFT, OPD, and RL stages, recovering near-parity accuracy on Amazon Beauty with 2.4-3.5x decode speedup.
Before Thinking, Learn to Decide: Proactive Routing for Efficient Visual Reasoning cs.CL · 2026-06-29 · unverdicted · none · ref 4 · internal anchor
PRP introduces proactive routing via Draft Rating Learning and Joint Rating Learning to route queries early between draft and target models for efficient multimodal reasoning.
Speculative Pre-Positioning: Decoding Stateful Sessions to the Next Decision Point Off the Critical Path cs.LG · 2026-06-28 · unverdicted · none · ref 6 · internal anchor
Speculative pre-positioning decodes stateful sessions ahead with the target model to enable near-constant-time responses from cached distributions or pre-paid deltas at 87% precision for capable models.
Depth Exploration for LLM Decoding cs.LG · 2026-06-28 · unverdicted · none · ref 27 · internal anchor
DEX replaces single-depth selection with parallel exploration over multiple candidate depths, committing the final-depth token while collapsing reusable states to reduce per-token computation.
RLM-Cascade: Response-Level Speculative Decoding for Cost-Efficient LLM API Serving cs.LG · 2026-06-22 · unverdicted · none · ref 2 · internal anchor
RLM-Cascade applies response-level speculative decoding with a complexity router to reduce LLM API costs by 45.8% on agentic coding tasks while also lowering latency and matching or exceeding baseline quality.
JetSpec: Breaking the Scaling Ceiling of Speculative Decoding with Parallel Tree Drafting cs.CL · 2026-06-16 · unverdicted · none · ref 6 · internal anchor
JetSpec trains a causal draft head to produce branch-consistent trees aligned with target autoregressive scores, achieving up to 9.64x speedup on MATH-500 and outperforming prior SD baselines on Qwen3 models.
SpecGen: Accelerating Agentic Kernel Optimization with Speculative Generation cs.DC · 2026-06-16 · unverdicted · none · ref 6 · internal anchor
SpecGen introduces speculative generation to fork non-reasoning kernel candidates during LLM reasoning traces, enabling early termination and parallel profiling to reduce end-to-end optimization time on H200 GPUs.
Accelerating Speculative Diffusions via Block Verification cs.LG · 2026-06-11 · unverdicted · none · ref 9 · internal anchor
A new residual-sampling scheme for diffusion models permits block verification and yields up to 6.3% speedup via a heuristic self-speculative drafter that needs no training.
From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion cs.CV · 2026-06-10 · unverdicted · none · ref 63 · internal anchor
A 1D token interface with Selective Token Editing improves multimodal image fusion by modeling global appearance factors separately from local 2D structures, yielding best overall performance on four benchmarks.
Teaching Diffusion to Speculate Left-to-Right cs.CL · 2026-06-10 · unverdicted · none · ref 21 · internal anchor
Three training interventions for diffusion drafters raise accepted draft length 21-76% over uniform baseline on reasoning, code, and dialogue tasks.
CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference cs.LG · 2026-06-09 · unverdicted · none · ref 5 · internal anchor
CLP is a lightweight linear predictor for safe multi-token spans in LLM decoding that delivers 1.14x-1.29x speedup on Qwen2.5 models with zero measured quality degradation.
K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling cs.LG · 2026-06-09 · unverdicted · none · ref 9 · internal anchor
K-Forcing introduces progressive self-forcing distillation to train a conditional push-forward model that jointly decodes k future tokens per forward pass, yielding 2.4-3.5x speedup at k=4 with modest quality loss on LM1B and OpenWebText.
PathRelax: Parallel-Path Relaxed Speculative Jacobi Decoding for Accelerating Auto-Regressive Text-to-Image Generation cs.CV · 2026-06-09 · unverdicted · none · ref 4 · internal anchor
PathRelax achieves ~4x speedup in autoregressive text-to-image generation on three benchmarks by expanding draft search with parallel paths and cross-path relaxed verification.
AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis cs.LG · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
A system that auto-generates statically verified megakernels for LLM forward passes, retargets across NVIDIA architectures from one source, matches reference outputs exactly, and self-improves via an agent loop.
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech cs.SD · 2026-06-08 · unverdicted · none · ref 3 · internal anchor
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models cs.LG · 2026-06-07 · unverdicted · none · ref 4 · internal anchor
Sparrow uses a dynamic sparsity schedule keyed to the lower tail of sparse-to-dense actor-policy mismatch to enable stable and faster rollouts in long-context RL for LLMs.
Multi-SPIN: Multi-Access Speculative Inference for Cooperative Token Generation at the Edge cs.DC · 2026-06-03 · unverdicted · none · ref 12 · internal anchor
Multi-SPIN extends speculative inference to multi-user edge systems via joint optimization of draft lengths and bandwidth allocation, yielding up to 88% higher sum token goodput than baselines in Llama-2 and Qwen experiments.

Accelerating Large Language Model Decoding with Speculative Sampling

hub tools

citation-role summary

citation-polarity summary

claims ledger

authors

co-cited works

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer