DFlash: Block Diffusion for Flash Speculative Decoding
Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
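The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. This is not DFlash's implementation: it assumes greedy decoding, replaces both models with toy functions (`toy_target_next` and `toy_draft_block` are hypothetical names invented here), and omits the block diffusion drafter and the conditioning on target-model features. It only shows why verification keeps the output lossless while a drafter that proposes a whole block at once reduces the number of target rounds.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]              # target: greedy next token
DraftFn = Callable[[List[Token], int], List[Token]]  # drafter: a whole block per call


def verify_block(target_next: NextFn, context: List[Token],
                 draft: List[Token]) -> List[Token]:
    """Accept the longest draft prefix that matches the target's greedy choices.

    A real system scores every draft position in one target forward pass;
    this loop only simulates that result position by position.
    """
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(context + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context + accepted))
    return accepted


def speculative_decode(target_next: NextFn, draft_block: DraftFn,
                       prompt: List[Token], max_new_tokens: int,
                       block_size: int) -> Tuple[List[Token], int]:
    """Return the generated sequence and the number of verification rounds."""
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new_tokens:
        draft = draft_block(out, block_size)  # one drafter call per block
        out.extend(verify_block(target_next, out, draft))
        rounds += 1
    return out[: len(prompt) + max_new_tokens], rounds


def autoregressive_decode(target_next: NextFn, prompt: List[Token],
                          max_new_tokens: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new_tokens):
        out.append(target_next(out))
    return out


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_target_next(context: List[Token]) -> Token:
    return (context[-1] + 1) % 10  # the "target" counts upward modulo 10


def toy_draft_block(context: List[Token], block_size: int) -> List[Token]:
    block = [(context[-1] + i) % 10 for i in range(1, block_size + 1)]
    return [0 if t == 5 else t for t in block]  # deliberately wrong on token 5


if __name__ == "__main__":
    tokens, rounds = speculative_decode(toy_target_next, toy_draft_block,
                                        prompt=[2], max_new_tokens=12, block_size=4)
    print(tokens, rounds)
```

Because every round emits at least one token chosen by the target, the result equals the autoregressive baseline regardless of draft quality. Draft quality changes only how many rounds are needed, which is why the abstract stresses higher acceptance rates.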
Forward citations
Cited by 15 Pith papers
- Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding
  Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.
- Test-Time Speculation
  Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.
- FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration
  FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
  Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.
- Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion
  Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.
- Attention Drift: What Autoregressive Speculative Decoding Models Learn
  Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improv...
- Test-Time Speculation
  TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.
- PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding
  PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.
- When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
  KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.
- When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?
  KV cache reuse improves long-range draft acceptance rates in speculative decoding but delivers only marginal end-to-end speedups because shallow drafters cannot accurately estimate target queries and receive sparse gr...
- Accelerating Speculative Decoding with Block Diffusion Draft Trees
  DDTree builds a draft tree from a block diffusion drafter using a best-first heap on its output probabilities and verifies the tree in one target-model pass via an ancestor-only attention mask, increasing average acce...
- SMART: When is it Actually Worth Expanding a Speculative Tree?
  SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.
- D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting
  D-PACE derives per-position weights from a surrogate of expected accepted draft length to shift training focus toward currently limiting positions, yielding measured gains in wall-clock speedup and emitted length acro...
- DMax: Aggressive Parallel Decoding for dLLMs
  DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...