
arxiv: 2602.06036 · v2 · pith:A35AM6EAnew · submitted 2026-02-05 · 💻 cs.CL

DFlash: Block Diffusion for Flash Speculative Decoding

classification 💻 cs.CL
keywords decoding · dflash · diffusion · model · models · speculative · autoregressive · draft

Autoregressive large language models (LLMs) deliver strong performance but require inherently sequential decoding, leading to high inference latency and poor GPU utilization. Speculative decoding mitigates this bottleneck by using a fast draft model whose outputs are verified in parallel by the target LLM; however, existing methods still rely on autoregressive drafting, which remains sequential and limits practical speedups. Diffusion LLMs offer a promising alternative by enabling parallel generation, but current diffusion models typically underperform compared with autoregressive models. In this paper, we introduce DFlash, a speculative decoding framework that employs a lightweight block diffusion model for parallel drafting. By generating draft tokens in a single forward pass and conditioning the draft model on context features extracted from the target model, DFlash enables efficient drafting with high-quality outputs and higher acceptance rates. Experiments show that DFlash achieves over 6x lossless acceleration across a range of models and tasks, delivering up to 2.5x higher speedup than the state-of-the-art speculative decoding method EAGLE-3.
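The draft-then-verify loop described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and chosen for illustration: `toy_target` stands in for the target LLM's greedy next-token choice, `toy_drafter` stands in for a drafter that proposes a whole block of `k` tokens per call, and acceptance uses simple greedy exact matching. The page does not describe DFlash's actual acceptance rule, its conditioning on target features, or its diffusion drafter, so none of that is modeled. In a real system the target would score all draft positions in one parallel forward pass; the loop below checks them one by one only for readability.

```python
from typing import Callable, List, Tuple

Token = int
# Hypothetical interfaces, not taken from the paper:
NextToken = Callable[[List[Token]], Token]                # target's greedy next token
BlockDrafter = Callable[[List[Token], int], List[Token]]  # proposes k tokens per call


def speculative_step(prefix: List[Token], target_next: NextToken,
                     draft_block: BlockDrafter, k: int) -> List[Token]:
    """Draft k tokens, keep the longest prefix the target agrees with,
    then append one token chosen by the target itself."""
    draft = draft_block(prefix, k)
    accepted: List[Token] = []
    for tok in draft:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # whole block accepted: bonus token
    return accepted


def generate(prompt: List[Token], target_next: NextToken, draft_block: BlockDrafter,
             k: int, max_new: int) -> Tuple[List[Token], int]:
    """Run speculative steps until max_new tokens exist; return (tokens, step count)."""
    out = list(prompt)
    steps = 0
    while len(out) - len(prompt) < max_new:
        out.extend(speculative_step(out, target_next, draft_block, k))
        steps += 1
    return out[: len(prompt) + max_new], steps


def toy_target(seq: List[Token]) -> Token:
    """Toy stand-in for the target model: a deterministic next-token rule."""
    return (3 * seq[-1] + len(seq)) % 11


def toy_drafter(prefix: List[Token], k: int) -> List[Token]:
    """Toy stand-in for a block drafter: mostly agrees with the target,
    but is deliberately wrong whenever the current length is a multiple of 5."""
    seq, block = list(prefix), []
    for _ in range(k):
        tok = toy_target(seq)
        if len(seq) % 5 == 0:
            tok = (tok + 1) % 11
        block.append(tok)
        seq.append(tok)
    return block


def greedy(prompt: List[Token], n: int) -> List[Token]:
    """Plain sequential decoding with the target, used as the reference output."""
    out = list(prompt)
    for _ in range(n):
        out.append(toy_target(out))
    return out


tokens, steps = generate([1, 2], toy_target, toy_drafter, k=4, max_new=12)
print(tokens == greedy([1, 2], 12), steps)
```

Because every emitted token is either confirmed or supplied by the target, the output matches plain greedy decoding exactly, which is the sense in which such schemes are called lossless. The speedup comes from needing fewer target steps than generated tokens, so it grows with how many draft tokens are accepted per step.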

This paper has not been read by Pith yet.

Forward citations

Cited by 15 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Draft Less, Retrieve More: Hybrid Tree Construction for Speculative Decoding

    cs.LG 2026-05 unverdicted novelty 7.0

    Graft combines pruning and retrieval in a sequential mechanism to build hybrid draft trees for speculative decoding, delivering up to 5.41× speedup and 21.8% better average speedup than EAGLE-3 on large models.

  2. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  3. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 conditional novelty 7.0

    DMax uses On-Policy Uniform Training and Soft Parallel Decoding to enable aggressive parallelism in dLLMs, raising TPF on GSM8K from 2.04 to 5.47 and on MBPP from 2.71 to 5.86 while preserving accuracy.

  4. FlexDraft: Flexible Speculative Decoding via Attention Tuning and Bonus-Guided Calibration

    cs.CL 2026-05 unverdicted novelty 6.0

    FlexDraft is a lossless speculative decoding framework that adapts to batch sizes via attention tuning on final layers, MLP-based bonus calibration, and dynamic parallel/sequential decoding.

  5. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive LLMs and diffusion models via shared KV cache and consensus to enable up to 7.8x parallel token generation speedup with O(1) memory overhead and lossless results.

  6. Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

    cs.LG 2026-05 unverdicted novelty 6.0

    Orthrus unifies autoregressive and diffusion views on a shared KV cache to deliver lossless parallel token generation with up to 7.8x speedup and O(1) memory overhead.

  7. Attention Drift: What Autoregressive Speculative Decoding Models Learn

    cs.LG 2026-05 unverdicted novelty 6.0

    Drafter models in speculative decoding suffer progressive attention drift caused by monotonically growing hidden-state magnitudes along the residual path; post-norm plus per-state RMSNorm reduces this drift and improv...

  8. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 6.0

    TTS adapts speculator models online via target model verifications to improve acceptance lengths by up to 72% over prior methods, with gains increasing for longer generations.

  9. PARD-2: Target-Aligned Parallel Draft Model for Dual-Mode Speculative Decoding

    cs.CL 2026-05 unverdicted novelty 6.0

    PARD-2 uses Confidence-Adaptive Token optimization to align draft model training with acceptance length in speculative decoding, enabling dual-mode operation and up to 6.94x lossless speedup on Llama3.1-8B.

  10. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance in speculative decoding but delivers only marginal end-to-end speedups due to drafter limitations.

  11. When Hidden States Drift: Can KV Caches Rescue Long-Range Speculative Decoding?

    cs.CL 2026-04 unverdicted novelty 6.0

    KV cache reuse improves long-range draft acceptance rates in speculative decoding but delivers only marginal end-to-end speedups because shallow drafters cannot accurately estimate target queries and receive sparse gr...

  12. Accelerating Speculative Decoding with Block Diffusion Draft Trees

    cs.CL 2026-04 unverdicted novelty 6.0

    DDTree builds a draft tree from a block diffusion drafter using a best-first heap on its output probabilities and verifies the tree in one target-model pass via an ancestor-only attention mask, increasing average acce...

  13. SMART: When is it Actually Worth Expanding a Speculative Tree?

    cs.DC 2026-04 unverdicted novelty 6.0

    SMART uses marginal benefit-cost analysis to dynamically build efficient speculative trees, achieving 15-20% additional speedup in LLM and MLLM inference.

  14. D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

    cs.LG 2026-05 unverdicted novelty 5.0

    D-PACE derives per-position weights from a surrogate of expected accepted draft length to shift training focus toward currently limiting positions, yielding measured gains in wall-clock speedup and emitted length acro...

  15. DMax: Aggressive Parallel Decoding for dLLMs

    cs.LG 2026-04 unverdicted novelty 5.0

    DMax enables faster parallel decoding in diffusion language models by using on-policy training to recover from errors and soft embedding interpolations for iterative revision, boosting tokens per forward pass roughly ...