arxiv: 2606.02091 · v2 · pith:JN5VWVZAnew · submitted 2026-06-01 · 💻 cs.CL

DFlare: Scaling Up Draft Capacity for Block Diffusion Speculative Decoding

Pith reviewed 2026-06-28 14:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords speculative decoding · block diffusion · draft model · layer-wise fusion · LLM inference · speedup · Qwen models

The pith

DFlare replaces DFlash's shared fusion with layer-wise combinations so each draft layer receives its own mix of target layers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to scale draft models for block diffusion speculative decoding by removing a shared bottleneck. Instead of all draft layers using one fused representation from a few target layers, each draft layer now learns its own combination from a broader set of target layers. This change adds expressiveness at low cost and allows training deeper draft models on more data. The result is higher wall-clock speedups on models like Qwen3-4B and GPT-OSS-20B across reasoning, code, and conversation tasks. A sympathetic reader would care because it directly improves the speed of running large language models without changing the target model itself.

Core claim

DFlare flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers. This injects richer target knowledge and gives every draft layer a distinct input, which enables scaling the draft model to deeper architectures with consistent gains when the training data is also increased to 2.4M samples.

What carries the argument

The lightweight layer-wise fusion mechanism that lets each draft layer attend to its own learnable combination of target layers.
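To make the contrast with a shared fused representation concrete, here is a minimal sketch in plain Python. It assumes the "learnable combination" is a softmax-weighted sum over target-layer hidden vectors; the paper summary does not confirm that parameterization, and the function names (`fuse`, `shared_fusion`, `layerwise_fusion`) are hypothetical. The heterogeneous KV projections mentioned in the Figure 2 caption are not modeled.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(target_states, logits):
    """Weighted sum of target-layer hidden vectors.

    target_states: one hidden vector (list of floats) per target layer.
    logits: one learnable scalar per target layer (ASSUMED parameterization).
    """
    weights = softmax(logits)
    dim = len(target_states[0])
    return [sum(w * h[d] for w, h in zip(weights, target_states))
            for d in range(dim)]

def shared_fusion(target_states, logits, num_draft_layers):
    """DFlash-style bottleneck as described in the text:
    every draft layer receives the same fused vector."""
    fused = fuse(target_states, logits)
    return [list(fused) for _ in range(num_draft_layers)]

def layerwise_fusion(target_states, logits_per_draft_layer):
    """Layer-wise variant: each draft layer has its own weights,
    so each one receives a distinct mix of target layers."""
    return [fuse(target_states, logits) for logits in logits_per_draft_layer]

# Toy example: 3 target layers with 2-dimensional hidden vectors.
target_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
shared = shared_fusion(target_states, [0.0, 0.0, 0.0], num_draft_layers=2)
per_layer = layerwise_fusion(
    target_states,
    [[2.0, 0.0, 0.0],   # draft layer 0 leans on target layer 0
     [0.0, 0.0, 2.0]],  # draft layer 1 leans on target layer 2
)
print(shared)
print(per_layer)
```

In this sketch the extra cost of the layer-wise variant is one weight vector per draft layer, which is in the spirit of the "lightweight" description, though the actual overhead in DFlare is not quantified in the text above.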

If this is right

  • Draft models can be scaled to greater depth while maintaining training stability.
  • Average wall-clock speedups reach 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B.
  • Speedups improve over DFlash by roughly 5-11% across the six tested benchmarks.
  • Target model knowledge is utilized more effectively across draft layers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the layer-wise approach generalizes, similar per-layer conditioning could improve other draft-based acceleration techniques.
  • Training larger draft models this way might reduce the need for very large target models in some inference scenarios.
  • Future work could test whether the fusion weights reveal which target layers matter most for different draft depths.

Load-bearing premise

The gains are driven by the per-layer expressiveness of the fusion rather than simply by training on more data or using deeper models.

What would settle it

A controlled experiment training DFlare without the layer-specific combinations and measuring whether speedups drop back to DFlash levels on the same benchmarks and models.

Figures

Figures reproduced from arXiv: 2606.02091 by Dawei Zhu, Eugene J. Yu, Guanghua Yu, Jianchen Zhu, Jiangshan Duo, Jiebin Zhang, Song Liu, Sujian Li, Weimin Xiong, Yifan Song, Zhenghan Yu, Zheng Li.

Figure 1: Left: wall-clock speedup (×) of different speculative decoding methods on Qwen3-8B across five benchmarks under greedy decoding; DFLARE consistently achieves the highest speedup, outperforming DFlash by a significant margin on every benchmark. Right: speedup and acceptance length of DFLARE on Qwen3-8B as the training data scales from 270k to 2.4M samples; DFLARE delivers consistent and substantial improvem…
Figure 2: The overview of our DFLARE method. DFLARE utilizes adaptive layer fusion of target hidden states and heterogeneous KV projections to enhance per-layer expressiveness. … patterns to deep semantic representations. To effectively inject this multi-granularity knowledge into the draft model, DFLARE introduces Adaptive Layer Fusion: a lightweight, layer-specific mechanism that allows each draft layer to learn i…
Figure 3: The impact of draft model layers (Left) and target model layers (Right) on the performance of DFlash and …
Original abstract

Block diffusion speculative decoding accelerates LLM inference by predicting all tokens within a block simultaneously for the target model to verify in parallel. Predicting an entire block at once requires a sufficiently capable draft model and effective utilization of the target model's internal knowledge. However, the state-of-the-art method DFlash constrains all draft layers to share a single fused representation derived from only a few target layers, limiting per-layer expressiveness and hindering further scaling of draft capacity. In this paper, we present DFlare, which flares out the narrow conditioning bottleneck of DFlash through a lightweight layer-wise fusion mechanism: each draft layer attends to its own learnable combination of a broad set of target layers at negligible overhead, simultaneously injecting richer target knowledge and providing every draft layer with a distinct input. This enhanced per-layer expressiveness enables scaling the draft model to deeper architectures with consistent gains. We further scale training data from 800K to 2.4M samples to fully exploit the enlarged capacity. On six benchmarks spanning mathematical reasoning, code generation, and conversation, DFlare attains average wall-clock speedups of 5.52x on Qwen3-4B, 5.46x on Qwen3-8B, and 3.91x on GPT-OSS-20B, improving over DFlash by roughly 11%, 8%, and 5% respectively. Our code is available at https://github.com/Tencent/AngelSlim.
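The verify-in-parallel step described in the abstract can be illustrated with a small sketch of greedy block verification, the common rule in speculative decoding under greedy sampling: keep the longest prefix of the drafted block that matches the target model's own greedy choices, then append the target's token at the first disagreement. This is an assumption about the general technique rather than DFlare's exact procedure, and `verify_block` is a hypothetical helper name.

```python
def verify_block(draft_tokens, target_tokens):
    """Greedy verification of one drafted block.

    draft_tokens:  the block proposed by the draft model.
    target_tokens: the target model's greedy token at each position,
                   computed in one parallel pass; target_tokens[i] is the
                   target's choice given the context plus draft_tokens[:i].
                   It must have len(draft_tokens) + 1 entries.
    Returns the tokens committed this step.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must be one longer than draft_tokens")

    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok != target_tokens[i]:
            break  # first disagreement ends the accepted prefix
        accepted.append(tok)

    # The target's own token at the first mismatch (or after a fully
    # accepted block) is always kept, so every step commits >= 1 token.
    accepted.append(target_tokens[len(accepted)])
    return accepted

# Toy token ids: the draft agrees with the target on the first two positions.
committed = verify_block([5, 6, 7, 8], [5, 6, 9, 1, 2])
print(committed, len(committed))
```

The number of tokens committed per step is what the Figure 1 caption calls acceptance length; a more capable draft raises it, which is how added draft capacity can turn into wall-clock speedup as long as drafting itself stays cheap.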

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces DFlare, an extension of block diffusion speculative decoding that replaces DFlash's shared fused representation with a lightweight layer-wise fusion mechanism. Each draft layer now attends to its own learnable combination of a broad set of target layers, enabling deeper draft architectures and richer per-layer conditioning. The authors also triple the training data (800K to 2.4M samples) and report average wall-clock speedups on six benchmarks of 5.52× on Qwen3-4B, 5.46× on Qwen3-8B, and 3.91× on GPT-OSS-20B, which are roughly 11%, 8%, and 5% higher than DFlash, respectively.

Significance. If the per-layer fusion is shown to be the primary driver rather than the data increase, the method would provide a low-overhead way to scale draft capacity in speculative decoding, which could meaningfully improve inference throughput for large models. The open code release is a positive factor for reproducibility.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the 11%/8%/5% gains over DFlash are presented as resulting from the layer-wise fusion, yet the manuscript simultaneously triples the training data (800K→2.4M). No ablation is described that holds data volume fixed while varying only the fusion mechanism (shared vs. layer-wise) or holds the architecture fixed while varying only data volume. This prevents confident attribution of the reported speedups to the claimed architectural change.
  2. [Abstract] Abstract: the speedups are given as point estimates with no error bars, standard deviations across runs, or statistical tests. Without these, it is impossible to assess whether the 5–11% margins over DFlash are reliable or could be explained by run-to-run variance.
  3. [Abstract] Abstract: the evaluation is limited to three target models and a single (unspecified) sampling temperature. No results are shown for additional temperatures, different target architectures, or verification that the gains survive changes in the target model's internal representations.
minor comments (1)
  1. [Abstract] The abstract states that the fusion operates “at negligible overhead,” but no concrete FLOPs or latency measurements for the fusion module itself are provided to support this claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. Below we respond point-by-point to the major concerns, committing to revisions that strengthen attribution and reporting while noting the scope of feasible additions.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the 11%/8%/5% gains over DFlash are presented as resulting from the layer-wise fusion, yet the manuscript simultaneously triples the training data (800K→2.4M). No ablation is described that holds data volume fixed while varying only the fusion mechanism (shared vs. layer-wise) or holds the architecture fixed while varying only data volume. This prevents confident attribution of the reported speedups to the claimed architectural change.

    Authors: We agree that an explicit ablation isolating the fusion mechanism from data scaling would improve attribution. The layer-wise fusion enables deeper drafts that benefit from additional data, but to clarify the architectural contribution we will add, in the revision, an ablation that trains both DFlash and DFlare on the original 800K samples and reports the resulting speedups. revision: yes

  2. Referee: [Abstract] Abstract: the speedups are given as point estimates with no error bars, standard deviations across runs, or statistical tests. Without these, it is impossible to assess whether the 5–11% margins over DFlash are reliable or could be explained by run-to-run variance.

    Authors: We acknowledge that variability metrics are needed to assess reliability. In the revised manuscript we will report standard deviations obtained from three independent training and evaluation runs for each model and include paired statistical tests comparing DFlare against DFlash. revision: yes

  3. Referee: [Abstract] Abstract: the evaluation is limited to three target models and a single (unspecified) sampling temperature. No results are shown for additional temperatures, different target architectures, or verification that the gains survive changes in the target model's internal representations.

    Authors: The evaluation uses three models spanning 4B–20B parameters and six benchmarks. We will add speedups at temperatures 0.5, 0.7 and 1.0 for all models in the revision. Extending to further architectures or internal-representation probes would require substantial new compute; we will note this scope limitation explicitly. revision: partial
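The variability reporting promised in response 2 could look like the sketch below. It assumes a paired t-test on per-run speedups, which is only one possible choice of "paired statistical test" since the rebuttal does not name one. The numbers are hypothetical placeholders chosen for easy arithmetic, not results from the paper, and the helper names are illustrative.

```python
import math
import statistics

def summarize(runs):
    """Mean and sample standard deviation of per-run speedups."""
    return statistics.mean(runs), statistics.stdev(runs)

def paired_t_statistic(a, b):
    """t statistic for paired samples a[i] vs b[i] (same seed or benchmark).

    Compare the result against a t distribution with len(a) - 1
    degrees of freedom to obtain a p-value.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))

# HYPOTHETICAL placeholder speedups from three runs each (not from the paper).
method_a_runs = [2.0, 3.0, 4.0]
method_b_runs = [1.0, 1.0, 1.0]

mean_a, std_a = summarize(method_a_runs)
t = paired_t_statistic(method_a_runs, method_b_runs)
print(f"mean={mean_a:.2f} std={std_a:.2f} t={t:.3f}")
```

With only three runs there are two degrees of freedom, so the two-sided 5% critical value is about 4.30; a 5% margin would need very low run-to-run variance to clear that bar, which is worth keeping in mind when reading the promised revision.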

Circularity Check

0 steps flagged

No significant circularity; empirical speedups measured on fixed benchmarks

Full rationale

The paper describes an architectural change (layer-wise fusion) plus data scaling, then reports measured wall-clock speedups on six benchmarks. No equations, fitted parameters, or self-citations are presented that reduce the claimed gains to a definition or self-referential construction. The central result is an external empirical measurement rather than a derivation that collapses into its inputs by construction. This is the normal case of a self-contained empirical claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the method is described as a lightweight architectural change plus data scaling.
